# Backup & Recovery

## Backup Architecture
```mermaid
graph LR
    PVC[PersistentVolumeClaims] -->|VolSync| Kopia[Kopia Repository]
    Kopia -->|S3 API| Garage[Garage S3]
    PG[PostgreSQL WAL] -->|barman-cloud| Garage
    MDB[MariaDB Galera] -->|mysqldump| Garage
    Git[Git Repository] -->|GitOps| State[Cluster State]
```
The cluster uses a layered backup strategy:
| Data Type | Backup Method | Destination | Schedule |
|---|---|---|---|
| Application config (PVCs) | VolSync + Kopia | Garage S3 | Daily at 2 AM |
| PostgreSQL databases | barman-cloud (WAL + base backups) | Garage S3 | Continuous WAL + scheduled |
| MariaDB databases | mysqldump (mariadb-operator Backup CR) | Garage S3 | Every 6 hours |
| Cluster state | Git repository | GitHub | On every push |
| Secrets | SOPS-encrypted in Git + 1Password | GitHub + 1Password | On every push |
## VolSync
VolSync replicates PersistentVolumeClaims to S3-compatible storage.
### Configuration Details
The VolSync component at `kubernetes/components/volsync/` provides reusable backup/restore templates:
| Setting | Value |
|---|---|
| Schedule | 0 2 * * * (daily at 2 AM UTC) |
| Compression | zstd-fastest |
| Copy method | Direct |
| Parallelism | 2 threads |
| Cache storage | openebs-hostpath (5Gi) |
| Mover user | UID/GID 1000 |
Retention policy:
- 24 hourly, 7 daily, 4 weekly, 6 monthly, 2 yearly
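As a sketch, that policy maps onto a `retain` block in the component's `ReplicationSource` template. The field names below follow the VolSync mover API and the exact nesting is an assumption; verify against the actual template:

```yaml
# Hypothetical excerpt of the ReplicationSource template in
# kubernetes/components/volsync/ -- nesting under the kopia mover is assumed.
spec:
  kopia:
    retain:
      hourly: 24
      daily: 7
      weekly: 4
      monthly: 6
      yearly: 2
```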
### Applying VolSync to an Application
Reference the VolSync component in your app's ks.yaml:
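A minimal sketch of what that reference might look like, assuming a Flux `Kustomization` with `spec.components` (paths, names, and variables are illustrative, not taken from the repo):

```yaml
# ks.yaml -- illustrative Flux Kustomization for an app using the VolSync component
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app # hypothetical app name
spec:
  components:
    - ../../../../components/volsync # relative path is an assumption
  postBuild:
    substitute:
      APP: my-app            # consumed by the VolSync templates as ${APP}
      VOLSYNC_CAPACITY: 5Gi  # hypothetical sizing variable
  # ...remaining Kustomization spec (path, sourceRef, interval, etc.)
```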
VolSync uses the `${APP}` variable (from Flux substitution) to name resources. Each app gets its own `ReplicationSource` and Kopia secret.
### Checking Backup Status
```sh
# List all backup sources and their last sync time
kubectl get replicationsource -A

# List all restore destinations
kubectl get replicationdestination -A

# Detailed status for a specific app
kubectl -n <namespace> describe replicationsource <app-name>
```
### Restoring from a VolSync Backup
!!! warning
    Restoring overwrites the existing PVC data. Ensure you understand the implications before proceeding.

1. Scale down the application to release the PVC:

2. Trigger the restore by annotating the `ReplicationDestination`:

3. Wait for the restore to complete:

4. Resume the application:

5. Verify the application is running with restored data:
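The steps above can be sketched with kubectl. Resource names are placeholders, and the trigger mechanics are an assumption: VolSync's documented manual trigger is the `spec.trigger.manual` field, shown here in place of an annotation; adapt to the actual `ReplicationDestination` spec:

```sh
# 1. Scale down the application to release the PVC
kubectl -n <namespace> scale deployment <app-name> --replicas=0

# 2. Trigger the restore with a unique manual trigger value
kubectl -n <namespace> patch replicationdestination <app-name> \
  --type merge -p '{"spec":{"trigger":{"manual":"restore-1"}}}'

# 3. Wait for the restore to complete
#    (re-run until lastManualSync matches the trigger value)
kubectl -n <namespace> get replicationdestination <app-name> \
  -o jsonpath='{.status.lastManualSync}'

# 4. Resume the application
kubectl -n <namespace> scale deployment <app-name> --replicas=1

# 5. Verify the application is running with restored data
kubectl -n <namespace> get pods
```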
### Mass Point-in-Time Restore
For disaster recovery scenarios where all VolSync-backed applications need to be restored simultaneously, use the automated mass restore script.

!!! warning
    This script restores all 16 VolSync-backed applications at once. Ensure you understand the implications before running it.
Supported applications (16 total):
| Namespace | Applications |
|---|---|
| media | autobrr, bazarr, plex, prowlarr, qbittorrent, radarr, recyclarr, seerr, sonarr, tautulli, thelounge, qui |
| network | unifi-toolkit |
| observability | gatus |
| utils | forgejo, penpot |
How it works:

- Suspends all Flux Kustomizations for the target apps
- Scales down workloads (handles both Deployments and StatefulSets)
- Patches each `ReplicationDestination` with a `restoreAsOf` timestamp and a manual trigger
- Waits for all restores to complete (20-minute timeout per app)
- Resumes Flux Kustomizations on success
Configuration: Edit the `RESTORE_TIME` variable at the top of the script to set the desired point-in-time (RFC3339 format, e.g., `2026-03-01T23:59:59Z`).
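Per app, the patch described above might look like the following. The field paths (`spec.kopia.restoreAsOf`, `spec.trigger.manual`) are assumed from the VolSync kopia mover API, and `plex` in `media` is just one of the listed apps used as an example:

```sh
# Hypothetical single-app equivalent of what the script does for each app
RESTORE_TIME="2026-03-01T23:59:59Z"
kubectl -n media patch replicationdestination plex --type merge -p "{
  \"spec\": {
    \"trigger\": {\"manual\": \"restore-${RESTORE_TIME}\"},
    \"kopia\": {\"restoreAsOf\": \"${RESTORE_TIME}\"}
  }
}"
```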
Notes:

- Handles multi-controller apps (e.g., penpot with separate frontend/backend deployments)
- Prevents `kubectl` hangs when pods are already absent
- Reports failures at the end with a summary of which apps failed
## PostgreSQL Backups
CloudNative-PG handles PostgreSQL backups independently via the barman-cloud plugin:

- Continuous WAL archiving to Garage S3 (enables point-in-time recovery)
- Scheduled base backups with configurable retention
- S3 bucket: `cnpg-garage`
- Recovery cluster definition at `kubernetes/apps/database/cloudnative-pg/recovery/`
### Triggering a Manual Backup
```sh
kubectl -n database create -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: manual-backup-$(date +%Y%m%d%H%M)
spec:
  cluster:
    name: postgres
  method: barmanObjectStore
EOF
```
### Checking Backup Status
```sh
# List all backups
kubectl -n database get backups

# List scheduled backups
kubectl -n database get scheduledbackups

# Check backup details
kubectl -n database describe backup <backup-name>

# Check WAL archiving status
kubectl -n database get cluster postgres -o jsonpath='{.status.firstRecoverabilityPoint}'
```
## MariaDB Backups
The mariadb-operator handles MariaDB Galera backups via its Backup custom resource:

- Scheduled mysqldump backups to Garage S3 every 6 hours
- Compression: bzip2
- Retention: 30 days
- S3 bucket: `mariadb-backups` (prefix `galera`)
- Backup definition at `kubernetes/apps/database/mariadb-operator/cluster/backup.yaml`
### Checking Backup Status
```sh
# List all MariaDB backups
kubectl -n database get backups.k8s.mariadb.com

# Check backup details
kubectl -n database describe backup mariadb-backup
```
### Restoring from a MariaDB Backup
To restore from S3, create a new MariaDB CR with `bootstrapFrom` referencing the backup:
```yaml
apiVersion: k8s.mariadb.com/v1alpha1
kind: MariaDB
metadata:
  name: mariadb-recovery
spec:
  bootstrapFrom:
    backupRef:
      name: mariadb-backup
  # ... same spec as production cluster
```
## Disaster Recovery

### Full Cluster Recovery
Since the cluster is GitOps-managed, a full recovery follows these steps:
1. **Prepare hardware** — Boot new nodes with Talos Linux (see Machine Preparation)

2. **Bootstrap the cluster:**

3. **Flux restores application state** — All manifests are pulled from Git automatically

4. **VolSync restores PVC data** — Application data is restored from Garage S3 backups

5. **PostgreSQL recovers from WAL archives** — Use the recovery cluster definition (see below)

6. **MariaDB recovers from S3 backups** — Create a recovery MariaDB CR with `bootstrapFrom` (see above)

7. **Verify all services:**
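The final verification step might amount to checks like these (generic commands, not taken from the repo):

```sh
# Nodes joined and Ready
kubectl get nodes

# Flux has reconciled all Kustomizations
flux get kustomizations -A

# Any pods that are not Running or Succeeded
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Backup machinery is back in sync
kubectl get replicationsource -A
```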
### PostgreSQL Point-in-Time Recovery
The recovery cluster definition at `kubernetes/apps/database/cloudnative-pg/recovery/cluster.yaml` bootstraps a new PostgreSQL cluster from S3 backups:
| Setting | Value |
|---|---|
| Instances | 2 (reduced for recovery, scale up after) |
| PostgreSQL | 17.7 (matches production) |
| Source | postgres-backup (Garage S3 via barman-cloud) |
| S3 bucket | cnpg-garage |
Recovery steps:

1. Apply the recovery cluster (modify target time if needed for PITR):

2. Monitor recovery progress:

3. Verify data integrity once the cluster is ready:

4. Promote the recovery cluster to production (update the main cluster definition to point to the recovered data, or rename the recovery cluster).
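Steps 1–3 can be sketched as follows. The manifest path comes from this page; the recovery cluster name and pod naming are placeholders (CloudNative-PG pods are typically named `<cluster>-<n>`):

```sh
# 1. Apply the recovery cluster definition
kubectl apply -f kubernetes/apps/database/cloudnative-pg/recovery/cluster.yaml

# 2. Monitor recovery progress
kubectl -n database get clusters.postgresql.cnpg.io
kubectl -n database logs -l cnpg.io/cluster=<recovery-cluster> -f

# 3. Spot-check data once the cluster reports ready
kubectl -n database exec -it <recovery-cluster>-1 -- psql -c '\l'
```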
!!! tip
    For point-in-time recovery, add a `recoveryTarget` to the recovery cluster spec:
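A sketch of that addition, using CloudNative-PG's `bootstrap.recovery` fields; the source name `postgres-backup` is taken from the table above, and the timestamp is illustrative:

```yaml
spec:
  bootstrap:
    recovery:
      source: postgres-backup
      recoveryTarget:
        targetTime: "2026-03-01T23:59:59Z" # RFC3339 point-in-time
```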
## What's Backed Up vs. Not
| Backed up | Not backed up (ephemeral) |
|---|---|
| Application PVCs (via VolSync) | Active VM state (VMs restart from disk images) |
| PostgreSQL databases (via WAL archiving) | In-memory caches (Dragonfly data) |
| MariaDB databases (via scheduled mysqldump) | Real-time metrics (Prometheus TSDB rebuilds) |
| Cluster manifests (in Git) | Pod logs (Victoria Logs rebuilds from Fluent Bit) |
| Secrets (SOPS in Git + 1Password) | |
## Garage S3
Garage provides the S3-compatible storage backend:

- Self-hosted within the cluster (`kubernetes/apps/volsync-system/garage/`)
- Stores both VolSync and PostgreSQL backups
- Lightweight and resource-efficient
- Compatible with standard S3 clients and tools