Backup & Recovery
This guide covers backup strategies, disaster recovery procedures, and restore workflows for STOA Platform.
What to Back Up
| Component | Data | Method | Frequency |
|---|---|---|---|
| PostgreSQL | APIs, subscriptions, consumers, tenants | pg_dump | Daily |
| Keycloak | Realm config, users, clients | Realm export | Weekly |
| Infisical/Vault | Platform secrets | Provider backup | Weekly |
| Helm values | Deployment configuration | Git (IaC) | Every change |
| CRDs | Tool, ToolSet, GatewayInstance, GatewayBinding | kubectl get -o yaml | Daily |
| Grafana | Dashboards, datasources | JSON export or Git | Every change |
What Does NOT Need Backup
- STOA Gateway state: In-memory, reconstructed from Control Plane on restart
- Prometheus metrics: Ephemeral by design (configure retention instead)
- Container images: Stored in GHCR, reproducible from source
PostgreSQL Backup
Automated Daily Backup
Create a CronJob for automated backups:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
  namespace: stoa-system
spec:
  schedule: "0 2 * * *" # Daily at 2 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h $PGHOST -U $PGUSER -d $PGDATABASE \
                    --format=custom \
                    --compress=9 \
                    -f /backups/stoa-$(date +%Y%m%d-%H%M%S).dump
              envFrom:
                - secretRef:
                    name: postgres-credentials
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: pg-backups-pvc
          restartPolicy: OnFailure
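To exercise the CronJob without waiting for the 2 AM schedule, a one-off Job can be created from it. The following is a minimal sketch, assuming the CronJob name `pg-backup` and namespace `stoa-system` from the manifest above; the helper name `trigger_backup_now` and the `pg-backup-manual-` prefix are illustrative, not part of STOA.

```shell
# Hypothetical helper: run the pg-backup CronJob once, immediately.
# Assumes the CronJob name and namespace from the manifest above.
trigger_backup_now() {
  job_name="pg-backup-manual-$(date +%Y%m%d-%H%M%S)"
  # kubectl's own output goes to stderr so stdout carries only the job name.
  kubectl create job "$job_name" \
    --from=cronjob/pg-backup \
    -n stoa-system >&2 &&
    echo "$job_name"
}
```

The printed job name can then be passed to `kubectl wait --for=condition=complete "job/<name>" -n stoa-system` to block until the dump has been written.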
Manual Backup
Configure your environment
The examples below use environment variables. Set them for your STOA instance:
export STOA_API_URL="https://api.gostoa.dev" # Replace with your domain
export STOA_AUTH_URL="https://auth.gostoa.dev" # Keycloak OIDC provider
export STOA_GATEWAY_URL="https://mcp.gostoa.dev" # MCP Gateway endpoint
Self-hosted? Replace gostoa.dev with your domain.
# Backup
BACKUP_FILE="stoa-backup-$(date +%Y%m%d).dump"
kubectl exec -n stoa-system deploy/postgres -- \
  pg_dump -U stoa -d stoa --format=custom --compress=9 \
  > "$BACKUP_FILE"
# Verify backup integrity (pg_restore reads one archive at a time, so avoid a glob)
pg_restore --list "$BACKUP_FILE" | head -20
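Because the manual backup redirects the output of `kubectl exec` into a local file, a failed command can still leave an empty or truncated file behind. The sketch below is a quick sanity check under that assumption: it confirms the file is non-empty and starts with the `PGDMP` magic string that custom-format archives begin with. `check_dump` is a hypothetical helper name, and the check complements `pg_restore --list` rather than replacing it.

```shell
# Quick sanity check for a custom-format dump: the file must be non-empty
# and begin with the "PGDMP" magic string written by pg_dump --format=custom.
# check_dump is a hypothetical helper; it does not replace pg_restore --list.
check_dump() {
  dump_file=$1
  if [ ! -s "$dump_file" ]; then
    echo "empty or missing: $dump_file" >&2
    return 1
  fi
  if [ "$(head -c 5 "$dump_file")" != "PGDMP" ]; then
    echo "not a custom-format dump: $dump_file" >&2
    return 1
  fi
  echo "ok: $dump_file"
}
```

For example, `check_dump stoa-backup-20260213.dump` returns a non-zero status if the file is missing, empty, or not a custom-format archive.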
Restore
# Restore from backup (replaces all data)
kubectl exec -i -n stoa-system deploy/postgres -- \
  pg_restore -U stoa -d stoa --clean --if-exists \
  < stoa-backup-20260213.dump
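Since the restore replaces all data, it can be worth capturing the current state first so the restore itself can be undone. The following is a sketch only: `restore_with_safety_dump` and the `pre-restore-` file prefix are hypothetical names, and the two `kubectl exec` calls reuse the commands shown above.

```shell
# Hypothetical wrapper: dump the current database before a destructive restore,
# so the restore itself can be rolled back. Reuses the commands shown above.
restore_with_safety_dump() {
  backup_file=$1
  safety_file="pre-restore-$(date +%Y%m%d-%H%M%S).dump"
  [ -s "$backup_file" ] || { echo "missing backup: $backup_file" >&2; return 1; }

  # Step 1: snapshot the current state; abort if this fails.
  kubectl exec -n stoa-system deploy/postgres -- \
    pg_dump -U stoa -d stoa --format=custom > "$safety_file" || return 1

  # Step 2: restore from the requested backup (replaces all data).
  kubectl exec -i -n stoa-system deploy/postgres -- \
    pg_restore -U stoa -d stoa --clean --if-exists < "$backup_file"
}
```

Calling `restore_with_safety_dump stoa-backup-20260213.dump` leaves a `pre-restore-*.dump` file in the working directory alongside the restored database.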
Keycloak Backup
Realm Export
# Export realm configuration
kubectl exec -n stoa-system deploy/keycloak -- \
  /opt/keycloak/bin/kc.sh export \
  --dir /tmp/export \
  --realm stoa \
  --users realm_file
# Copy export locally (copy from the pod that ran the export; keycloak-0 is shown here)
kubectl cp stoa-system/keycloak-0:/tmp/export/stoa-realm.json ./keycloak-backup.json
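Before storing the export, a coarse check that the copied file is non-empty and names the expected realm can catch a failed copy. This sketch relies on the top-level `"realm"` field that Keycloak realm exports contain; `check_realm_export` is a hypothetical helper, and the text match is deliberately simple rather than a full JSON validation.

```shell
# Hypothetical check: confirm the export is non-empty and names the expected realm.
# Relies on the "realm" field present in Keycloak realm export files.
check_realm_export() {
  export_file=$1
  realm=${2:-stoa}
  [ -s "$export_file" ] || { echo "empty or missing: $export_file" >&2; return 1; }
  grep -q "\"realm\"[[:space:]]*:[[:space:]]*\"$realm\"" "$export_file" || {
    echo "realm \"$realm\" not found in $export_file" >&2
    return 1
  }
  echo "ok: $export_file contains realm $realm"
}
```

For example, `check_realm_export ./keycloak-backup.json` checks for the `stoa` realm by default; a second argument selects a different realm name.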
Realm Import (Restore)
# Import realm from backup
kubectl cp ./keycloak-backup.json stoa-system/keycloak-0:/tmp/import/stoa-realm.json
kubectl exec -n stoa-system deploy/keycloak -- \
  /opt/keycloak/bin/kc.sh import \
  --dir /tmp/import \
  --override true
CRD Backup
# Export all STOA custom resources (the objects of each CRD kind, across all namespaces)
for crd in tools toolsets gatewayinstances gatewaybindings; do
  kubectl get "${crd}.gostoa.dev" -A -o yaml > "${crd}-backup.yaml"
done
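The loop above writes a file even when `kubectl get` fails, which can leave an empty snapshot in place of a good one. A minimal sketch of a stricter variant follows; it writes into a dated directory and stops at the first failed export. The function name `backup_custom_resources` and the `crd-backup-YYYYMMDD` directory name are assumptions for illustration.

```shell
# Hypothetical variant of the export loop: write into a dated directory and
# stop at the first failed export instead of leaving an empty file unnoticed.
backup_custom_resources() {
  out_dir=${1:-crd-backup-$(date +%Y%m%d)}
  mkdir -p "$out_dir" || return 1
  for crd in tools toolsets gatewayinstances gatewaybindings; do
    if ! kubectl get "${crd}.gostoa.dev" -A -o yaml > "$out_dir/${crd}-backup.yaml"; then
      echo "export failed for ${crd}.gostoa.dev" >&2
      return 1
    fi
  done
  # Print the directory so callers can commit or upload it.
  echo "$out_dir"
}
```

The printed directory can then be committed to Git, which matches the storage listed for CRD snapshots in the retention policy.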
Restore CRDs
kubectl apply -f tools-backup.yaml
kubectl apply -f toolsets-backup.yaml
kubectl apply -f gatewayinstances-backup.yaml
kubectl apply -f gatewaybindings-backup.yaml
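A server-side dry run can validate every snapshot before anything is persisted. The sketch below assumes the four file names used above and that the STOA CRD definitions are already installed in the target cluster (server-side dry run needs them); `apply_crd_backups` is a hypothetical helper.

```shell
# Hypothetical helper: validate all snapshots first, then apply them.
apply_crd_backups() {
  backup_dir=${1:-.}
  resources="tools toolsets gatewayinstances gatewaybindings"
  # Pass 1: server-side dry run, so nothing is persisted if any file is rejected.
  for crd in $resources; do
    file="$backup_dir/${crd}-backup.yaml"
    [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
    kubectl apply --dry-run=server -f "$file" > /dev/null || return 1
  done
  # Pass 2: apply for real, in the same order as the commands above.
  for crd in $resources; do
    kubectl apply -f "$backup_dir/${crd}-backup.yaml" || return 1
  done
}
```

Running `apply_crd_backups .` in the directory holding the snapshots applies nothing unless all four files pass the dry run.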
Disaster Recovery
Recovery Time Objectives
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Single pod failure | Immediate | 0 | K8s self-healing (replicas) |
| Node failure | 5 min | 0 | K8s rescheduling |
| Database corruption | 30 min | 24h | Restore from pg_dump |
| Full cluster loss | 2-4h | 24h | New cluster + restore all |
| Region outage | 4-8h | 24h | Failover to secondary region |
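The 24h RPO in the table only holds while a recent dump actually exists. A minimal sketch of a freshness check follows, assuming the `stoa-*.dump` naming used by the CronJob; `backup_within_rpo` is a hypothetical helper, and `-mmin` is supported by GNU, BSD, and BusyBox `find` but is not part of POSIX.

```shell
# Hypothetical check: succeed only if a dump newer than the RPO window exists.
# Assumes the stoa-*.dump naming used by the CronJob.
backup_within_rpo() {
  backup_dir=$1
  rpo_hours=${2:-24}
  # -mmin -N matches files modified less than N minutes ago.
  newest=$(find "$backup_dir" -type f -name 'stoa-*.dump' \
    -mmin "-$((rpo_hours * 60))" | head -n 1)
  if [ -z "$newest" ]; then
    echo "no backup newer than ${rpo_hours}h in $backup_dir" >&2
    return 1
  fi
  echo "fresh backup: $newest"
}
```

Run against the backup volume, for example `backup_within_rpo /backups 24`, a non-zero exit status is a natural trigger for an alert.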
Full Cluster Recovery Procedure
- Provision new cluster (Helm or Terraform)
- Restore PostgreSQL from latest backup
- Import Keycloak realm from export
- Apply CRDs from backup
- Install Helm chart with saved values
- Verify gateway health and re-sync APIs
- Update DNS to point to new cluster
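Because the steps above depend on each other, a scripted recovery should stop at the first failure rather than continue against a half-restored cluster. The sketch below shows one way to do that; `run_step` is a hypothetical helper, and the commands passed to it would be the restore commands from the earlier sections.

```shell
# Hypothetical runner: execute one recovery step and report which step failed.
# Chain calls with && so later steps are skipped after a failure.
run_step() {
  step_name=$1
  shift
  echo "==> $step_name"
  if ! "$@"; then
    echo "FAILED: $step_name" >&2
    return 1
  fi
}
```

A recovery script could then chain calls such as `run_step "Apply CRDs" kubectl apply -f tools-backup.yaml && run_step "Verify pods" kubectl get pods -n stoa-system`, so that a failed step prevents the next one from running.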
Verification Checklist
After any restore:
- kubectl get pods -n stoa-system shows all pods Running
- Control Plane API responds on /health
- Gateway responds on /health
- Keycloak login works
- API catalog shows correct data
- Subscriptions are intact
- Prometheus scraping resumes
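The first three checklist items can be scripted. This is a sketch under stated assumptions: `/health` is served at the base URLs held in `STOA_API_URL` and `STOA_GATEWAY_URL`, pods in the `Completed` state (for example finished backup Jobs) are treated as healthy, and all three function names are hypothetical. The remaining items still need a manual check.

```shell
# Count pods that are neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin (STATUS is column 3).
# Treating Completed as healthy is an assumption, to allow finished Job pods.
count_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Probe a /health endpoint; -f makes curl fail on HTTP error responses.
check_health() {
  curl -fsS --max-time 10 "$1/health" > /dev/null
}

# Hypothetical wrapper for the scriptable part of the checklist.
verify_restore() {
  unhealthy=$(kubectl get pods -n stoa-system --no-headers | count_unhealthy_pods)
  [ "$unhealthy" -eq 0 ] || { echo "$unhealthy pod(s) not Running" >&2; return 1; }
  check_health "$STOA_API_URL" || { echo "Control Plane API unhealthy" >&2; return 1; }
  check_health "$STOA_GATEWAY_URL" || { echo "Gateway unhealthy" >&2; return 1; }
  echo "basic checks passed"
}
```

With both URL variables exported, `verify_restore` exits non-zero on the first failed check and names it on stderr.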
Retention Policy
| Backup Type | Retention | Storage |
|---|---|---|
| Daily PostgreSQL | 30 days | PVC or object storage |
| Weekly Keycloak export | 90 days | Git or object storage |
| CRD snapshots | 30 days | Git |
| Helm values | Indefinite | Git (IaC) |
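The backup CronJob shown earlier writes a new timestamped dump on every run and never deletes old ones, so the 30-day PostgreSQL retention has to be enforced separately when dumps live on a PVC. A minimal sketch, assuming the `stoa-*.dump` naming from that CronJob; `prune_old_backups` is a hypothetical helper.

```shell
# Hypothetical helper: delete dumps older than the retention window.
# -mtime +N matches files whose age in whole days exceeds N.
prune_old_backups() {
  backup_dir=$1
  retention_days=${2:-30}
  find "$backup_dir" -type f -name 'stoa-*.dump' \
    -mtime "+$retention_days" -exec rm -f {} +
}
```

For example, `prune_old_backups /backups 30` could run as a final step after each dump; files that do not match `stoa-*.dump` are left untouched.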
Related
- Installation Guide -- Helm chart deployment
- Upgrade Guide -- Version upgrades
- Configuration Reference -- Environment variables