Backup & Recovery

This guide covers backup strategies, disaster recovery procedures, and restore workflows for STOA Platform.

What to Back Up

| Component | Data | Method | Frequency |
| --- | --- | --- | --- |
| PostgreSQL | APIs, subscriptions, consumers, tenants | pg_dump | Daily |
| Keycloak | Realm config, users, clients | Realm export | Weekly |
| Infisical/Vault | Platform secrets | Provider backup | Weekly |
| Helm values | Deployment configuration | Git (IaC) | Every change |
| CRDs | Tool, ToolSet, GatewayInstance, GatewayBinding | kubectl get -o yaml | Daily |
| Grafana | Dashboards, datasources | JSON export or Git | Every change |

What Does NOT Need Backup

  • STOA Gateway state: In-memory, reconstructed from Control Plane on restart
  • Prometheus metrics: Ephemeral by design (configure retention instead)
  • Container images: Stored in GHCR, reproducible from source

PostgreSQL Backup

Automated Daily Backup

Create a CronJob for automated backups:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
  namespace: stoa-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: postgres:15-alpine
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h "$PGHOST" -U "$PGUSER" -d "$PGDATABASE" \
                    --format=custom \
                    --compress=9 \
                    -f "/backups/stoa-$(date +%Y%m%d-%H%M%S).dump"
              envFrom:
                - secretRef:
                    name: postgres-credentials
              volumeMounts:
                - name: backup-storage
                  mountPath: /backups
          volumes:
            - name: backup-storage
              persistentVolumeClaim:
                claimName: pg-backups-pvc
          restartPolicy: OnFailure

Manual Backup

Configure your environment

The examples below use environment variables. Set them for your STOA instance:

export STOA_API_URL="https://api.gostoa.dev"       # Replace with your domain
export STOA_AUTH_URL="https://auth.gostoa.dev" # Keycloak OIDC provider
export STOA_GATEWAY_URL="https://mcp.gostoa.dev" # MCP Gateway endpoint

Self-hosted? Replace gostoa.dev with your domain.

# Backup
BACKUP_FILE="stoa-backup-$(date +%Y%m%d).dump"
kubectl exec -n stoa-system deploy/postgres -- \
  pg_dump -U stoa -d stoa --format=custom --compress=9 \
  > "$BACKUP_FILE"

# Verify backup integrity (pg_restore accepts a single file, so name it rather than using a glob)
pg_restore --list "$BACKUP_FILE" | head -20
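The integrity check above needs a PostgreSQL client on the machine that holds the dump. Where none is installed, a cheaper first check is to confirm that the file is non-empty and starts with the `PGDMP` magic bytes that custom-format archives carry. This is a minimal sketch: `is_pg_custom_dump` is a hypothetical helper name, and passing it does not prove the archive can be restored.

```shell
# Hypothetical helper: quick sanity check for a pg_dump custom-format archive.
is_pg_custom_dump() {
  file="$1"
  # Reject missing or empty files, e.g. when kubectl exec failed but the
  # shell redirect still created the output file.
  [ -s "$file" ] || return 1
  # Custom-format archives begin with the 5-byte magic string "PGDMP".
  magic=$(head -c 5 "$file" 2>/dev/null)
  [ "$magic" = "PGDMP" ]
}
```

For example, `is_pg_custom_dump stoa-backup-20260213.dump || echo "backup looks invalid"` could run right after the dump is written.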

Restore

# Restore from backup (replaces all data)
kubectl exec -i -n stoa-system deploy/postgres -- \
  pg_restore -U stoa -d stoa --clean --if-exists \
  < stoa-backup-20260213.dump

Keycloak Backup

Realm Export

# Export realm configuration
kubectl exec -n stoa-system deploy/keycloak -- \
  /opt/keycloak/bin/kc.sh export \
  --dir /tmp/export \
  --realm stoa \
  --users realm_file

# Copy export locally. kubectl cp needs a pod name, so use the pod that
# ran the export (keycloak-0 here).
kubectl cp stoa-system/keycloak-0:/tmp/export/stoa-realm.json ./keycloak-backup.json

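Realm Import (Restore)

# Import realm from backup
# Make sure the target directory exists before copying into it
kubectl exec -n stoa-system keycloak-0 -- mkdir -p /tmp/import
kubectl cp ./keycloak-backup.json stoa-system/keycloak-0:/tmp/import/stoa-realm.json

# Run the import in the same pod that received the file
kubectl exec -n stoa-system deploy/keycloak -- \
  /opt/keycloak/bin/kc.sh import \
  --dir /tmp/import \
  --override true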

CRD Backup

# Export all STOA CRDs
for crd in tools toolsets gatewayinstances gatewaybindings; do
  kubectl get "${crd}.gostoa.dev" -A -o yaml > "${crd}-backup.yaml"
done

Restore CRDs

kubectl apply -f tools-backup.yaml
kubectl apply -f toolsets-backup.yaml
kubectl apply -f gatewayinstances-backup.yaml
kubectl apply -f gatewaybindings-backup.yaml

Disaster Recovery

Recovery Time Objectives

| Scenario | RTO | RPO | Procedure |
| --- | --- | --- | --- |
| Single pod failure | Immediate | 0 | K8s self-healing (replicas) |
| Node failure | 5 min | 0 | K8s rescheduling |
| Database corruption | 30 min | 24 h | Restore from pg_dump |
| Full cluster loss | 2-4 h | 24 h | New cluster + restore all |
| Region outage | 4-8 h | 24 h | Failover to secondary region |
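
The 24-hour RPO in this table holds only if the daily backup job actually produced a dump. One way to monitor that assumption is a freshness check on the backup directory. The sketch below is illustrative: `backup_within_rpo` is a hypothetical helper, and it assumes dumps are files ending in `.dump` in a single directory and that `find` supports `-maxdepth` and `-mmin` (GNU, BSD, and BusyBox builds generally do).

```shell
# Hypothetical helper: succeed if at least one dump in DIR is newer than
# MAX_HOURS, i.e. the most recent backup still satisfies the RPO.
backup_within_rpo() {
  dir="$1"
  max_hours="$2"
  # -mmin -N matches files modified less than N minutes ago.
  recent=$(find "$dir" -maxdepth 1 -type f -name '*.dump' \
    -mmin "-$((max_hours * 60))" | head -n 1)
  [ -n "$recent" ]
}
```

A call such as `backup_within_rpo /backups 24 || echo "no backup in the last 24 hours"` could feed an alert.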

Full Cluster Recovery Procedure

  1. Provision new cluster (Helm or Terraform)
  2. Restore PostgreSQL from latest backup
  3. Import Keycloak realm from export
  4. Apply CRDs from backup
  5. Install Helm chart with saved values
  6. Verify gateway health and re-sync APIs
  7. Update DNS to point to new cluster
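
Step 2 depends on picking the most recent dump. Because the CronJob names files `stoa-YYYYMMDD-HHMMSS.dump`, sorting by name is the same as sorting by time. The following is a sketch under that naming assumption; `latest_backup` is a hypothetical helper, and it deliberately ignores manually named files such as `stoa-backup-20260213.dump`.

```shell
# Hypothetical helper: print the newest CronJob-style dump in DIR.
# Returns non-zero if no matching file exists.
latest_backup() {
  dir="$1"
  latest=""
  # Glob results are sorted, and the timestamp format sorts chronologically,
  # so the last match is the newest backup.
  for f in "$dir"/stoa-[0-9]*.dump; do
    [ -e "$f" ] || continue   # pattern matched nothing
    latest="$f"
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}
```

The result can be fed to the restore command shown earlier, for instance by redirecting `"$(latest_backup /backups)"` into `pg_restore`.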

Verification Checklist

After any restore:

  • kubectl get pods -n stoa-system: all pods Running
  • Control Plane API responds on /health
  • Gateway responds on /health
  • Keycloak login works
  • API catalog shows correct data
  • Subscriptions are intact
  • Prometheus scraping resumes
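
The first checklist item can be automated by parsing `kubectl get pods --no-headers` output. This is a rough sketch rather than a complete health check: `all_pods_running` is a hypothetical helper, and treating `Completed` pods (such as finished backup jobs) as healthy is an assumption that goes slightly beyond the checklist wording.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS ...) and fail if any pod is unhealthy.
all_pods_running() {
  awk '
    {
      split($2, ready, "/")
      # Healthy means Completed, or Running with every container ready.
      ok = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
      if (!ok) { bad++; print "not ready: " $1 " (" $3 ", " $2 ")" }
    }
    # An empty pod list after a restore is also treated as a failure.
    END { exit (bad > 0 || NR == 0) }
  '
}
```

Usage: `kubectl get pods -n stoa-system --no-headers | all_pods_running`.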

Retention Policy

| Backup Type | Retention | Storage |
| --- | --- | --- |
| Daily PostgreSQL | 30 days | PVC or object storage |
| Weekly Keycloak export | 90 days | Git or object storage |
| CRD snapshots | 30 days | Git |
| Helm values | Indefinite | Git (IaC) |
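
For dumps kept on a PVC, the 30-day retention has to be enforced by something, since `pg_dump` never deletes old files. A minimal sketch of a pruning step follows. `prune_backups` is a hypothetical helper, the `stoa-*.dump` pattern assumes the file names used earlier in this guide, and object storage would normally use lifecycle rules instead.

```shell
# Hypothetical helper: delete dumps in DIR that are older than DAYS.
prune_backups() {
  dir="$1"
  days="$2"
  # -mtime +N matches files modified more than N whole days ago; age is
  # truncated to full days, so a file must be at least N+1 days old.
  find "$dir" -maxdepth 1 -type f -name 'stoa-*.dump' -mtime "+$days" \
    -exec rm -f {} +
}
```

Running `prune_backups /backups 30` after each successful dump would keep the volume within the policy above.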