Troubleshooting
Common issues encountered when running STOA, with diagnosis steps and solutions.
Installation & Deployment
Pods Not Starting
Symptom: Pods stuck in CrashLoopBackOff or Pending.
Diagnosis:
kubectl describe pod <pod-name> -n stoa-system
kubectl logs <pod-name> -n stoa-system --previous
Common causes:
| Cause | Fix |
|---|---|
| Missing secrets | Create required secrets: kubectl create secret generic stoa-db-credentials ... |
| Insufficient resources | Increase node resources or reduce replica count |
| Image pull failure | Verify image exists and pull secret is configured |
| CRDs not applied | Run kubectl apply -f charts/stoa-platform/crds/ |
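To find which pods need the describe/logs treatment, a small filter over the pod listing can help. This is a minimal sketch: the function name `list_unhealthy_pods` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME READY STATUS RESTARTS AGE).

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the
# name and status of every pod that is not Running or Completed.
# Assumes the default column layout: NAME READY STATUS RESTARTS AGE.
list_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n stoa-system --no-headers | list_unhealthy_pods
```

Each name it prints can then be passed to the kubectl describe and kubectl logs commands from the diagnosis step.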
Helm Install Fails
Symptom: helm upgrade --install returns an error.
# Check Helm release status
helm status stoa-platform -n stoa-system
# View rendered templates
helm template stoa-platform ./charts/stoa-platform -f values.yaml
# Lint the chart
helm lint charts/stoa-platform
Database
Connection Refused
Symptom: "connection refused" or "could not connect to server" in API logs.
Checklist:
- Verify database is running: kubectl get pods -l app=postgresql
- Check connection string: DATABASE_URL must include correct host, port, credentials
- Check network policies: ensure stoa-system namespace can reach the database
- Check security groups (AWS): RDS must allow inbound from EKS nodes
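The connection string check can be done mechanically before looking at the network. The sketch below assumes DATABASE_URL has the form scheme://user:password@host:port/dbname. That format is an assumption, not something STOA is confirmed to require, and `check_database_url` is a hypothetical helper name.

```shell
# Checks that a connection URL contains credentials, a host and a numeric port.
# Assumed format: scheme://user:password@host:port/dbname
# Prints a one-line verdict; never prints the password.
check_database_url() (
  url=$1
  rest=${url#*://}                      # drop the scheme
  [ "$rest" != "$url" ] || { echo "missing scheme"; exit 1; }
  case $rest in
    *@*) ;;
    *) echo "missing credentials"; exit 1 ;;
  esac
  creds=${rest%@*}                      # everything before the last '@'
  hostport=${rest##*@}                  # everything after the last '@'
  hostport=${hostport%%/*}              # drop the /dbname suffix
  case $creds in
    ?*:?*) ;;
    *) echo "missing user or password"; exit 1 ;;
  esac
  case $hostport in
    ?*:*) ;;
    *) echo "missing host or port"; exit 1 ;;
  esac
  host=${hostport%:*}
  port=${hostport##*:}
  case $port in
    ''|*[!0-9]*) echo "invalid port"; exit 1 ;;
  esac
  echo "ok host=$host port=$port user=${creds%%:*}"
)

# Usage:
#   check_database_url "$DATABASE_URL"
```

A passing check only confirms the URL is well formed. Whether the host is reachable is a matter for the network policy and security group items.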
Migration Failures
Symptom: API starts but returns 500 errors on data operations.
cd control-plane-api
alembic upgrade head
alembic current # Verify migration state
If migrations are stuck:
alembic stamp head # Marks the database as being at head WITHOUT running any migrations (use cautiously)
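Before stamping, it is worth confirming whether the database is already at the latest revision. When it is, alembic current appends "(head)" to the revision it prints. A minimal sketch, with `at_head` as a hypothetical helper name:

```shell
# Reads `alembic current` output on stdin; succeeds only if a revision
# line is marked "(head)", meaning no migrations are pending.
at_head() {
  grep -F -q '(head)'
}

# Usage (run from control-plane-api):
#   alembic current 2>/dev/null | at_head || echo "migrations pending: run alembic upgrade head"
```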
Authentication
The examples below use environment variables. Set them for your STOA instance:
export STOA_API_URL="https://api.gostoa.dev" # Replace with your domain
export STOA_AUTH_URL="https://auth.gostoa.dev" # Keycloak OIDC provider
export STOA_GATEWAY_URL="https://mcp.gostoa.dev" # MCP Gateway endpoint
export STOA_CONSOLE_URL="https://console.gostoa.dev"
Self-hosted? Replace gostoa.dev with your domain.
401 Unauthorized on All Requests
Checklist:
- Token expired? Re-authenticate: stoa login --server ...
- Keycloak reachable? curl ${STOA_AUTH_URL}/health
- Realm correct? Must be stoa
- Client ID correct? control-plane-ui for Console, stoa-portal for Portal
- Clock skew? Ensure server and client clocks are synchronized
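Token expiry can be checked locally by decoding the JWT payload and comparing its exp claim with the current time. The sketch below assumes the access token is a standard three-segment JWT and that `base64 -d` is available (GNU coreutils and recent macOS). The function names are hypothetical.

```shell
# Prints the decoded payload (second segment) of a JWT.
jwt_payload() (
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
)

# Succeeds if the token's exp claim is at or before "now".
# A token with no exp claim is treated as expired.
# $2 optionally overrides the current time (epoch seconds).
token_expired() (
  exp=$(jwt_payload "$1" |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  now=${2:-$(date +%s)}
  [ -z "$exp" ] || [ "$exp" -le "$now" ]
)

# Usage:
#   token_expired "$TOKEN" && echo "token expired: re-authenticate"
```

If the token is still valid by the client's clock but the server rejects it, compare `date +%s` on both machines to rule out clock skew.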
CORS Errors During Login
Symptom: Browser console shows CORS errors when redirecting to/from Keycloak.
Fix: In Keycloak admin:
- Open the client configuration (control-plane-ui or stoa-portal)
- Add the application URL to Web Origins (e.g., ${STOA_CONSOLE_URL})
- Add to Valid Redirect URIs (e.g., ${STOA_CONSOLE_URL}/*)
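To confirm the Web Origins change took effect without a browser, the response headers can be inspected for Access-Control-Allow-Origin. This is a sketch under assumptions: `origin_allowed` is a hypothetical helper, and `FAILING_URL` stands for whichever Keycloak URL the browser console names in the CORS error.

```shell
# Reads HTTP response headers on stdin; succeeds if
# Access-Control-Allow-Origin equals the given origin or is "*".
origin_allowed() (
  origin=$1
  value=$(tr -d '\r' | awk '{
    i = index($0, ":")
    if (i && tolower(substr($0, 1, i - 1)) == "access-control-allow-origin") {
      v = substr($0, i + 1)
      sub(/^[ \t]+/, "", v)   # strip the space after the colon
      print v
      exit
    }
  }')
  [ "$value" = "$origin" ] || [ "$value" = "*" ]
)

# Usage (FAILING_URL is a placeholder for the URL shown in the browser error):
#   curl -s -o /dev/null -D - -H "Origin: ${STOA_CONSOLE_URL}" "$FAILING_URL" |
#     origin_allowed "${STOA_CONSOLE_URL}" || echo "origin not allowed"
```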
Token Missing Scopes
Symptom: 403 Forbidden despite being authenticated.
Fix: Ensure the user belongs to the correct Keycloak group:
| Role | Group | Scopes |
|---|---|---|
| Platform admin | cpi-admin | stoa:admin, stoa:write, stoa:read |
| Tenant admin | tenant-admin | stoa:write, stoa:read |
| DevOps | devops | stoa:write, stoa:read |
| Viewer | viewer | stoa:read |
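When working out which group a user needs, the table can be encoded as a small lookup. This is an illustrative sketch only: the function names are hypothetical, and the mapping is copied from the table rather than read from Keycloak.

```shell
# Prints the scopes granted to a Keycloak group, per the role table.
scopes_for_group() {
  case $1 in
    cpi-admin)           echo "stoa:admin stoa:write stoa:read" ;;
    tenant-admin|devops) echo "stoa:write stoa:read" ;;
    viewer)              echo "stoa:read" ;;
    *)                   return 1 ;;
  esac
}

# Succeeds if the given group grants the given scope.
group_has_scope() {
  case " $(scopes_for_group "$1") " in
    *" $2 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Usage:
#   group_has_scope devops stoa:admin || echo "devops cannot perform admin operations"
```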
MCP Gateway
Tools Not Appearing
Symptom: tools/list returns an empty array.
Checklist:
- CRDs applied? kubectl get crds | grep gostoa.dev
- Tools created? kubectl get tools -A
- Watcher enabled? Set K8S_WATCHER_ENABLED=true
- Correct namespace? Tools must be in tenant-{name} namespaces
- Gateway restarted after CRD changes? (if watcher is disabled)
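The namespace rule is easy to miss when tools are created by hand. A minimal sketch that flags tools outside tenant-{name} namespaces, assuming the all-namespaces listing prints NAMESPACE and NAME as its first two columns (the kubectl default); `misplaced_tools` is a hypothetical name:

```shell
# Reads `kubectl get tools -A --no-headers` output on stdin and prints
# namespace/name for every tool that is NOT in a tenant-<name> namespace.
misplaced_tools() {
  awk '$1 !~ /^tenant-./ { print $1 "/" $2 }'
}

# Usage (requires cluster access):
#   kubectl get tools -A --no-headers | misplaced_tools
```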
OPA Policy Denying All Requests
Symptom: All tools/call requests return 403.
# Check OPA is enabled and configured
kubectl logs -l app=mcp-gateway -n stoa-system | grep opa
# Verify policy files are mounted
kubectl exec -it <mcp-gateway-pod> -n stoa-system -- ls /policies/
Verify that your Rego policy contains a rule that evaluates allow to true for your use case.
Kafka Metering Errors
Symptom: MCP Gateway logs show Kafka connection errors.
Checklist:
- Kafka/Redpanda running? kubectl get pods -l app=redpanda
- Bootstrap servers correct? Check KAFKA_BOOTSTRAP_SERVERS
- Topic exists? kubectl exec redpanda-0 -- rpk topic list
- Network policies allow traffic? Check cross-namespace rules
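A malformed KAFKA_BOOTSTRAP_SERVERS value produces the same connection errors as an unreachable broker, so it helps to validate the format first. The sketch below assumes the usual comma-separated host:port list. `parse_bootstrap_servers` is a hypothetical helper, and the reachability probe in the usage comment assumes `nc` is installed.

```shell
# Prints one "host port" pair per line for a comma-separated bootstrap list.
# Fails on the first entry that is not host:port with a numeric port.
parse_bootstrap_servers() (
  list=$(printf '%s' "$1" | tr -d ' ')   # tolerate spaces after commas
  [ -n "$list" ] || { echo "empty bootstrap list" >&2; exit 1; }
  IFS=,
  set -f                                  # no globbing while splitting
  for entry in $list; do
    case $entry in
      ?*:*) ;;
      *) echo "missing host or port: $entry" >&2; exit 1 ;;
    esac
    host=${entry%:*}
    port=${entry##*:}
    case $port in
      ''|*[!0-9]*) echo "invalid port: $entry" >&2; exit 1 ;;
    esac
    echo "$host $port"
  done
)

# Usage (run from a pod that should be able to reach the brokers):
#   parse_bootstrap_servers "$KAFKA_BOOTSTRAP_SERVERS" |
#     while read -r host port; do
#       nc -z -w 3 "$host" "$port" || echo "unreachable: $host:$port"
#     done
```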
Portal
APIs Not Showing in Catalog
Symptom: Portal loads but catalog is empty.
Checklist:
- APIs published? Check in Console UI or via API
- Portal visibility enabled? API must have portal.visible: true
- API endpoint returns data? curl ${STOA_API_URL}/v1/portal/apis
- Auth working? Portal needs valid token to fetch catalog
Search Returns 500
Symptom: Searching in Portal returns HTTP 500.
This was a known issue (CAB-1044) caused by unescaped LIKE wildcards. Ensure you're running a version with the fix (commit 2c5672d8 or later).
Gateway Sync
API Not Synced to Gateway
Symptom: API shows as "Published" in Console but not reachable on the gateway.
Checklist:
- ArgoCD sync working? Check ArgoCD application status
- Gateway adapter healthy? curl ${STOA_API_URL}/v1/gateways and check for status: online
- Deployment sync status? Check sync_status in Console deployment view
- Gateway credentials valid? Verify gateway instance auth_config in the Control Plane
Drift Detected
Symptom: Console shows "Drift" status for an API.
Drift means the gateway state differs from the desired state in the Control Plane. The reconciliation engine detects this automatically. To trigger a manual re-sync:
# Via Control Plane API
curl -X POST ${STOA_API_URL}/v1/deployments/{deployment_id}/sync \
-H "Authorization: Bearer $TOKEN"
# Check deployment sync status
curl ${STOA_API_URL}/v1/deployments/{deployment_id} \
-H "Authorization: Bearer $TOKEN"
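After triggering a re-sync, the status call usually needs repeating until the deployment settles. A generic retry helper keeps that loop short. This is a sketch: `retry_until` and `deployment_synced` are hypothetical names, and the assumption that the response carries "sync_status": "synced" is not confirmed by this guide, so substitute whatever value your instance reports.

```shell
# Runs a check command up to $1 times, sleeping $2 seconds between attempts.
# Succeeds as soon as the command succeeds; fails if every attempt fails.
retry_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)

# Hypothetical check: succeeds when the deployment JSON contains
# "sync_status": "synced". The value "synced" is an assumption.
deployment_synced() {
  curl -fsS "${STOA_API_URL}/v1/deployments/$1" \
    -H "Authorization: Bearer $TOKEN" |
    grep -q '"sync_status"[[:space:]]*:[[:space:]]*"synced"'
}

# Usage: poll every 5 seconds, up to 10 times.
#   retry_until 10 5 deployment_synced "$DEPLOYMENT_ID" || echo "still drifted"
```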
Vault
Vault Sealed
Symptom: Application returns errors related to Vault, and logs show VaultSealedException.
Fix: Unseal Vault using the unseal keys:
vault operator unseal <key1>
vault operator unseal <key2>
vault operator unseal <key3>
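The three unseal commands match Vault's default key threshold of 3. A deployment initialized with a different threshold needs that many keys instead. To confirm the result, vault status reports the seal state through its exit code: 0 when unsealed, 2 when sealed, 1 on error. A minimal sketch, with `describe_vault_status` as a hypothetical helper name:

```shell
# Maps the exit code of `vault status` to a readable seal state.
# 0 = unsealed, 2 = sealed, anything else = error (e.g. Vault unreachable).
describe_vault_status() {
  case $1 in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Usage (requires VAULT_ADDR to point at your Vault):
#   vault status >/dev/null 2>&1; describe_vault_status $?
```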
This was addressed in CAB-1042 with automatic sealed detection (_ensure_unsealed()).
Performance
High API Latency
- Check database query times: Grafana → Control Plane API dashboard
- Check Kafka consumer lag: kubectl exec redpanda-0 -- rpk group describe stoa-events
- Check pod resource usage: kubectl top pods -n stoa-system
- Review slow query logs: Loki → {app="control-plane-api"} | json | duration > 1s
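The pod resource check is easier to read when filtered to the pods that exceed a threshold. A minimal sketch, assuming kubectl top pods reports memory in Mi in its third column. `pods_over_memory` is a hypothetical name, and 512 in the usage comment is an arbitrary example value.

```shell
# Reads `kubectl top pods --no-headers` output (NAME CPU MEMORY) on stdin
# and prints pods whose memory usage exceeds the given limit in Mi.
# Rows whose memory is not expressed in Mi are skipped.
pods_over_memory() {
  awk -v limit="$1" '$3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}

# Usage (requires metrics-server in the cluster):
#   kubectl top pods -n stoa-system --no-headers | pods_over_memory 512
```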
High Memory Usage
- Check for connection pool leaks: Monitor DB_POOL_SIZE usage in Prometheus
- Review Kafka consumer configuration: Reduce batch sizes if needed
- Check for large API specifications being loaded into memory