Week 1 with STOA: Operations Runbook
You've deployed STOA and your first API is live. Now what?
This runbook covers the practical operations tasks for your first 7 days: the things that aren't in the "quick start" guide but that you'll actually need to do. Think of it as the manual your future self wishes you had read.
Who this is for: Freelancers, indie hackers, and small teams running STOA in production for the first time.
Day 1: Verify Your Deployment
Check All Services Are Healthy
# If running Docker Compose
docker compose ps
# All services should show "Up" and "healthy"
# Expected output:
# control-plane-api Up 0.0.0.0:8000->8000/tcp healthy
# control-plane-ui Up 0.0.0.0:3000->3000/tcp healthy
# stoa-gateway Up 0.0.0.0:3001->3001/tcp healthy
# developer-portal Up 0.0.0.0:3002->3002/tcp healthy
# postgres Up 0.0.0.0:5432->5432/tcp healthy
# keycloak Up 0.0.0.0:8080->8080/tcp healthy
Verify the Gateway Responds
curl -s ${STOA_GATEWAY_URL}/health
# Expected: {"status":"ok","version":"0.1.0","uptime_seconds":3600}
Confirm Your API Is Reachable
# Replace with your API key and path
curl -s ${STOA_GATEWAY_URL}/your-api/endpoint \
-H "X-API-Key: your-api-key" \
-w "\nHTTP Status: %{http_code}\nLatency: %{time_total}s\n"
If you see HTTP Status: 200 and a reasonable latency (under 500ms for local backends), you're good.
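If you want to script that pass/fail decision rather than eyeball it, a minimal sketch could look like the following. The function name check_response and its output strings are made up for illustration; the 0.5s default is the local-backend figure from the sentence above, so adjust it for remote backends.

```shell
# check_response STATUS LATENCY_SECONDS [MAX_SECONDS]
# Prints "ok" when STATUS is 200 and the latency is under the limit
# (default 0.5s); otherwise prints the reason and returns non-zero.
check_response() {
  status="$1"; latency="$2"; max="${3:-0.5}"
  if [ "$status" != "200" ]; then
    echo "bad-status"
    return 1
  fi
  # awk does the floating-point comparison that plain sh cannot do
  if awk -v l="$latency" -v m="$max" 'BEGIN { exit !(l + 0 < m + 0) }'; then
    echo "ok"
  else
    echo "slow"
    return 1
  fi
}

# Example (not run here): feed it curl's status code and total time.
# check_response $(curl -s -o /dev/null -w '%{http_code} %{time_total}' \
#   "${STOA_GATEWAY_URL}/your-api/endpoint" -H "X-API-Key: your-api-key")
```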
Check Keycloak Is Accessible
curl -s ${STOA_AUTH_URL}/health/ready
# Expected: {"status":"UP"}
Day 2: Set Up Monitoring
Enable STOA's Built-in Metrics
STOA exposes Prometheus metrics at /metrics. If you're using Docker Compose with the observability stack:
# Check Prometheus is scraping STOA
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "stoa-gateway") | .health'
# Expected: "up"
Open Grafana at http://localhost:3003 to see the pre-built dashboards:
- Gateway Overview: request rate, error rate, latency percentiles
- Consumer Usage: requests by consumer, top consumers by volume
- Rate Limiting: rejected requests over time
Create a Simple Health Check Script
Save this as check-stoa.sh, make it executable with chmod +x check-stoa.sh, and run it via cron:
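#!/bin/bash
# check-stoa.sh: run every 5 minutes via cron
# The default matches the stoa-gateway port shown on Day 1 (3001); 8080 is Keycloak.
GATEWAY_URL="${STOA_GATEWAY_URL:-http://localhost:3001}"
ALERT_EMAIL="you@example.com"

# Capture only the HTTP status code; curl reports 000 if it cannot connect
# or if the 10-second timeout is hit.
response=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$GATEWAY_URL/health")
if [ "$response" != "200" ]; then
  echo "ALERT: STOA gateway returned $response at $(date)" | mail -s "STOA Health Alert" "$ALERT_EMAIL"
fi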
Add to cron:
crontab -e
# Add:
*/5 * * * * /path/to/check-stoa.sh
Set Up Uptime Monitoring
For a free external health check, use UptimeRobot or Better Stack:
- Monitor: https://your-gateway-domain/health
- Check interval: every 5 minutes
- Alert: email or Slack webhook
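An external check catches failures the on-server cron script can miss: the server itself going down or becoming unreachable from the internet, in addition to STOA being down while the server is up.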
Day 3: Manage Your Logs
Where Are the Logs?
STOA logs to stdout by default (Docker captures these). View them:
# Gateway logs (requests, errors, rate limit events)
docker compose logs stoa-gateway --tail=100 --follow
# API logs (control plane, tenant management)
docker compose logs control-plane-api --tail=100 --follow
What to Look For
Normal (ignore these):
INFO request completed path=/health status=200 latency=1ms
INFO rate_limit_check consumer=my-consumer remaining=95/100
Investigate these:
WARN rate_limit_exceeded consumer=my-consumer path=/api/endpoint
ERROR backend_error path=/api/endpoint status=502 error="connection refused"
ERROR auth_failed path=/api/endpoint reason="invalid_api_key"
Fix immediately:
ERROR database_connection_failed
PANIC recovery triggered # This shouldn't happen; file a bug report
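The three buckets above can be turned into a small triage filter. This is a sketch that assumes your log lines contain the same keywords as the samples; the function name classify_log_line and the bucket labels are invented for illustration, not part of STOA.

```shell
# classify_log_line LINE
# Maps one gateway log line to "ignore", "investigate", or "fix-now",
# following the three buckets listed above. Anything else is "unknown".
classify_log_line() {
  case "$1" in
    *database_connection_failed*|*PANIC*) echo "fix-now" ;;
    *rate_limit_exceeded*|*backend_error*|*auth_failed*) echo "investigate" ;;
    *"INFO "*) echo "ignore" ;;
    *) echo "unknown" ;;
  esac
}

# Example (not run here): show only lines that need attention.
# docker compose logs stoa-gateway --since=1h 2>/dev/null |
#   while IFS= read -r line; do
#     [ "$(classify_log_line "$line")" = "ignore" ] || printf '%s\n' "$line"
#   done
```

The fix-now patterns are checked first so that a severe line is never downgraded by a later, broader match.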
Set Up Log Retention
By default, Docker keeps logs indefinitely. Add log rotation to your docker-compose.yml:
services:
  stoa-gateway:
    logging:
      driver: "json-file"
      options:
        max-size: "100m"   # Max 100MB per log file
        max-file: "5"      # Keep 5 rotated files = 500MB max
Restart services after updating:
docker compose up -d
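Note that the 500MB cap applies per service, and the snippet above only configures stoa-gateway. As a rough worked example, the helper below (a hypothetical name, log_budget_mb) multiplies the cap out if you copy the same logging block to every service.

```shell
# log_budget_mb MAX_SIZE_MB MAX_FILES SERVICES
# Worst-case disk used by rotated json-file logs, in MB.
log_budget_mb() {
  echo $(( $1 * $2 * $3 ))
}

# One service with the settings above:
log_budget_mb 100 5 1
# All six services listed on Day 1, each with the same settings:
log_budget_mb 100 5 6
```

With these numbers the two calls print 500 and 3000, so budget about 3GB of disk for logs if all six services use the same limits.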
Query Logs for Specific Events
# Find all 5xx errors in the last hour
docker compose logs stoa-gateway --since=1h 2>/dev/null | grep '"status":5'
# Find rate limit events for a specific consumer
docker compose logs stoa-gateway 2>/dev/null | grep "rate_limit_exceeded.*my-consumer"
# Count requests by status code
docker compose logs stoa-gateway --since=24h 2>/dev/null | grep -oP '"status":\d+' | sort | uniq -c
# Query audit logs via API (structured, filterable)
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?limit=50" \
-H "Authorization: Bearer $TOKEN" | jq '.logs[] | {action, resource, user, created_at}'
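The grep patterns above match JSON-style fields ("status":502), while the samples under "What to Look For" use key=value fields (status=502). If your gateway emits the key=value form, a sketch like this counts requests per status code instead. The function name count_status_codes is an illustration, and the log format is assumed to match those samples.

```shell
# count_status_codes: reads log lines on stdin and prints "<status> <count>"
# for each HTTP status found in a status=NNN field, sorted by status code.
count_status_codes() {
  awk '
    match($0, /status=[0-9]+/) {
      code = substr($0, RSTART + 7, RLENGTH - 7)   # drop the "status=" prefix
      counts[code]++
    }
    END { for (c in counts) print c, counts[c] }
  ' | sort -n
}

# Example (not run here):
# docker compose logs stoa-gateway --since=24h 2>/dev/null | count_status_codes
```

Lines without a status field, such as rate limit events, are skipped rather than counted.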
Day 4: Tune Your Policies
Review Your Rate Limits
After a few days of traffic, check if your limits are right:
TOKEN="your-admin-token"
TENANT_ID="your-tenant-id"
# Get rate limit events from the last 24h
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?event_type=rate_limit_exceeded&hours=24" \
-H "Authorization: Bearer $TOKEN" | jq '.total_events'
Tune based on what you see:
| Events/Day | Action |
|---|---|
| 0 | Your limits might be too generous; fine unless you have high traffic |
| 1-50 | Normal: clients occasionally burst over the limit |
| 51-500 | Review which consumers are hitting limits; you might need tiered plans |
| More than 500 | Either a misbehaving client or your limits are too low for your traffic |
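If you want the weekly review to print a recommendation automatically, the table can be encoded as a small function. This is a sketch: the name rate_limit_action and the short labels are made up, and treating the boundary values 50 and 500 as part of the lower band is an assumption.

```shell
# rate_limit_action EVENTS_PER_DAY
# Returns the suggested action for a daily count of rate_limit_exceeded events.
rate_limit_action() {
  events="$1"
  if [ "$events" -eq 0 ]; then
    echo "limits may be too generous"
  elif [ "$events" -le 50 ]; then
    echo "normal"
  elif [ "$events" -le 500 ]; then
    echo "review consumers"
  else
    echo "misbehaving client or limits too low"
  fi
}

# Example (not run here), using the audit query shown earlier:
# rate_limit_action "$(curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?event_type=rate_limit_exceeded&hours=24" \
#   -H "Authorization: Bearer $TOKEN" | jq '.total_events')"
```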
Update a Rate Limit Policy
POLICY_ID="your-policy-id"
curl -s -X PATCH "${STOA_API_URL}/v1/admin/policies/$POLICY_ID" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"requests_per_minute": 200,
"burst": 20
}
}' | jq .
Rate limit changes take effect immediately; no restart is required.
Add a CORS Policy (If Serving Browser Clients)
If your API is called from web browsers, you need CORS headers:
CORS_POLICY_ID=$(curl -s -X POST "${STOA_API_URL}/v1/admin/policies" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "browser-cors",
"policy_type": "cors",
"tenant_id": "'$TENANT_ID'",
"scope": "api",
"config": {
"origins": ["https://yourapp.com", "http://localhost:3000"],
"methods": ["GET", "POST", "PUT", "DELETE"],
"headers": ["Content-Type", "X-API-Key"],
"max_age": 3600
}
}' | jq -r .id)
# Bind to your API (set API_ID to your API's catalog ID first)
curl -s -X POST "${STOA_API_URL}/v1/admin/policies/bindings" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"policy_id": "'$CORS_POLICY_ID'",
"api_catalog_id": "'$API_ID'",
"tenant_id": "'$TENANT_ID'"
}' | jq .
Day 5: Onboard Your First Consumer
Invite a Client to the Developer Portal
Your clients can self-register at http://localhost:3002 (or your production portal URL). Walk them through:
- Sign up: click "Request Access", fill in name + email
- Browse APIs: they see your published APIs with descriptions
- Subscribe: click "Subscribe" on the API they need
- Get their key: after you approve, they get an API key in their dashboard
You approve subscriptions in the Console: Subscriptions > Pending > Approve.
Programmatic Consumer Creation (For Automation)
If you're building a SaaS and want to auto-provision API keys when users sign up:
# Step 1: Create the consumer
CONSUMER_ID=$(curl -s -X POST "${STOA_API_URL}/v1/consumers/$TENANT_ID" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"external_id": "user-'$(date +%s)'",
"name": "user-12345",
"email": "user@example.com",
"consumer_metadata": {
"user_id": "12345",
"plan": "starter"
}
}' | jq -r .id)
# Step 2: Create a subscription (get the API key)
API_KEY=$(curl -s -X POST "${STOA_API_URL}/v1/subscriptions" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"consumer_id": "'$CONSUMER_ID'",
"api_id": "'$API_ID'",
"tenant_id": "'$TENANT_ID'"
}' | jq -r .api_key)
echo "API Key: $API_KEY"
Store this key in your user's account. You can revoke it later:
SUBSCRIPTION_ID="subscription-id-from-above"
curl -s -X POST "${STOA_API_URL}/v1/subscriptions/$SUBSCRIPTION_ID/revoke" \
-H "Authorization: Bearer $TOKEN"
Set Consumer-Specific Limits
To give a premium consumer a higher rate limit, create a dedicated policy at the tenant scope and bind it to their consumer:
# Create a high-limit policy scoped to a specific consumer
curl -s -X POST "${STOA_API_URL}/v1/admin/policies" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "premium-rate-limit-'$CONSUMER_ID'",
"policy_type": "rate_limit",
"tenant_id": "'$TENANT_ID'",
"scope": "tenant",
"config": {
"requests_per_minute": 1000,
"burst": 50
},
"priority": 50
}' | jq .
Higher priority (lower number) policies override lower priority ones for the same consumer.
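To make the ordering concrete, here is a sketch of "lowest number wins" using plain sort. It only illustrates the rule as stated; how the gateway breaks ties between equal priorities is not covered here, and both the function name effective_policy and the priority of 100 for a default policy are hypothetical.

```shell
# effective_policy: reads "PRIORITY NAME" lines on stdin and prints the
# name of the policy with the lowest priority number.
effective_policy() {
  sort -n -k1,1 | head -n 1 | cut -d' ' -f2-
}

# A hypothetical default policy at priority 100 loses to the premium
# policy created above at priority 50:
printf '%s\n' '100 default-rate-limit' '50 premium-rate-limit' | effective_policy
```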
Day 6: Prepare for Incidents
Build Your Runbook
Before something goes wrong, write down what you'll do. Keep it simple:
# STOA Incident Runbook
## Gateway Down (5xx on health check)
1. Check service: docker compose ps
2. Check logs: docker compose logs stoa-gateway --tail=50
3. Restart if needed: docker compose restart stoa-gateway
4. If still down: docker compose down && docker compose up -d
## Database Down
1. Check: docker compose ps postgres
2. Check logs: docker compose logs postgres --tail=20
3. Restart: docker compose restart postgres
4. Wait 30s, then restart dependent services
## Consumer Locked Out (invalid key)
1. Look up consumer: GET /v1/consumers/$TENANT_ID (filter by email)
2. Get their subscriptions: GET /v1/subscriptions/tenant/$TENANT_ID
3. Rotate key: POST /v1/subscriptions/$SUBSCRIPTION_ID/rotate-key
4. Send new key to client
## Rate Limit Misconfiguration
1. Identify policy: GET /v1/tenants/$TENANT_ID/policies
2. Adjust: PATCH /v1/tenants/$TENANT_ID/policies/$POLICY_ID
3. Changes take effect immediately
Test Your Recovery Procedures
Practice before you need them:
# Test restart (should take <10s)
time docker compose restart stoa-gateway
# Test full stop/start (should take <30s)
time (docker compose down && docker compose up -d)
# Test key rotation
curl -s -X POST "${STOA_API_URL}/v1/subscriptions/$SUBSCRIPTION_ID/rotate-key" \
-H "Authorization: Bearer $TOKEN" | jq .api_key
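To turn the "should take under 10s / under 30s" targets into a pass/fail drill, you could wrap the commands in a timing helper. This is a sketch with a hypothetical name (within_budget); it measures whole seconds with date, so it is only suitable for budgets of a few seconds or more.

```shell
# within_budget BUDGET_SECONDS COMMAND [ARGS...]
# Runs the command and prints the elapsed time. Returns 0 if the command
# succeeded in fewer than BUDGET_SECONDS, 1 if it was too slow, and 2 if
# the command itself failed.
within_budget() {
  budget="$1"; shift
  start=$(date +%s)
  "$@" || return 2
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "elapsed=${elapsed}s budget=${budget}s"
  [ "$elapsed" -lt "$budget" ]
}

# Examples (not run here):
# within_budget 10 docker compose restart stoa-gateway
# within_budget 30 sh -c 'docker compose down && docker compose up -d'
```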
Day 7: Weekly Review Checklist
Run this every week:
#!/bin/bash
# weekly-stoa-review.sh
# Expects STOA_API_URL, TENANT_ID, and TOKEN to be set in the environment
echo "=== STOA Weekly Review $(date +%Y-%m-%d) ==="
echo ""
echo "--- Service Health ---"
docker compose ps
echo ""
echo "--- 5xx Error Count (last 7 days) ---"
# grep -c prints 0 on its own when nothing matches, so no fallback echo is needed
docker compose logs stoa-gateway --since=168h 2>/dev/null | grep -c '"status":5'
echo ""
echo "--- Top Rate Limited Consumers ---"
curl -s "${STOA_API_URL}/v1/tenants/$TENANT_ID/logs?event_type=rate_limit_exceeded&hours=168" \
-H "Authorization: Bearer $TOKEN" | jq '.logs | group_by(.consumer) | map({consumer: .[0].consumer, count: length}) | sort_by(.count) | reverse | .[0:5]'
echo ""
echo "--- Active Consumers ---"
curl -s "${STOA_API_URL}/v1/tenants/$TENANT_ID/consumers" \
-H "Authorization: Bearer $TOKEN" | jq '.total'
echo ""
echo "--- Disk Usage ---"
docker system df
What to Act On
| Finding | Action |
|---|---|
| Any service unhealthy | Investigate logs immediately |
| Error rate >1% | Find root cause, fix before it grows |
| Consumer hitting rate limits daily | Consider upgrading their plan or increasing their limit |
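| Disk >80% | Free space with docker system prune (removes stopped containers, unused networks, dangling images, and build cache). It does not shrink the logs of running containers; use the Day 3 log rotation for that |
| No requests for 24h+ | Check your client integration; something may have broken |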
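The weekly script prints a count of 5xx lines, while the threshold in the table is a percentage, so you also need the total number of requests for the same period. A minimal sketch of the comparison follows; the function name error_rate_exceeds is made up, and how you obtain the total (for example, by counting all request lines in the gateway log) is left as an assumption.

```shell
# error_rate_exceeds TOTAL_REQUESTS ERROR_COUNT [THRESHOLD_PERCENT]
# Succeeds (exit 0) when errors/total is strictly above the threshold
# (default 1 percent). With zero traffic there is no rate, so it fails.
error_rate_exceeds() {
  total="$1"; errors="$2"; threshold="${3:-1}"
  [ "$total" -gt 0 ] || return 1
  awk -v t="$total" -v e="$errors" -v p="$threshold" \
    'BEGIN { exit !(e * 100 / t > p) }'
}

# Example: 11 errors out of 1000 requests is 1.1%, above the 1% line.
if error_rate_exceeds 1000 11; then
  echo "error rate above 1%: investigate"
fi
```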
Common Week 1 Issues
"Consumer gets 401 on every request"
The API key is either wrong or not associated with the right API:
# Verify the consumer and their API associations
curl -s "${STOA_API_URL}/v1/tenants/$TENANT_ID/consumers?email=client@example.com" \
-H "Authorization: Bearer $TOKEN" | jq '.consumers[0] | {api_key, api_ids}'
"Gateway returns 502 Bad Gateway"
Your backend is unreachable from the gateway container:
# Test backend connectivity from inside the gateway container
docker compose exec stoa-gateway curl -s http://your-backend:port/health
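Before running that test, it can help to check whether the backend URL you configured points at a loopback address, since inside a container localhost refers to the container itself rather than your host. The helper below is a sketch with a hypothetical name (is_loopback_url); it does simple string matching and does not handle IPv6 literals such as [::1].

```shell
# is_loopback_url URL
# Succeeds when the URL's host is localhost or a 127.x.x.x address.
is_loopback_url() {
  host="${1#*://}"      # drop the scheme, if any
  host="${host%%/*}"    # drop the path
  host="${host##*@}"    # drop userinfo, if any
  host="${host%%:*}"    # drop the port
  case "$host" in
    localhost|127.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
if is_loopback_url "http://localhost:9000/health"; then
  echo "backend URL uses a loopback address; the gateway container cannot reach it"
fi
```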
Common causes:
- Using localhost for the backend URL (doesn't work inside Docker; use host.docker.internal on Mac/Windows)
- Backend service isn't started
- Firewall blocking the connection
"Logs fill up disk within days"
Add the log rotation config from Day 3, or reduce your log verbosity:
# In docker-compose.yml, under the service definition
environment:
  - LOG_LEVEL=warn  # Only log warnings and errors (default is info)
"Keycloak token expired, admin API returns 401"
Tokens expire after 5 minutes by default:
# Always get a fresh token before admin API calls
get_token() {
  curl -s -X POST "${STOA_AUTH_URL}/realms/stoa/protocol/openid-connect/token" \
    -d "client_id=control-plane-api&client_secret=${KC_SECRET}&grant_type=client_credentials" \
    | jq -r .access_token
}
TOKEN=$(get_token)
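In longer scripts you may not want to request a new token before every single call. One option is to remember when the token was issued and refresh only when it is close to expiry. The sketch below assumes the 5-minute (300s) lifetime mentioned above plus a 30-second safety margin; the function name token_needs_refresh and the TOKEN_ISSUED_AT variable are hypothetical.

```shell
# token_needs_refresh ISSUED_AT NOW [LIFETIME_SECONDS] [MARGIN_SECONDS]
# Succeeds when the token is within MARGIN_SECONDS of expiring or has
# already expired. Times are Unix timestamps in seconds.
token_needs_refresh() {
  issued_at="$1"; now="$2"; lifetime="${3:-300}"; margin="${4:-30}"
  [ $(( now - issued_at )) -ge $(( lifetime - margin )) ]
}

# Example (not run here), reusing get_token from above:
# TOKEN_ISSUED_AT=0
# if token_needs_refresh "$TOKEN_ISSUED_AT" "$(date +%s)"; then
#   TOKEN=$(get_token)
#   TOKEN_ISSUED_AT=$(date +%s)
# fi
```

If you change the access token lifespan in Keycloak, pass the new value as the third argument.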
What Comes Next
After week 1, you should have a stable, monitored API gateway. From here:
- Security Series: Part 1 - Your APIs Are More Vulnerable Than You Think: deeper security hardening
- Rate Limiting Strategies That Actually Work: beyond basic rate limiting
- Authentication Guide: JWT, OAuth 2.0, and mTLS options
- Observability Guide: full Prometheus + Grafana setup
- Consumer Onboarding: self-service portal workflows