Observability
STOA includes a built-in observability stack with Prometheus for metrics, Grafana for dashboards, and Loki for log aggregation.
Stack Overview
| Component | Purpose | Access |
|---|---|---|
| Prometheus | Metrics collection and alerting | Internal (in-cluster) |
| Grafana | Dashboards and visualization | https://grafana.<YOUR_DOMAIN> |
| Loki | Log aggregation and search | Via Grafana |
Metrics
Control Plane API
The FastAPI backend exposes Prometheus metrics at /metrics:
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests by method, path, status |
| http_request_duration_seconds | Histogram | Request latency distribution |
| db_query_duration_seconds | Histogram | Database query latency |
| kafka_events_published_total | Counter | Events published to Kafka |
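As a rough illustration of how these series can be consumed, the sketch below parses a few lines of Prometheus text exposition format and computes the share of 5xx responses from http_requests_total. Only the metric name and the method, path, and status labels come from the table above; the sample values, the path values, and the helper names parse_samples and error_ratio are assumptions made for the example.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
# Simplification: label values containing "}" are not handled.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition text into (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def error_ratio(samples):
    """Fraction of http_requests_total with a 5xx status (lifetime totals, not a rate)."""
    total = errors = 0.0
    for name, labels, value in samples:
        if name != "http_requests_total":
            continue
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


# Hypothetical scrape output; the numbers are made up for illustration.
EXAMPLE = """\
# HELP http_requests_total Total HTTP requests by method, path, status
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/v1/apis",status="200"} 940
http_requests_total{method="POST",path="/v1/apis",status="201"} 40
http_requests_total{method="GET",path="/v1/apis",status="500"} 20
"""

print(error_ratio(parse_samples(EXAMPLE)))
```

In practice the same ratio would normally be computed in PromQL over a time window rather than from raw counter totals; the Python version is only meant to show what the counter and its labels contain.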
MCP Gateway
| Metric | Type | Description |
|---|---|---|
| mcp_tool_calls_total | Counter | Tool invocations by name, tenant, status |
| mcp_tool_call_duration_seconds | Histogram | Tool call latency |
| opa_policy_evaluations_total | Counter | OPA policy checks by result |
| mcp_active_connections | Gauge | Current SSE connections |
Kafka / Redpanda
| Metric | Type | Description |
|---|---|---|
| kafka_consumer_lag | Gauge | Consumer lag per topic/partition |
| kafka_messages_in_total | Counter | Messages produced per topic |
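Consumer lag is conventionally the distance between a partition's log end offset and the consumer group's committed offset. The sketch below shows that arithmetic per topic/partition; the topic name, the offsets, and the choice to treat a missing committed offset as 0 are assumptions for illustration, not a description of how STOA's exporter computes kafka_consumer_lag.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return lag per (topic, partition): log end offset minus committed offset.

    Both arguments map (topic, partition) -> offset. A partition with no
    committed offset is treated as committed at 0 (an assumption of this sketch).
    """
    lag = {}
    for topic_partition, end in end_offsets.items():
        committed = committed_offsets.get(topic_partition, 0)
        # Clamp at zero in case the two offsets were read at slightly different times.
        lag[topic_partition] = max(end - committed, 0)
    return lag


# Hypothetical topic name and offsets.
end_offsets = {("stoa.events", 0): 15200, ("stoa.events", 1): 9800}
committed = {("stoa.events", 0): 15000, ("stoa.events", 1): 9800}

for (topic, partition), value in sorted(consumer_lag(end_offsets, committed).items()):
    print(f"{topic}[{partition}] lag={value}")
```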
Grafana Dashboards
Platform Overview
The main dashboard shows:
- Request rate and error rate across all services
- Latency percentiles (p50, p95, p99)
- Active users and sessions
- Database connection pool usage
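Latency percentiles such as p95 are typically estimated from histogram buckets rather than measured exactly. The sketch below shows a simplified version of the linear-interpolation approach used by Prometheus's histogram_quantile function; the bucket boundaries and counts are made up, and the function name estimate_quantile is hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with the +Inf bucket, as a Prometheus histogram exposes them.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical buckets for http_request_duration_seconds (seconds, cumulative counts).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]

for q in (0.5, 0.95, 0.99):
    print(f"p{int(q * 100)} ~ {estimate_quantile(q, BUCKETS):.3f}s")
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies of interest.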
MCP Gateway Dashboard
- Tool call rate by tenant and tool name
- OPA policy allow/deny ratio
- SSE connection count
- Metering pipeline throughput
Tenant Dashboard
Per-tenant view showing:
- API call volume
- Subscription activity
- Error breakdown by API
- Rate limit utilization
Log Aggregation
Querying Logs
Access logs via Grafana's Explore view with Loki:
# All error logs (set the time range to the last 5 minutes in the Explore time picker)
{job=~".+"} | json | level="ERROR"
# Control Plane API logs for a specific tenant
{app="control-plane-api"} | json | tenant="acme"
# MCP Gateway tool calls
{app="mcp-gateway"} |= "tool.call"
# Slow requests (>1s)
{app="control-plane-api"} | json | duration > 1s
Log Format
All STOA services log in structured JSON format:
{
  "timestamp": "2026-02-04T10:30:00Z",
  "level": "INFO",
  "service": "control-plane-api",
  "message": "API published",
  "tenant": "acme",
  "api": "petstore",
  "version": "1.0.0",
  "trace_id": "abc123"
}
Set LOG_FORMAT=text for human-readable output during local development.
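For readers who want to emit compatible logs from their own Python code, the following is a minimal sketch of a standard-library formatter that produces the fields shown above. It is not STOA's actual logging setup; the class name JsonFormatter and the way optional fields are passed through `extra` are assumptions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render log records as one JSON object per line."""

    # Optional context fields copied from the record when present.
    CONTEXT_FIELDS = ("tenant", "api", "version", "trace_id")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Values passed via logger.info(..., extra={...}) become record attributes.
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="control-plane-api"))
logger = logging.getLogger("stoa.example")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("API published", extra={"tenant": "acme", "api": "petstore", "version": "1.0.0"})
```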
Alerting
Prometheus Alert Rules
STOA ships with default alert rules:
| Alert | Condition | Severity |
|---|---|---|
| HighErrorRate | Error rate > 5% for 5 minutes | Critical |
| HighLatency | p99 latency > 2s for 5 minutes | Warning |
| DatabaseConnectionExhausted | Connection pool > 90% | Critical |
| KafkaConsumerLag | Lag > 10000 for 10 minutes | Warning |
| MCPGatewayDown | No healthy pods for 1 minute | Critical |
| VaultSealed | Vault sealed status detected | Critical |
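Conditions of the form "X for 5 minutes" mean the threshold must be breached continuously for that long before the alert fires; in Prometheus terms the alert is pending until then. The sketch below models that behavior for a single series. The function name alert_state, the 60-second sample spacing, and the sample values are assumptions for illustration.

```python
def alert_state(samples, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' as of the last sample.

    samples: list of (timestamp_seconds, value) in ascending time order.
    The alert fires only once the value has stayed above the threshold
    for at least for_seconds without interruption.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample at or below the threshold resets the timer.
            breach_start = None
    if breach_start is None:
        return "inactive"
    last_timestamp = samples[-1][0]
    return "firing" if last_timestamp - breach_start >= for_seconds else "pending"


# Hypothetical error-rate samples, one per minute, for the HighErrorRate condition.
sustained = [(0, 0.06), (60, 0.07), (120, 0.06), (180, 0.08), (240, 0.06), (300, 0.09)]
interrupted = [(0, 0.06), (60, 0.02), (120, 0.06), (180, 0.08)]

print(alert_state(sustained, threshold=0.05, for_seconds=300))
print(alert_state(interrupted, threshold=0.05, for_seconds=300))
```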
Alert Destinations
Configure alert routing in alertmanager.yaml:
- Slack: Channel notifications for team alerts
- PagerDuty: On-call escalation for critical alerts
- Email: Summary digests for non-critical alerts
Runbooks
STOA includes operational runbooks organized by severity:
Critical
- Gateway Down: Gateway unreachable
- Database Connection: PostgreSQL connectivity issues
- Vault Sealed: Vault requires unsealing
- Vault Restore: Vault disaster recovery
High Priority
- Certificate Expiration: TLS cert approaching expiry
- Gateway High Latency: Response time degradation
- Kafka Lag: Consumer falling behind
Medium Priority
- API Rollback: Revert a bad API deployment
- Gateway Adapter Failure: Gateway instance unreachable or degraded
Health Endpoints
All services expose health check endpoints:
| Endpoint | Purpose |
|---|---|
| /health/ready | Readiness probe: all dependencies reachable |
| /health/live | Liveness probe: process alive |
| /health/startup | Startup probe: initialization complete |
curl ${STOA_API_URL}/health/ready
# {"status": "ok"}
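The same readiness check can be scripted, for example in a deployment smoke test. The sketch below uses only the Python standard library and assumes the response shape shown above (HTTP 200 with a JSON body whose status is "ok"). The function name is_ready and the injectable opener parameter are hypothetical conveniences, not part of STOA.

```python
import json
import urllib.error
import urllib.request


def is_ready(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health/ready answers 200 with status "ok".

    opener is injectable so the function can be exercised without a network.
    """
    url = base_url.rstrip("/") + "/health/ready"
    try:
        with opener(url, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
            return (
                response.status == 200
                and isinstance(body, dict)
                and body.get("status") == "ok"
            )
    except (urllib.error.URLError, ValueError, OSError):
        # Connection errors, non-2xx responses, and malformed JSON all count as not ready.
        return False
```

A caller would typically pass the value of STOA_API_URL from the environment, for example `is_ready(os.environ["STOA_API_URL"])`, and retry with a delay until it returns True or a deadline passes.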