Observability

STOA includes a built-in observability stack with Prometheus for metrics, Grafana for dashboards, and Loki for log aggregation.

Stack Overview

Component	Purpose	Access
Prometheus	Metrics collection and alerting	Internal (in-cluster)
Grafana	Dashboards and visualization	`https://grafana.<YOUR_DOMAIN>`
Loki	Log aggregation and search	Via Grafana

Metrics

Control Plane API

The FastAPI backend exposes Prometheus metrics at /metrics:

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests by method, path, status
`http_request_duration_seconds`	Histogram	Request latency distribution
`db_query_duration_seconds`	Histogram	Database query latency
`kafka_events_published_total`	Counter	Events published to Kafka

MCP Gateway

Metric	Type	Description
`mcp_tool_calls_total`	Counter	Tool invocations by name, tenant, status
`mcp_tool_call_duration_seconds`	Histogram	Tool call latency
`opa_policy_evaluations_total`	Counter	OPA policy checks by result
`mcp_active_connections`	Gauge	Current SSE connections

Kafka / Redpanda

Metric	Type	Description
`kafka_consumer_lag`	Gauge	Consumer lag per topic/partition
`kafka_messages_in_total`	Counter	Messages produced per topic

Grafana Dashboards

Platform Overview

The main dashboard shows:

Request rate and error rate across all services
Latency percentiles (p50, p95, p99)
Active users and sessions
Database connection pool usage

MCP Gateway Dashboard

Tool call rate by tenant and tool name
OPA policy allow/deny ratio
SSE connection count
Metering pipeline throughput

Tenant Dashboard

Per-tenant view showing:

API call volume
Subscription activity
Error breakdown by API
Rate limit utilization

Log Aggregation

Querying Logs

Access logs via Grafana's Explore view with Loki:

# All error logs in the last 5 minutes
{job=~".+"} |= "level=error"

# Control Plane API logs for a specific tenant
{app="control-plane-api"} |= "tenant=acme"

# MCP Gateway tool calls
{app="mcp-gateway"} |= "tool.call"

# Slow requests (>1s)
{app="control-plane-api"} | json | duration > 1s

Log Format

All STOA services log in structured JSON format:

{
  "timestamp": "2026-02-04T10:30:00Z",
  "level": "INFO",
  "service": "control-plane-api",
  "message": "API published",
  "tenant": "acme",
  "api": "petstore",
  "version": "1.0.0",
  "trace_id": "abc123"
}

Set LOG_FORMAT=text for human-readable output during local development.

Alerting

Prometheus Alert Rules

STOA ships with default alert rules:

Alert	Condition	Severity
`HighErrorRate`	Error rate > 5% for 5 minutes	Critical
`HighLatency`	p99 latency > 2s for 5 minutes	Warning
`DatabaseConnectionExhausted`	Connection pool > 90%	Critical
`KafkaConsumerLag`	Lag > 10000 for 10 minutes	Warning
`MCPGatewayDown`	No healthy pods for 1 minute	Critical
`VaultSealed`	Vault sealed status detected	Critical

Alert Destinations

Configure alert routing in alertmanager.yaml:

Slack: Channel notifications for team alerts
PagerDuty: On-call escalation for critical alerts
Email: Summary digests for non-critical alerts

Runbooks

STOA includes operational runbooks organized by severity:

Critical

Gateway Down — Gateway unreachable
Database Connection — PostgreSQL connectivity issues
Vault Sealed — Vault requires unsealing
Vault Restore — Vault disaster recovery

High Priority

Certificate Expiration — TLS cert approaching expiry
Gateway High Latency — Response time degradation
Kafka Lag — Consumer falling behind

Medium Priority

API Rollback — Revert a bad API deployment
Gateway Adapter Failure — Gateway instance unreachable or degraded

Health Endpoints

All services expose health check endpoints:

Endpoint	Purpose
`/health/ready`	Readiness probe — all dependencies reachable
`/health/live`	Liveness probe — process alive
`/health/startup`	Startup probe — initialization complete

curl ${STOA_API_URL}/health/ready
# {"status": "ok"}

Stack Overview​

Metrics​

Control Plane API​

MCP Gateway​

Kafka / Redpanda​

Grafana Dashboards​

Platform Overview​

MCP Gateway Dashboard​

Tenant Dashboard​

Log Aggregation​

Querying Logs​

Log Format​

Alerting​

Prometheus Alert Rules​

Alert Destinations​

Runbooks​

Critical​

High Priority​

Medium Priority​

Health Endpoints​