Observability

STOA includes a built-in observability stack with Prometheus for metrics, Grafana for dashboards, and Loki for log aggregation.

Stack Overview

Component | Purpose | Access
Prometheus | Metrics collection and alerting | Internal (in-cluster)
Grafana | Dashboards and visualization | https://grafana.<YOUR_DOMAIN>
Loki | Log aggregation and search | Via Grafana

Metrics

Control Plane API

The FastAPI backend exposes Prometheus metrics at /metrics:

Metric | Type | Description
http_requests_total | Counter | Total HTTP requests by method, path, status
http_request_duration_seconds | Histogram | Request latency distribution
db_query_duration_seconds | Histogram | Database query latency
kafka_events_published_total | Counter | Events published to Kafka
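
As a rough illustration of how these counters can be consumed, the sketch below parses a Prometheus text exposition payload and computes the share of requests that ended in a 5xx status. It uses only the Python standard library. The label names (`method`, `path`, `status`) and the sample values are assumptions based on the table above, not output captured from STOA, and the parser deliberately handles only simple sample lines.

```python
import re

# Matches simple sample lines such as:
#   http_requests_total{method="GET",status="200"} 42
# An optional trailing timestamp is tolerated and ignored.
SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


def error_ratio(samples):
    """Share of http_requests_total whose status label starts with 5."""
    total = errors = 0.0
    for name, labels, value in samples:
        if name != "http_requests_total":
            continue
        total += value
        if labels.get("status", "").startswith("5"):
            errors += value
    return errors / total if total else 0.0


# Hypothetical /metrics payload, for illustration only.
SAMPLE = """
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/v1/apis",status="200"} 90
http_requests_total{method="POST",path="/v1/apis",status="500"} 10
"""

ratio = error_ratio(parse_metrics(SAMPLE))
```

Note that this works on cumulative counter values since process start. An alerting rule would normally compare rates over a time window instead, which Prometheus computes server-side.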

MCP Gateway

Metric | Type | Description
mcp_tool_calls_total | Counter | Tool invocations by name, tenant, status
mcp_tool_call_duration_seconds | Histogram | Tool call latency
opa_policy_evaluations_total | Counter | OPA policy checks by result
mcp_active_connections | Gauge | Current SSE connections

Kafka / Redpanda

Metric | Type | Description
kafka_consumer_lag | Gauge | Consumer lag per topic/partition
kafka_messages_in_total | Counter | Messages produced per topic

Grafana Dashboards

Platform Overview

The main dashboard shows:

  • Request rate and error rate across all services
  • Latency percentiles (p50, p95, p99)
  • Active users and sessions
  • Database connection pool usage
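
The latency percentiles above are derived from histogram buckets rather than from individual request timings. The following sketch shows one way such an estimate can be computed: it finds the bucket containing the requested rank and interpolates linearly inside it, similar in spirit to PromQL's `histogram_quantile`. The bucket boundaries and counts are made up for illustration and are not STOA's actual bucket configuration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets for http_request_duration_seconds.
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.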

MCP Gateway Dashboard

  • Tool call rate by tenant and tool name
  • OPA policy allow/deny ratio
  • SSE connection count
  • Metering pipeline throughput

Tenant Dashboard

Per-tenant view showing:

  • API call volume
  • Subscription activity
  • Error breakdown by API
  • Rate limit utilization

Log Aggregation

Querying Logs

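Query logs in Grafana's Explore view with the Loki data source. The queries below use LogQL; the time range (for example, the last 5 minutes) is set in Grafana's time picker, not in the query:

# All error logs
{job=~".+"} | json | level="ERROR"

# Control Plane API logs for a specific tenant
{app="control-plane-api"} | json | tenant="acme"

# MCP Gateway tool calls
{app="mcp-gateway"} |= "tool.call"

# Slow requests (>1s); assumes the log line carries a duration field
{app="control-plane-api"} | json | duration > 1s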
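
For scripted access outside Grafana, the same LogQL queries can in principle be sent to Loki's HTTP API. The sketch below only builds the request URL for the `query_range` endpoint and does not perform the request. The base URL `http://loki:3100` is a hypothetical in-cluster address; whether Loki is reachable directly in a given STOA deployment is an assumption, since the stack overview lists access via Grafana.

```python
import time
from urllib.parse import urlencode


def last_minutes_range(minutes, now_ns=None):
    """Return (start_ns, end_ns) covering the last `minutes` minutes."""
    end_ns = time.time_ns() if now_ns is None else now_ns
    start_ns = end_ns - minutes * 60 * 1_000_000_000
    return start_ns, end_ns


def loki_query_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a Loki query_range URL for a LogQL query (timestamps in ns)."""
    params = {
        "query": logql,
        "start": str(start_ns),
        "end": str(end_ns),
        "limit": str(limit),
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical base URL; adjust to wherever Loki is exposed, if at all.
start, end = last_minutes_range(5)
url = loki_query_url("http://loki:3100", '{app="mcp-gateway"} |= "tool.call"', start, end)
```

The resulting URL can be fetched with any HTTP client that carries whatever authentication the deployment requires.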

Log Format

All STOA services log in structured JSON format:

{
  "timestamp": "2026-02-04T10:30:00Z",
  "level": "INFO",
  "service": "control-plane-api",
  "message": "API published",
  "tenant": "acme",
  "api": "petstore",
  "version": "1.0.0",
  "trace_id": "abc123"
}

Set LOG_FORMAT=text for human-readable output during local development.
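
To make the format concrete, here is a minimal sketch of how a Python service could emit log lines in this shape with the standard `logging` module and switch to plain text when `LOG_FORMAT=text`. This is not STOA's actual logging code: the `JsonFormatter` class, the `context` attribute used to carry fields such as `tenant`, and the default of `json` when the variable is unset are all assumptions made for illustration.

```python
import json
import logging
import os
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object (hypothetical helper)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Extra fields (tenant, api, trace_id, ...) passed via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_formatter(service, log_format=None):
    """Pick a formatter from LOG_FORMAT; JSON is assumed to be the default."""
    log_format = log_format or os.environ.get("LOG_FORMAT", "json")
    if log_format == "text":
        return logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    return JsonFormatter(service)


handler = logging.StreamHandler()
handler.setFormatter(build_formatter("control-plane-api"))
logger = logging.getLogger("stoa.example")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as `logger.info("API published", extra={"context": {"tenant": "acme", "api": "petstore"}})` would then produce a single JSON line with those fields merged in.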

Alerting

Prometheus Alert Rules

STOA ships with default alert rules:

Alert | Condition | Severity
HighErrorRate | Error rate > 5% for 5 minutes | Critical
HighLatency | p99 latency > 2s for 5 minutes | Warning
DatabaseConnectionExhausted | Connection pool > 90% | Critical
KafkaConsumerLag | Lag > 10000 for 10 minutes | Warning
MCPGatewayDown | No healthy pods for 1 minute | Critical
VaultSealed | Vault sealed status detected | Critical
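
Several of these conditions include a duration ("for 5 minutes"). In Prometheus, such a rule does not fire on the first breach: the condition must hold continuously for the whole period, and any sample back under the threshold resets the clock. The sketch below models that behavior on a list of evaluated samples. It is a simplified illustration, not Prometheus's implementation, and the sample values are invented.

```python
def alert_state(samples, threshold, for_seconds):
    """Classify an alert as 'inactive', 'pending', or 'firing'.

    samples: list of (timestamp_seconds, value) pairs sorted by time.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # condition cleared: reset the clock
    if breach_start is None:
        return "inactive"
    last_ts = samples[-1][0]
    return "firing" if last_ts - breach_start >= for_seconds else "pending"


# Hypothetical error-rate samples, one per minute (HighErrorRate: > 5% for 5 minutes).
short_breach = [(0, 0.01), (60, 0.06), (120, 0.07), (180, 0.08)]
long_breach = short_breach + [(240, 0.06), (300, 0.06), (360, 0.06)]
recovered = long_breach + [(420, 0.02)]

states = [alert_state(s, 0.05, 300) for s in (short_breach, long_breach, recovered)]
```

Here the three sample sets evaluate to pending, firing, and inactive respectively, which is why a brief spike does not page anyone.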

Alert Destinations

Configure alert routing in alertmanager.yaml:

  • Slack: Channel notifications for team alerts
  • PagerDuty: On-call escalation for critical alerts
  • Email: Summary digests for non-critical alerts

Runbooks

STOA includes operational runbooks organized by severity:

Critical

  • Gateway Down — Gateway unreachable
  • Database Connection — PostgreSQL connectivity issues
  • Vault Sealed — Vault requires unsealing
  • Vault Restore — Vault disaster recovery

High Priority

  • Certificate Expiration — TLS cert approaching expiry
  • Gateway High Latency — Response time degradation
  • Kafka Lag — Consumer falling behind

Medium Priority

  • API Rollback — Revert a bad API deployment
  • Gateway Adapter Failure — Gateway instance unreachable or degraded

Health Endpoints

All services expose health check endpoints:

Endpoint | Purpose
/health/ready | Readiness probe: all dependencies reachable
/health/live | Liveness probe: process alive
/health/startup | Startup probe: initialization complete

curl ${STOA_API_URL}/health/ready
# {"status": "ok"}
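
The same readiness check can be scripted, for example in a deployment smoke test. The sketch below uses only the Python standard library and treats the service as ready when `/health/ready` returns HTTP 200 with `"status": "ok"`, matching the response shown above. The helper names and the injectable `fetch` parameter are illustrative choices, not part of STOA.

```python
import json
import urllib.request


def fetch(url, timeout=5.0):
    """Return (status_code, body_bytes) for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()


def is_ready(base_url, fetch=fetch):
    """True if the readiness endpoint answers 200 with status 'ok'."""
    try:
        status, body = fetch(f"{base_url.rstrip('/')}/health/ready")
        payload = json.loads(body)
    except (OSError, ValueError):
        # Connection errors, HTTP error statuses, and invalid JSON all mean "not ready".
        return False
    return status == 200 and isinstance(payload, dict) and payload.get("status") == "ok"
```

Calling `is_ready(os.environ["STOA_API_URL"])` after `import os` would run the check against the same base URL as the `curl` example. Passing a custom `fetch` makes the function easy to exercise without a running service.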