Monitoring & Alerting

STOA integrates with Prometheus and Grafana for metrics collection, dashboards, and alerting. This guide covers setup, custom metrics, SLO rules, and alert configuration.

Architecture

Prerequisites

Install the kube-prometheus-stack Helm chart (includes Prometheus, Grafana, and Alertmanager):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f prometheus-values.yaml

ServiceMonitors

STOA's Helm chart ships ServiceMonitor resources (the ServiceMonitor CRD itself is installed by kube-prometheus-stack). Enable them in values.yaml:

stoaGateway:
  serviceMonitor:
    enabled: true
    interval: 15s
    scrapeTimeout: 10s

Deployed ServiceMonitors

| Target | Port | Path | Interval | Labels |
|---|---|---|---|---|
| STOA Gateway | http | /metrics | 15s | app.kubernetes.io/name: stoa-gateway |
| MCP Gateway | http | /metrics | 15s | app.kubernetes.io/name: mcp-gateway |
| Pushgateway | http | /metrics | 30s | app: pushgateway |

All ServiceMonitors use the label prometheus: kube-prometheus-stack to match the Prometheus operator's selector.

Verify Scrape Targets

Check that Prometheus discovers STOA targets:

# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090

# Open http://localhost:9090/targets and look for stoa-gateway target
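The same check can be scripted against the Prometheus HTTP API (/api/v1/targets). A minimal Python sketch, assuming the port-forward above is active; the helper names are illustrative, not part of STOA:

```python
import json
import urllib.request

def active_jobs(targets_payload: dict) -> dict:
    """Map each scrape job to its health from a /api/v1/targets response."""
    return {
        t["labels"]["job"]: t["health"]
        for t in targets_payload["data"]["activeTargets"]
    }

def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Query Prometheus for its current scrape targets."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)

# Usage, with the port-forward running:
#   for job, health in active_jobs(fetch_targets()).items():
#       print(job, health)
```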

Custom Metrics

STOA Gateway (Rust)

The gateway exposes 12 custom metric families at /metrics:

HTTP Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stoa_http_requests_total | Counter | method, path, status | Total HTTP requests |
| stoa_http_request_duration_seconds | Histogram | method, path | Request latency |

MCP Tool Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stoa_mcp_tools_calls_total | Counter | tool, tenant, status | Tool invocations |
| stoa_mcp_tool_duration_seconds | Histogram | tool, tenant, status | Tool execution latency |

SSE Connection Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stoa_mcp_sse_connections_active | Gauge | (none) | Active SSE connections |
| stoa_mcp_sse_connection_duration_seconds | Histogram | tenant | SSE session duration |
| stoa_mcp_sessions_active | Gauge | (none) | Active MCP sessions |

Rate Limiting Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stoa_rate_limit_hits_total | Counter | tenant | Rate limit rejections |
| stoa_rate_limit_buckets | Gauge | (none) | Active rate limit buckets |

Circuit Breaker Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stoa_circuit_breaker_state | Gauge | upstream | CB state: 0=closed, 1=open, 2=half-open |

Quota & Upstream Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| stoa_quota_remaining | Gauge | consumer, period | Remaining quota |
| stoa_upstream_latency_seconds | Histogram | upstream, status | Backend latency |

Histogram Buckets

All latency histograms use the same bucket boundaries:

[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0] seconds

SSE connection duration uses wider buckets:

[1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0, 600.0, 1800.0] seconds
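These boundaries matter because histogram_quantile only sees cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank, so estimates are only as precise as the bucket width. A Python sketch of that interpolation (simplified: no +Inf bucket, made-up counts; not the Prometheus implementation):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) pairs,
    interpolating linearly inside the bucket that holds the target rank."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation between the bucket's lower and upper bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# The gateway's latency boundaries with hypothetical cumulative counts
bounds = [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
counts = [5, 40, 120, 300, 600, 850, 960, 985, 995, 999, 1000, 1000]
p95 = histogram_quantile(0.95, list(zip(bounds, counts)))  # lands in the 0.1-0.25s bucket
```

Any latency between two boundaries is indistinguishable after aggregation, which is why SSE connection duration gets its own wider buckets: sessions span minutes, not milliseconds.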

SLO Recording Rules

STOA ships recording rules for SLO tracking. Apply them as a PrometheusRule:

kubectl apply -f deploy/prometheus/stoa-slo-rules.yaml -n monitoring

Availability SLO

Target: 99.9% (0.1% error budget)

# Recording rule: slo:api_availability:ratio
sum(rate(stoa_http_requests_total{status!~"5.."}[5m]))
/
sum(rate(stoa_http_requests_total[5m]))

Latency SLO

Target: P95 < 500ms

# Recording rule: slo:api_latency_p95:seconds
histogram_quantile(0.95,
  sum(rate(stoa_http_request_duration_seconds_bucket[5m])) by (le)
)

APDEX Score

Target: >= 0.85 (satisfied < 250ms, tolerating < 1s)

# Recording rule: slo:apdex:score
(
  sum(rate(stoa_http_request_duration_seconds_bucket{le="0.25"}[5m]))
  + sum(rate(stoa_http_request_duration_seconds_bucket{le="1.0"}[5m]))
)
/ (2 * sum(rate(stoa_http_request_duration_seconds_count[5m])))
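The rule works because _bucket series are cumulative: requests under 250ms appear in both terms (counted twice), tolerating requests in one (counted once), so dividing by twice the total gives the classic satisfied + tolerating/2 over total. A quick check of that arithmetic with hypothetical counts:

```python
def apdex(cum_satisfied: int, cum_tolerating: int, total: int) -> float:
    """APDEX from cumulative bucket counts (the le=0.25 and le=1.0 buckets)."""
    return (cum_satisfied + cum_tolerating) / (2 * total)

# 800 of 1000 requests under 250ms, 950 under 1s
score = apdex(800, 950, 1000)
# Classic form agrees: satisfied=800, tolerating=150 -> (800 + 150/2) / 1000 = 0.875
```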

Error Budget

# Recording rule: slo:error_budget:remaining_ratio
1 - (
  sum(increase(stoa_http_requests_total{status=~"5.."}[30d]))
  / (sum(increase(stoa_http_requests_total[30d])) * 0.001)
)
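To unpack the expression: a 99.9% target allows 0.1% of 30-day traffic to fail (the 0.001 factor), and the rule reports the fraction of that allowance still unspent. The same arithmetic in Python, with hypothetical traffic:

```python
def error_budget_remaining(errors_30d: float, total_30d: float,
                           slo_target: float = 0.999) -> float:
    """Fraction of the 30-day error budget still unspent."""
    allowed_errors = total_30d * (1 - slo_target)  # the 0.001 factor for 99.9%
    return 1 - errors_30d / allowed_errors

# 1M requests with 400 5xx responses: the budget is 1000 errors
remaining = error_budget_remaining(400, 1_000_000)  # ~0.6, i.e. 60% left
```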

Business Metrics

| Rule | Interval | Description |
|---|---|---|
| business:active_tenants:count | 5m | Count of distinct active tenants |
| business:api_calls:hourly_rate | 5m | Hourly API call rate |
| business:tool_usage:1h_rate | 5m | Hourly tool invocation rate |
| business:billable_requests_by_tenant:daily | 5m | Daily billable requests per tenant |

Alerting Rules

Gateway Alerts

groups:
  - name: stoa.stoa-gateway.rules
    rules:
      - alert: StoaGatewayHighErrorRate
        expr: |
          sum(rate(stoa_http_requests_total{status=~"5.."}[5m]))
          / sum(rate(stoa_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "STOA Gateway error rate above 5%"

      - alert: StoaGatewayHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(stoa_http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "STOA Gateway P99 latency above 2s"

      - alert: StoaGatewayDown
        expr: up{job=~".*stoa-gateway.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "STOA Gateway is down"

SLO Alerts

- alert: ErrorBudgetLow
  expr: slo:error_budget:remaining_ratio < 0.2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Error budget below 20% — slow down deploys"

- alert: ErrorBudgetExhausted
  expr: slo:error_budget:remaining_ratio < 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget exhausted — freeze changes"

- alert: SLOAvailabilityBreach
  expr: slo:api_availability_30d:ratio < 0.999
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "30-day availability below 99.9% SLO"

- alert: SLOAvailabilityBreach
expr: slo:api_availability_30d:ratio < 0.999
for: 10m
labels:
severity: critical
annotations:
summary: "30-day availability below 99.9% SLO"

Kubernetes Alerts

- alert: PodCrashLooping
  expr: |
    increase(kube_pod_container_status_restarts_total{
      namespace="stoa-system"
    }[15m]) > 3
  labels:
    severity: critical

- alert: PodHighMemory
  expr: |
    container_memory_usage_bytes{namespace="stoa-system"}
    / container_spec_memory_limit_bytes > 0.9
  for: 5m
  labels:
    severity: warning

Full Alert Inventory

| Group | Alerts | Severities |
|---|---|---|
| STOA Gateway | Error rate, latency, down, OIDC failure | critical, warning |
| Control Plane API | Error rate, latency, down | critical, warning |
| Database | Down, high connections, slow queries | critical, warning |
| Kubernetes | Pod not ready, crash loops, high memory/CPU | critical, warning |
| Disk | Space high (under 20%), space critical (under 10%), PVC | warning, critical |
| Keycloak | Down, high login failures | critical, warning |
| Redpanda | Down, consumer lag >10k | critical, warning |
| SLO | Error budget low/exhausted, availability breach | warning, critical |

Grafana Dashboards

STOA provides 12 pre-built dashboards. Import them from docker/observability/grafana/dashboards/:

| Dashboard | Purpose | Key Panels |
|---|---|---|
| Platform Overview | High-level health | Requests/sec, error rate, P95 latency, service status |
| Control Plane API | Backend performance | Endpoint latency, errors, throughput |
| MCP Gateway | Gateway metrics | Tool invocations, token consumption, SSE connections |
| SLO Dashboard | SLO compliance | APDEX score, error budget, availability over time |
| Gateway RED Method | Rate/Errors/Duration | RED method visualization per endpoint |
| Gateway Arena | Benchmark leaderboard | STOA vs Kong vs Gravitee scores |
| Service Health | Pod health | Restarts, readiness, resource usage |
| Infrastructure | Node resources | CPU, memory, network, disk per node |
| Error Tracking | Error analysis | Error categories, stack traces, trends |
| Logs Explorer | Log search | Loki-based log queries with filters |
| Token Optimization | Token usage | Consumption rate by tenant, cost projection |
| MCP Migration | Python to Rust | Shadow mode comparison, canary metrics |

Import Dashboards

# Via Grafana API
for f in docker/observability/grafana/dashboards/*.json; do
  curl -X POST "${GRAFANA_URL}/api/dashboards/db" \
    -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
done

Or use Grafana provisioning (recommended for GitOps):

# grafana-values.yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: STOA
          folder: STOA
          type: file
          options:
            path: /var/lib/grafana/dashboards
  dashboardsConfigMaps:
    stoa: stoa-grafana-dashboards

Useful PromQL Queries

Request Rate by Status

sum by (status) (rate(stoa_http_requests_total[5m]))

P95 Latency by Endpoint

histogram_quantile(0.95,
  sum by (path, le) (rate(stoa_http_request_duration_seconds_bucket[5m]))
)

Active SSE Connections Over Time

stoa_mcp_sse_connections_active

Top 10 Tools by Invocation Rate

topk(10, sum by (tool) (rate(stoa_mcp_tools_calls_total[1h])))

Circuit Breaker Status

stoa_circuit_breaker_state
# 0 = closed (healthy), 1 = open (tripped), 2 = half-open (testing)

Error Budget Burn Rate

slo:error_budget:burn_rate_1h
# Values > 1.0 mean burning faster than sustainable
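The expression behind this rule is not shown above; a burn rate is conventionally the observed error ratio divided by the ratio the SLO allows, so 1.0 means spending the budget exactly on schedule. A sketch of that convention (an assumption about how the shipped rule is defined, not its source):

```python
def burn_rate(error_ratio_1h: float, slo_target: float = 0.999) -> float:
    """Observed error ratio over the allowed ratio. At a burn rate of 14.4,
    a 30-day budget is exhausted in about two days (a common paging threshold)."""
    return error_ratio_1h / (1 - slo_target)

rate = burn_rate(0.005)  # 0.5% errors against a 0.1% allowance: ~5x burn
```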

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| No metrics from gateway | ServiceMonitor not matched | Verify the prometheus: kube-prometheus-stack label |
| Stale metrics | Pod restarted, counters reset | Expected behavior; rate() handles resets |
| Dashboard shows "No data" | Wrong datasource or namespace | Check that the Grafana datasource points to the correct Prometheus |
| Alerts not firing | PrometheusRule not applied | kubectl get prometheusrule -n monitoring |
| High cardinality warning | Too many unique label values | Reduce path label cardinality with route grouping |
| Grafana SSO not working | Missing Keycloak client | Create the stoa-observability client (see Keycloak Admin) |
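For the high-cardinality row: every distinct label combination is a separate time series, so an unbounded path label (request IDs, UUIDs in URLs) multiplies series counts until Prometheus complains. Route grouping collapses those segments before labeling; a Python sketch (the patterns are illustrative, STOA's actual grouping may differ):

```python
import re

def route_group(path: str) -> str:
    """Collapse unbounded URL segments so the path label stays low-cardinality."""
    # UUID-like segments first, then plain numeric IDs
    path = re.sub(
        r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)",
        "/:uuid", path)
    path = re.sub(r"/\d+(?=/|$)", "/:id", path)
    return path

print(route_group("/api/tenants/42/tools/9"))  # /api/tenants/:id/tools/:id
```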