ADR-023: Zero Blind Spot Observability
Status: Accepted
Date: 2026-01-26
Decision Makers: Christophe ABOULICAM
Related Tickets: CAB-960, CAB-961, CAB-962, CAB-397
Context
The Problem
Real-world incident at a major financial institution:
- Intermittent 502 errors affecting 5% of users
- Root cause: Corporate proxy (Cisco) degradation
- Zero visibility because proxy had no instrumentation
- Diagnosis required: Manual dump + vendor ticket + weeks of analysis
This is the "blind spot" problem that plagues traditional API management:
- Errors occur outside the gateway's visibility
- Impact is diffuse (affects subset of users)
- Correlation requires manual effort across multiple systems
- MTTR measured in hours or days
The Anti-Pattern
┌──────────┐     ┌──────────────┐     ┌─────────┐
│  Client  │──▶──│   Proxy/LB   │──▶──│ Gateway │──▶── Backend
└──────────┘     └──────────────┘     └─────────┘
                        │
                  ❌ NO METRICS
                  ❌ NO TRACES
                  ❌ NO CORRELATION
Many traditional gateways treat observability as opt-in:
- Plugins must be enabled
- Configuration is fragmented (metrics ≠ traces ≠ logs)
- Can be disabled by misconfiguration
- No visibility into intermediate hops
Decision
The Principle
"If a request passes through STOA, we know what happened — by design."
Observability in STOA is not a feature — it's a property of the system.
Implementation
1. Observability is NATIVE (not opt-in)
Every request through STOA Gateway automatically generates:
- Metrics (Prometheus) — latency, status codes, throughput
- Traces (OpenTelemetry) — distributed tracing with span context
- Logs (structured JSON) — request/response metadata
- Error Snapshots — full context capture on failures
// Pseudo-code: every request handler
fn handle_request(req: Request) -> Response {
    // Trace ID is MANDATORY: propagate the incoming header or generate one
    let trace_id = req.header("X-Trace-ID")
        .unwrap_or_else(|| generate_trace_id());
    let span = tracer.start_span("gateway.request")
        .with_trace_id(trace_id);

    // Metrics are AUTOMATIC
    let timer = metrics.request_duration.start_timer();

    // Processing (borrows the request so it stays available for the snapshot)
    let response = process(&req);

    // Error Snapshot on failure (AUTOMATIC)
    if response.status >= 400 {
        capture_error_snapshot(&req, &response, &span);
    }

    timer.observe_duration();
    response
}
2. Zero Configuration Required
| Traditional Approach | STOA Approach |
|---|---|
| Observability plugins must be enabled | Native, active by default |
| Separate config for metrics/traces/logs | Unified by default |
| Can be disabled | Core, non-disableable |
| Requires tuning | Works out-of-the-box |
3. Trace ID is Mandatory
- If the X-Trace-ID header is present → propagated
- Otherwise → auto-generated (UUID v4)
- By design, every request receives a correlation ID
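The propagate-or-generate rule above can be sketched as follows. This is an illustration under stated assumptions, not STOA's actual code: the names `resolve_trace_id` and `format_uuid_v4` are hypothetical, the caller is assumed to supply 128 random bits (in practice a crate such as `uuid` or an OS random source would do this), and treating an empty or whitespace-only header as missing is an assumption this ADR does not state.

```rust
/// Formats 128 random bits as a version-4 UUID string (RFC 4122 layout).
/// Hypothetical helper: the random bits are supplied by the caller.
fn format_uuid_v4(random_bits: u128) -> String {
    let mut bytes = random_bits.to_be_bytes();
    bytes[6] = (bytes[6] & 0x0f) | 0x40; // set the version nibble to 4
    bytes[8] = (bytes[8] & 0x3f) | 0x80; // set the variant bits to 10xx
    let hex: String = bytes.iter().map(|b| format!("{:02x}", b)).collect();
    format!(
        "{}-{}-{}-{}-{}",
        &hex[0..8],
        &hex[8..12],
        &hex[12..16],
        &hex[16..20],
        &hex[20..32]
    )
}

/// Returns the incoming X-Trace-ID value if present, otherwise a fresh UUID v4.
/// Assumption: a blank header is treated the same as a missing one.
fn resolve_trace_id(header: Option<&str>, random_bits: u128) -> String {
    match header.map(str::trim) {
        Some(id) if !id.is_empty() => id.to_string(),
        _ => format_uuid_v4(random_bits),
    }
}
```

Passing the random bits in as a parameter keeps the sketch dependency-free and deterministic under test; a real handler would draw them from a proper random source on every request.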
4. Timing Inference for Blind Spots
Even when intermediate hops (proxies, load balancers) have no instrumentation:
T_network = T_total - T_gateway - T_backend
If T_network > baseline + 3σ → Network anomaly detected
# stoa.yaml
apis:
  - name: payment-api
    upstream: https://backend.internal
    networkPath:
      hops:
        - name: corporate-proxy
          address: proxy.corp.local:8080
          monitoring:
            enabled: true  # DEFAULT
            alertOnDegradation: true
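A minimal sketch of the timing inference rule in this section. It assumes the baseline is the mean of recent T_network samples and σ is their population standard deviation; the ADR does not define either, so both are assumptions. The function names (`network_time`, `baseline_stats`, `is_network_anomaly`) are hypothetical and not part of STOA.

```rust
use std::time::Duration;

/// T_network = T_total - T_gateway - T_backend, clamped at zero so that
/// clock jitter can never produce a negative duration.
fn network_time(total: Duration, gateway: Duration, backend: Duration) -> Duration {
    total.saturating_sub(gateway).saturating_sub(backend)
}

/// Mean and population standard deviation of historical samples (milliseconds).
/// Returns None when there is no history to build a baseline from.
fn baseline_stats(samples_ms: &[f64]) -> Option<(f64, f64)> {
    if samples_ms.is_empty() {
        return None;
    }
    let n = samples_ms.len() as f64;
    let mean = samples_ms.iter().sum::<f64>() / n;
    let variance = samples_ms.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some((mean, variance.sqrt()))
}

/// Flags a sample as anomalous when it exceeds baseline + 3 sigma.
/// Assumption: with no history, nothing is flagged.
fn is_network_anomaly(t_network_ms: f64, history_ms: &[f64]) -> bool {
    match baseline_stats(history_ms) {
        Some((mean, sigma)) => t_network_ms > mean + 3.0 * sigma,
        None => false,
    }
}
```

For example, with a history of 10, 12, 8, 10 and 10 ms, the mean is 10 ms and σ is about 1.26 ms, so the threshold sits near 13.8 ms: a 30 ms sample is flagged and an 11 ms sample is not.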
Consequences
Positive
- MTTR: Minutes, not hours
  - Full context available immediately on any error
  - No need to search across multiple systems
- Self-Diagnostic capability
  - System can identify probable root cause automatically
  - Alerts include actionable context
- Competitive differentiation
  - Deep native observability as a core differentiator
  - Resonates strongly with enterprise ops teams
- Trust building
  - Designed so operators have full context on every request
  - Reduces finger-pointing between teams
Negative
- Storage costs
  - Error Snapshots require S3/MinIO storage
  - Mitigated by retention policies and compression
- Performance overhead
  - Instrumentation is designed to add minimal overhead per request
  - Tracing is sampled in high-volume scenarios
- Complexity
  - More components to maintain (Prometheus, Loki, OpenTelemetry collector)
  - Mitigated by Helm charts with sane defaults
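One way trace sampling in high-volume scenarios could work is to decide deterministically from the trace ID, so every hop that sees the same ID makes the same keep-or-drop decision. The sketch below is an assumption about the approach, not a description of STOA's sampler: `should_sample`, the 10,000-bucket granularity, and the choice of FNV-1a as the hash are all illustrative.

```rust
/// FNV-1a 64-bit hash: small, dependency-free, and stable across runs.
fn fnv1a_64(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Decides whether to keep the trace for `trace_id`.
/// `sample_rate` is the fraction of traces to keep and is clamped to 0.0..=1.0.
fn should_sample(trace_id: &str, sample_rate: f64) -> bool {
    let rate = sample_rate.clamp(0.0, 1.0);
    // Map the hash onto 10_000 buckets and keep the lowest `rate` share of them.
    let bucket = fnv1a_64(trace_id.as_bytes()) % 10_000;
    (bucket as f64) < rate * 10_000.0
}
```

Sampling here applies to traces only. The success metrics in this ADR call for all 5xx errors to be captured in Error Snapshots, so a sampler like this would not gate snapshot capture.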
Implementation Roadmap
| Phase | Ticket | Component | Priority |
|---|---|---|---|
| 1 | CAB-960 | Observability by Default (Architecture) | P0 |
| 2 | CAB-397 | Error Snapshot / Flight Recorder | P1 ✅ Done |
| 3 | CAB-961 | Self-Diagnostic Engine | P1 |
| 4 | CAB-962 | Intermediate Hop Detection | P2 |
Validation Criteria
Success Metrics
- Any error can be diagnosed in <5 minutes with native STOA data
- By design, all requests have a trace ID in production
- All 5xx errors captured in Error Snapshots by default
- Network anomalies detected within 30 seconds
Test Scenarios
- Backend timeout → Error Snapshot captures full context
- Proxy degradation → Timing inference detects anomaly
- Intermittent failures → Pattern analysis identifies scope
- Cross-tenant correlation → Self-diagnostic pinpoints root cause
References
- CAB-960: Observability by Default
- CAB-961: Self-Diagnostic Engine
- CAB-962: Intermediate Hop Detection
- CAB-397: Error Snapshot
- Inspiration: AWS X-Ray Insights, Datadog Watchdog, Honeycomb BubbleUp
Quotes
"Observability is not a feature, it is a property of the system."
"With a traditional gateway, a 502 error can require searching through several dashboards. With STOA, the full context is available in one click."
Document generated from production incident analysis — Proxy blind spot case (January 2026)