ADR-023: Zero Blind Spot Observability
Status: Accepted
Date: 2026-01-26
Decision Makers: Christophe ABOULICAM
Related Tickets: CAB-960, CAB-961, CAB-962, CAB-397
Context
The Problem
Real-world incident at a major financial institution:
- Intermittent 502 errors affecting 5% of users
- Root cause: Corporate proxy (Cisco) degradation
- Zero visibility because proxy had no instrumentation
- Diagnosis required: Manual dump + vendor ticket + weeks of analysis
This is the "blind spot" problem that plagues traditional API management:
- Errors occur outside the gateway's visibility
- Impact is diffuse (affects subset of users)
- Correlation requires manual effort across multiple systems
- MTTR measured in hours or days
The Anti-Pattern
┌────────┐    ┌──────────┐    ┌─────────┐
│ Client │───▶│ Proxy/LB │───▶│ Gateway │───▶ Backend
└────────┘    └──────────┘    └─────────┘
                   │
                   │ NO METRICS
                   │ NO TRACES
                   │ NO CORRELATION
Many traditional gateways treat observability as opt-in:
- Plugins must be enabled
- Configuration is fragmented across metrics, traces, and logs
- Can be disabled by misconfiguration
- No visibility into intermediate hops
Decision
The Principle
"If a request passes through STOA, we know what happened, by design."
Observability in STOA is not a feature; it is a property of the system.
Implementation
1. Observability is NATIVE (not opt-in)
Every request through STOA Gateway automatically generates:
- Metrics (Prometheus): latency, status codes, throughput
- Traces (OpenTelemetry): distributed tracing with span context
- Logs (structured JSON): request/response metadata
- Error Snapshots: full context capture on failures
// Pseudo-code: every request handler follows this shape
fn handle_request(req: Request) -> Response {
    // Trace ID is MANDATORY: propagate the incoming one or generate a new one
    let trace_id = req
        .header("X-Trace-ID")
        .unwrap_or_else(generate_trace_id);
    let span = tracer.start_span("gateway.request").with_trace_id(trace_id);

    // Metrics are AUTOMATIC
    let timer = metrics.request_duration.start_timer();

    // Processing (borrow the request so it stays available for the snapshot)
    let response = process(&req);

    // Error Snapshot on failure (AUTOMATIC)
    if response.status >= 400 {
        capture_error_snapshot(&req, &response, &span);
    }

    timer.observe_duration();
    response
}
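The handler above is pseudo-code. As a rough, self-contained illustration of the same idea, the sketch below wraps any processing function so that timing and error capture cannot be skipped. The `Telemetry` and `ErrorSnapshot` types, their fields, and the `generated-N` fallback ID are hypothetical stand-ins for the real Prometheus, OpenTelemetry, and snapshot sinks; they are assumptions for illustration, not STOA's actual API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Request {
    path: String,
    headers: HashMap<String, String>,
}

struct Response {
    status: u16,
}

/// Context captured when a request fails (hypothetical shape).
struct ErrorSnapshot {
    trace_id: String,
    path: String,
    status: u16,
    elapsed: Duration,
}

/// In-memory stand-in for the metrics and snapshot backends.
#[derive(Default)]
struct Telemetry {
    durations: Vec<Duration>,
    snapshots: Vec<ErrorSnapshot>,
}

/// Runs `process` with instrumentation that the caller cannot opt out of.
fn handle_request<F>(req: &Request, telemetry: &mut Telemetry, process: F) -> Response
where
    F: FnOnce(&Request) -> Response,
{
    // Propagate the incoming trace ID, or fall back to a placeholder.
    // ASSUMPTION: the ADR specifies UUID v4 here; a counter keeps this sketch short.
    let trace_id = req
        .headers
        .get("X-Trace-ID")
        .cloned()
        .unwrap_or_else(|| format!("generated-{}", telemetry.durations.len()));

    let started = Instant::now();
    let response = process(req);
    let elapsed = started.elapsed();

    // A duration is recorded for every request, successful or not.
    telemetry.durations.push(elapsed);

    // A snapshot is captured for every 4xx/5xx response.
    if response.status >= 400 {
        telemetry.snapshots.push(ErrorSnapshot {
            trace_id,
            path: req.path.clone(),
            status: response.status,
            elapsed,
        });
    }
    response
}
```

Because the business logic is passed in as a closure, there is no code path that reaches the backend without going through the timing and snapshot steps, which is one way to read "native, not opt-in".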
2. Zero Configuration Required
| Traditional Approach | STOA Approach |
|---|---|
| Observability plugins must be enabled | Native, active by default |
| Separate config for metrics/traces/logs | Unified by default |
| Can be disabled | Core, cannot be disabled |
| Requires tuning | Works out-of-the-box |
3. Trace ID is Mandatory
- If the X-Trace-ID header is present, it is propagated
- Otherwise, a trace ID is auto-generated (UUID v4)
- By design, every request receives a correlation ID
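The propagate-or-generate rule can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function names are hypothetical, an empty header value is treated as absent (the ADR does not say), and the random bits come from `RandomState` as a stand-in for a proper random source. A production gateway would more likely rely on a dedicated UUID library.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

/// Header carrying the correlation ID (name taken from the ADR).
const TRACE_HEADER: &str = "X-Trace-ID";

/// Returns 64 pseudo-random bits using only the standard library.
/// ASSUMPTION: stand-in for a real RNG, adequate for a sketch only.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Formats 128 pseudo-random bits using the UUID v4 layout.
fn generate_trace_id() -> String {
    // Force the version nibble to 4 and the variant bits to 0b10.
    let hi = (random_u64() & 0xFFFF_FFFF_FFFF_0FFF) | 0x0000_0000_0000_4000;
    let lo = (random_u64() & 0x3FFF_FFFF_FFFF_FFFF) | 0x8000_0000_0000_0000;
    format!(
        "{:08x}-{:04x}-{:04x}-{:04x}-{:012x}",
        hi >> 32,
        (hi >> 16) & 0xFFFF,
        hi & 0xFFFF,
        lo >> 48,
        lo & 0xFFFF_FFFF_FFFF
    )
}

/// Propagates the incoming trace ID if present, otherwise generates one.
/// Header names are matched case-insensitively, as HTTP requires.
fn resolve_trace_id(headers: &HashMap<String, String>) -> String {
    headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(TRACE_HEADER))
        .map(|(_, value)| value.trim())
        // ASSUMPTION: a blank value counts as "no trace ID".
        .filter(|value| !value.is_empty())
        .map(str::to_owned)
        .unwrap_or_else(generate_trace_id)
}
```

Since `resolve_trace_id` always returns a `String` rather than an `Option`, downstream code never has to handle a request without a correlation ID.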
4. Timing Inference for Blind Spots
Even when intermediate hops (proxies, load balancers) have no instrumentation, the time spent in them can be inferred from the timings that are measured:
T_network = T_total - T_gateway - T_backend
If T_network > baseline + 3σ → Network anomaly detected
Intermediate hops can be declared per API so that they are monitored by default:
# stoa.yaml
apis:
  - name: payment-api
    upstream: https://backend.internal
    networkPath:
      hops:
        - name: corporate-proxy
          address: proxy.corp.local:8080
      monitoring:
        enabled: true # DEFAULT
        alertOnDegradation: true
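The timing inference rule can be sketched as follows. This is an illustration under assumptions, not the detection engine itself: the type and method names are hypothetical, the baseline is taken as the mean of recent T_network samples with the sample standard deviation as σ, and negative differences are clamped to zero. How T_total is measured is not specified in this ADR, so it is simply an input here.

```rust
/// Timings for one request, in milliseconds.
#[derive(Debug, Clone, Copy)]
struct RequestTiming {
    total_ms: f64,   // T_total
    gateway_ms: f64, // T_gateway
    backend_ms: f64, // T_backend
}

impl RequestTiming {
    /// T_network = T_total - T_gateway - T_backend.
    /// ASSUMPTION: clock jitter can make this slightly negative, so clamp at zero.
    fn network_ms(&self) -> f64 {
        (self.total_ms - self.gateway_ms - self.backend_ms).max(0.0)
    }
}

/// Baseline statistics over recent T_network samples.
struct Baseline {
    mean: f64,
    std_dev: f64,
}

impl Baseline {
    /// Builds a baseline from at least two samples (sample standard deviation).
    fn from_samples(samples: &[f64]) -> Option<Self> {
        if samples.len() < 2 {
            return None;
        }
        let n = samples.len() as f64;
        let mean = samples.iter().sum::<f64>() / n;
        let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
        Some(Self { mean, std_dev: variance.sqrt() })
    }

    /// Flags a value above baseline + 3σ as a network anomaly.
    fn is_anomalous(&self, network_ms: f64) -> bool {
        network_ms > self.mean + 3.0 * self.std_dev
    }
}
```

For example, with baseline samples averaging 11 ms and σ of roughly 1.3 ms, a request with T_total = 150, T_gateway = 5, and T_backend = 100 leaves 45 ms unaccounted for and is flagged, even though neither the gateway nor the backend was slow.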
Consequences
Positive
- MTTR: Minutes, not hours
  - Full context available immediately on any error
  - No need to search across multiple systems
- Self-Diagnostic capability
  - System can identify probable root cause automatically
  - Alerts include actionable context
- Competitive differentiation
  - Deep native observability as a core differentiator
  - Resonates strongly with enterprise ops teams
- Trust building
  - Designed so operators have full context on every request
  - Reduces finger-pointing between teams
Negative
- Storage costs
  - Error Snapshots require S3/MinIO storage
  - Mitigated by retention policies and compression
- Performance overhead
  - Instrumentation is designed to add minimal overhead per request
  - Tracing is sampled in high-volume scenarios
- Complexity
  - More components to maintain (Prometheus, Loki, OpenTelemetry collector)
  - Mitigated by Helm charts with sane defaults
Implementation Roadmap
| Phase | Ticket | Component | Priority |
|---|---|---|---|
| 1 | CAB-960 | Observability by Default (Architecture) | P0 |
| 2 | CAB-397 | Error Snapshot / Flight Recorder | P1 (Done) |
| 3 | CAB-961 | Self-Diagnostic Engine | P1 |
| 4 | CAB-962 | Intermediate Hop Detection | P2 |
Validation Criteria
Success Metrics
- Any error can be diagnosed in <5 minutes with native STOA data
- By design, all requests have a trace ID in production
- All 5xx errors captured in Error Snapshots by default
- Network anomalies detected within 30 seconds
Test Scenarios
- Backend timeout → Error Snapshot captures full context
- Proxy degradation → Timing inference detects anomaly
- Intermittent failures → Pattern analysis identifies scope
- Cross-tenant correlation → Self-diagnostic pinpoints root cause
References
- CAB-960: Observability by Default
- CAB-961: Self-Diagnostic Engine
- CAB-962: Intermediate Hop Detection
- CAB-397: Error Snapshot
- Inspiration: AWS X-Ray Insights, Datadog Watchdog, Honeycomb BubbleUp
Quotes
"Observability is not a feature, it is a property of the system."
"With a traditional gateway, a 502 error can mean searching through several dashboards. With STOA, the full context is available in one click."
Document generated from production incident analysis: Proxy blind spot case (January 2026)