
ADR-023: Zero Blind Spot Observability

Status: Accepted
Date: 2026-01-26
Decision Makers: Christophe ABOULICAM
Related Tickets: CAB-960, CAB-961, CAB-962, CAB-397


Context

The Problem

Real-world incident at a major financial institution:

  • Intermittent 502 errors affecting 5% of users
  • Root cause: Corporate proxy (Cisco) degradation
  • Zero visibility because proxy had no instrumentation
  • Diagnosis required: Manual dump + vendor ticket + weeks of analysis

This is the "blind spot" problem that plagues traditional API management:

  • Errors occur outside the gateway's visibility
  • Impact is diffuse (affects subset of users)
  • Correlation requires manual effort across multiple systems
  • MTTR measured in hours or days

The Anti-Pattern

```text
┌──────────┐     ┌──────────┐     ┌─────────┐
│  Client  │──▶──│ Proxy/LB │──▶──│ Gateway │──▶── Backend
└──────────┘     └──────────┘     └─────────┘
                      │
               ❌ NO METRICS
               ❌ NO TRACES
               ❌ NO CORRELATION
```

Many traditional gateways treat observability as opt-in:

  • Plugins must be enabled
  • Configuration is fragmented (metrics β‰  traces β‰  logs)
  • Can be disabled by misconfiguration
  • No visibility into intermediate hops

Decision

The Principle

"If a request passes through STOA, we know what happened β€” by design."

Observability in STOA is not a feature β€” it's a property of the system.

Implementation

1. Observability is NATIVE (not opt-in)

Every request through STOA Gateway automatically generates:

  • Metrics (Prometheus) β€” latency, status codes, throughput
  • Traces (OpenTelemetry) β€” distributed tracing with span context
  • Logs (structured JSON) β€” request/response metadata
  • Error Snapshots β€” full context capture on failures
```rust
// Pseudo-code: every request handler
fn handle_request(req: Request) -> Response {
    // Trace ID is MANDATORY
    let trace_id = req.header("X-Trace-ID")
        .unwrap_or_else(|| generate_trace_id());

    let span = tracer.start_span("gateway.request")
        .with_trace_id(trace_id);

    // Metrics are AUTOMATIC
    let timer = metrics.request_duration.start_timer();

    // Processing...
    let response = process(req);

    // Error Snapshot on failure (AUTOMATIC)
    if response.status >= 400 {
        capture_error_snapshot(req, response, span);
    }

    timer.observe_duration();
    response
}
```
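The `capture_error_snapshot` step above can be illustrated with a minimal sketch (Python for brevity; the field names and helper signature are illustrative assumptions, not STOA's actual snapshot schema):

```python
import json
import time
import uuid

def capture_error_snapshot(request_meta: dict, status: int, trace_id: str) -> str:
    """Assemble the full-context payload persisted on failure.

    Illustrative only: the real snapshot format and its storage backend
    (S3/MinIO per this ADR) are not specified here.
    """
    snapshot = {
        "snapshot_id": str(uuid.uuid4()),
        "trace_id": trace_id,
        "captured_at": time.time(),
        "status": status,
        "request": request_meta,  # method, path, headers, upstream, timings...
    }
    return json.dumps(snapshot)
```

A snapshot like this, keyed by trace ID, is what lets an operator reconstruct a failed request without querying several systems.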

2. Zero Configuration Required

| Traditional Approach | STOA Approach |
| --- | --- |
| Plugin-based observability, must be enabled | Native, active by default |
| Separate config for metrics/traces/logs | Unified by default |
| Can be disabled | Core, non-disableable |
| Requires tuning | Works out-of-the-box |

3. Trace ID is Mandatory

  • If X-Trace-ID header present β†’ propagated
  • Otherwise β†’ auto-generated (UUID v4)
  • By design, every request receives a correlation ID
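The propagate-or-generate rule can be sketched as follows (a minimal illustration; `ensure_trace_id` and the plain-dict header model are assumptions, not STOA's API):

```python
import uuid

def ensure_trace_id(headers: dict) -> str:
    """Return the request's correlation ID, minting a UUID v4 when absent.

    Hypothetical helper: STOA's real header handling is not shown here.
    """
    trace_id = headers.get("X-Trace-ID")
    if not trace_id:
        trace_id = str(uuid.uuid4())
        headers["X-Trace-ID"] = trace_id  # propagate to downstream hops
    return trace_id
```

Either way, the request leaves the gateway carrying an `X-Trace-ID`, which is what makes later correlation unconditional rather than best-effort.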

4. Timing Inference for Blind Spots

Even when intermediate hops (proxies, load balancers) have no instrumentation:

T_network = T_total - T_gateway - T_backend

If T_network > baseline + 3Οƒ β†’ Network anomaly detected
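The inference above can be sketched numerically (a toy illustration; in a real gateway the baseline would be a streaming window, not a static list):

```python
import statistics

def infer_network_time(t_total: float, t_gateway: float, t_backend: float) -> float:
    """Time unaccounted for by instrumented hops: the potential blind spot."""
    return t_total - t_gateway - t_backend

def is_network_anomaly(t_network: float, baseline: list[float]) -> bool:
    """Flag t_network when it exceeds the baseline mean by more than 3 sigma."""
    mean = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return t_network > mean + 3 * sigma
```

For example, with a baseline hovering around 5 ms, a request whose unaccounted time jumps to 30 ms points at the uninstrumented proxy even though the proxy itself emits nothing.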

```yaml
# stoa.yaml
apis:
  - name: payment-api
    upstream: https://backend.internal

    networkPath:
      hops:
        - name: corporate-proxy
          address: proxy.corp.local:8080
          monitoring:
            enabled: true # DEFAULT
            alertOnDegradation: true
```

Consequences

Positive

  1. MTTR: Minutes, not hours

    • Full context available immediately on any error
    • No need to search across multiple systems
  2. Self-Diagnostic capability

    • System can identify probable root cause automatically
    • Alerts include actionable context
  3. Competitive differentiation

    • Deep native observability as a core differentiator
    • Resonates strongly with enterprise ops teams
  4. Trust building

    • Designed so operators have full context on every request
    • Reduces finger-pointing between teams

Negative

  1. Storage costs

    • Error Snapshots require S3/MinIO storage
    • Mitigated by retention policies and compression
  2. Performance overhead

    • Instrumentation is designed to add minimal per-request overhead
    • Tracing is sampled in high-volume scenarios
  3. Complexity

    • More components to maintain (Prometheus, Loki, OpenTelemetry collector)
    • Mitigated by Helm charts with sane defaults

Implementation Roadmap

| Phase | Ticket | Component | Priority |
| --- | --- | --- | --- |
| 1 | CAB-960 | Observability by Default (Architecture) | P0 |
| 2 | CAB-397 | Error Snapshot / Flight Recorder | P1 ✅ Done |
| 3 | CAB-961 | Self-Diagnostic Engine | P1 |
| 4 | CAB-962 | Intermediate Hop Detection | P2 |

Validation Criteria

Success Metrics

  1. Any error can be diagnosed in <5 minutes with native STOA data
  2. By design, all requests have a trace ID in production
  3. All 5xx errors captured in Error Snapshots by default
  4. Network anomalies detected within 30 seconds

Test Scenarios

  1. Backend timeout β†’ Error Snapshot captures full context
  2. Proxy degradation β†’ Timing inference detects anomaly
  3. Intermittent failures β†’ Pattern analysis identifies scope
  4. Cross-tenant correlation β†’ Self-diagnostic pinpoints root cause

Quotes

"Observability is not a feature; it is a property of the system."

"With a traditional gateway, a 502 error can mean digging through several dashboards. With STOA, the full context is available in one click."


Document generated from production incident analysis β€” Proxy blind spot case (January 2026)