# ADR-009: Error Snapshots – Time-Travel Debugging with PII Masking

## Metadata
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-02-06 |
| Linear | CAB-397 |
## Context
When MCP tool invocations fail, debugging requires understanding the complete context:
- What was the request?
- What user/tenant initiated it?
- What LLM context existed?
- Were there retries?
- What was the cost impact?
Traditional logging captures only fragments, forcing developers to correlate logs across services by hand. This is especially problematic for:
- Intermittent failures
- Rate limit breaches
- Backend timeouts
- Schema validation errors
## The Problem

> "An AI agent failed mid-conversation. The user reports it 2 hours later. How do we reconstruct exactly what happened?"
## Decision

Implement **Error Snapshots**: complete point-in-time captures of failed requests, with automatic PII masking and cURL replay generation.
## Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│                            MCP Gateway                             │
│                                                                    │
│  ┌────────────┐   ┌────────────────┐   ┌────────────────────────┐  │
│  │ Tool       │──▶│ Error Handler  │──▶│ capture_mcp_error()    │  │
│  │ Invocation │   │                │   │                        │  │
│  │ (fails)    │   │ status >= 400  │   │ - Build context        │  │
│  └────────────┘   └────────────────┘   │ - Mask PII             │  │
│                                        │ - Calculate cost       │  │
│                                        │ - Publish async        │  │
│                                        └───────────┬────────────┘  │
└────────────────────────────────────────────────────┼───────────────┘
                                                     │
                                                     ▼
┌────────────────────────────────────────────────────────────────────┐
│                         Snapshot Publisher                         │
│                                                                    │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐   │
│  │ Kafka Topic     │   │ MinIO/S3        │   │ OpenSearch      │   │
│  │ error-snapshots │   │ (compressed)    │   │ (indexed)       │   │
│  └─────────────────┘   └─────────────────┘   └─────────────────┘   │
└────────────────────────────────────────────────────────────────────┘
```
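The capture path in the diagram can be sketched as follows. The `capture_mcp_error` signature, the request dict shape, and the injected `publish` callback are illustrative assumptions; the real handler builds the full snapshot model and publishes to Kafka asynchronously.

```python
import uuid
from datetime import datetime, timezone


def capture_mcp_error(request: dict, response_status: int, publish) -> dict:
    """Sketch of the capture path: build context, mask sensitive headers,
    then hand the snapshot to an async publisher (injected here as `publish`).
    Field names follow the snapshot model; masking is reduced to one header."""
    snapshot = {
        "snapshot_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "response_status": response_status,
        "request": {
            "method": request["method"],
            "path": request["path"],
            # Minimal masking stand-in: redact the Authorization header only
            "headers": {
                k: "[REDACTED]" if k.lower() == "authorization" else v
                for k, v in request["headers"].items()
            },
        },
    }
    publish(snapshot)  # fire-and-forget in the real gateway (Kafka)
    return snapshot
```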
## Snapshot Model

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class MCPErrorSnapshot:
    # --- Identification ---
    snapshot_id: str                        # UUID
    timestamp: datetime
    environment: str                        # dev, staging, prod

    # --- Error Details ---
    error_type: MCPErrorType                # Enum: TOOL_EXECUTION, RATE_LIMIT, AUTH, etc.
    error_message: str                      # Masked

    # --- Request Context ---
    request: RequestContext                 # Method, path, headers (masked), body
    response_status: int

    # --- User Context ---
    user: UserContext                       # tenant_id, user_id (hashed), roles

    # --- MCP Context ---
    mcp_server: MCPServerContext | None     # Server ID, protocol version
    tool_invocation: ToolInvocation | None  # Tool name, params (masked)
    llm_context: LLMContext | None          # Tokens, model, estimated cost

    # --- Retry Context ---
    retry_context: RetryContext | None      # Attempt count, backoff

    # --- Observability ---
    trace_id: str | None
    span_id: str | None
    conversation_id: str | None

    # --- Cost Impact ---
    total_cost_usd: float
    tokens_wasted: int

    # --- PII Tracking ---
    masked_fields: list[str]                # Fields that were masked
```
## Error Types

```python
from enum import Enum


class MCPErrorType(Enum):
    TOOL_EXECUTION = "tool_execution"  # Tool failed during execution
    TOOL_NOT_FOUND = "tool_not_found"  # Tool doesn't exist
    VALIDATION = "validation"          # Input validation failed
    AUTH = "auth"                      # Authentication/authorization
    RATE_LIMIT = "rate_limit"          # Rate limit exceeded
    TIMEOUT = "timeout"                # Backend timeout
    UPSTREAM = "upstream"              # Backend returned error
    INTERNAL = "internal"              # MCP Gateway internal error
```
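How the error handler maps a failure to one of these types is not specified in this ADR; a plausible classifier keyed on HTTP status might look like the sketch below. The specific status-to-type choices are assumptions; the real handler likely also inspects the exception type.

```python
def classify_error(status: int) -> str:
    """Hypothetical mapping from an HTTP status code to an
    MCPErrorType value (sketch, not the gateway's actual logic)."""
    if status in (401, 403):
        return "auth"
    if status == 404:
        return "tool_not_found"
    if status == 422:
        return "validation"
    if status == 429:
        return "rate_limit"
    if status == 504:
        return "timeout"
    if 500 <= status < 600:
        # 502 signals a backend failure; other 5xx are treated as internal
        return "upstream" if status == 502 else "internal"
    return "tool_execution"
```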
## PII Masking Strategy

### Automatic Masking

Three masking layers are applied before storage:

```python
# 1. Headers
masked_headers = mask_headers(request.headers)
# Authorization: Bearer eyJ... → Authorization: [REDACTED]
# X-API-Key: sk-... → X-API-Key: [REDACTED]

# 2. Tool Parameters
masked_params = mask_tool_params(tool.input_params)
# {"password": "secret123"} → {"password": "[REDACTED]"}
# {"email": "user@example.com"} → {"email": "[REDACTED]"}

# 3. Error Messages
masked_message = mask_error_message(error_message)
# "Invalid token for user john@..." → "Invalid token for user [REDACTED]"
```
### Sensitive Fields Configuration

```python
settings = MCPSnapshotSettings(
    sensitive_headers=[
        "authorization",
        "x-api-key",
        "cookie",
        "x-client-secret",
    ],
    sensitive_params=[
        "password",
        "secret",
        "token",
        "api_key",
        "email",
        "phone",
        "ssn",
        "credit_card",
    ],
)
```
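The ADR does not show the masker itself. A minimal sketch of what `mask_tool_params` might do, assuming it traverses nested dicts and lists and that `sensitive` is a set of lowercase key names:

```python
def mask_tool_params(params, sensitive: set[str]):
    """Recursively replace values whose key matches a sensitive name.
    Sketch only: the production masker may also pattern-match values."""
    if isinstance(params, dict):
        return {
            k: "[REDACTED]" if k.lower() in sensitive
            else mask_tool_params(v, sensitive)
            for k, v in params.items()
        }
    if isinstance(params, list):
        return [mask_tool_params(v, sensitive) for v in params]
    return params  # scalars pass through unchanged
```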
## cURL Replay Generation

Each snapshot includes a cURL command to replay the request:

```python
import json
import shlex


def generate_curl_command(snapshot: MCPErrorSnapshot) -> str:
    """Generate a cURL command for replay (secrets replaced with placeholders)."""
    cmd = f"curl -X {snapshot.request.method} {shlex.quote(snapshot.request.url)}"
    for header, value in snapshot.request.headers.items():
        # Uses the sensitive_headers list from MCPSnapshotSettings
        if header.lower() in settings.sensitive_headers:
            placeholder = header.upper().replace("-", "_")
            cmd += f" -H '{header}: ${{YOUR_{placeholder}_HERE}}'"
        else:
            cmd += f" -H {shlex.quote(f'{header}: {value}')}"
    if snapshot.request.body:
        cmd += f" -d {shlex.quote(json.dumps(snapshot.request.body))}"
    return cmd
```
Example output:

```bash
curl -X POST '${STOA_GATEWAY_URL}/tools/call' \
  -H 'Authorization: ${YOUR_AUTHORIZATION_HERE}' \
  -H 'Content-Type: application/json' \
  -d '{"tool": "acme:payment-api:create", "arguments": {...}}'
```
## Storage Strategy

### Partitioning

Snapshots are partitioned by date and tenant:

```
s3://stoa-error-snapshots/
└── 2026/
    └── 02/
        └── 06/
            ├── tenant-acme/
            │   ├── snap_abc123.json.gz
            │   └── snap_def456.json.gz
            └── tenant-beta/
                └── snap_xyz789.json.gz
```
### Retention
| Environment | Retention | Rationale |
|---|---|---|
| Production | 30 days | Compliance, debugging |
| Staging | 7 days | Testing |
| Development | 1 day | Cost savings |
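The retention tiers above could be enforced with S3 lifecycle rules. A sketch of generating one rule per environment; the rule IDs and the `environment` tag filter are assumptions, not part of the ADR:

```python
# Retention days per environment, matching the table above
RETENTION_DAYS = {"prod": 30, "staging": 7, "dev": 1}


def lifecycle_rule(environment: str) -> dict:
    """Build one S3 lifecycle rule expiring snapshots for an environment."""
    return {
        "ID": f"expire-{environment}-snapshots",
        "Filter": {"Tag": {"Key": "environment", "Value": environment}},
        "Status": "Enabled",
        "Expiration": {"Days": RETENTION_DAYS[environment]},
    }
```

The resulting dicts are shaped for `put_bucket_lifecycle_configuration` in boto3, one rule per environment in a shared bucket.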
### Compression

Snapshots are compressed with gzip before storage:

```python
import gzip


async def store_snapshot(snapshot: MCPErrorSnapshot) -> str:
    """Store a snapshot with gzip compression."""
    payload = snapshot.model_dump_json()
    compressed = gzip.compress(payload.encode())
    key = (
        f"{snapshot.timestamp:%Y/%m/%d}/"
        f"{snapshot.user.tenant_id}/{snapshot.snapshot_id}.json.gz"
    )
    await s3.put_object(Bucket=bucket, Key=key, Body=compressed)
    return key
```
## Capture Conditions

Snapshots are captured based on:

```python
if not settings.enabled:
    return None
if response_status < 400:
    return None  # Success → no snapshot
if 400 <= response_status < 500 and not settings.capture_on_4xx:
    return None  # Client error → optional
if response_status >= 500 and not settings.capture_on_5xx:
    return None  # Server error → usually captured
```
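These conditions, together with the path exclusions, can be folded into a single predicate. A sketch using a minimal stand-in for `MCPSnapshotSettings`:

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Minimal stand-in for MCPSnapshotSettings (sketch only)."""
    enabled: bool = True
    capture_on_4xx: bool = True
    capture_on_5xx: bool = True
    exclude_paths: tuple = ("/health", "/metrics", "/ready")


def should_capture(settings: Settings, path: str, response_status: int) -> bool:
    """Decide whether a failed request warrants an error snapshot."""
    if not settings.enabled:
        return False
    if path in settings.exclude_paths:
        return False
    if response_status < 400:
        return False  # success: never snapshot
    if response_status < 500:
        return settings.capture_on_4xx  # client errors are optional
    return settings.capture_on_5xx  # server errors usually captured
```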
### Exclusions

```python
settings = MCPSnapshotSettings(
    exclude_paths=[
        "/health",
        "/metrics",
        "/ready",
    ],
    capture_on_4xx=True,  # Capture client errors
    capture_on_5xx=True,  # Capture server errors
)
```
## Cost Tracking

Snapshots record the resources wasted by the failed call:

```python
if llm_context:
    total_cost = llm_context.estimated_cost_usd
    tokens_wasted = llm_context.tokens_input + llm_context.tokens_output

snapshot = MCPErrorSnapshot(
    total_cost_usd=total_cost,
    tokens_wasted=tokens_wasted,
    ...
)
```
Enables dashboards showing:
- Cost impact of errors per tenant
- Token waste trends
- Most expensive error types
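One such aggregation, summing wasted cost per error type over a batch of snapshots, can be sketched as follows (field names follow the snapshot model; in production this query would typically run in OpenSearch rather than in Python):

```python
from collections import defaultdict


def cost_by_error_type(snapshots) -> dict[str, float]:
    """Total wasted USD per error type, given objects exposing
    .error_type and .total_cost_usd (see the snapshot model)."""
    totals: dict[str, float] = defaultdict(float)
    for snap in snapshots:
        totals[snap.error_type] += snap.total_cost_usd
    return dict(totals)
```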
## Consequences

### Positive

- **Time-Travel Debugging**: Reconstruct the exact request state hours later
- **PII-Safe**: Automatic masking prevents compliance issues
- **Cost Visibility**: Track wasted tokens and API costs
- **Replay Capability**: cURL commands for reproduction
- **Correlation**: Links to traces via `trace_id`/`span_id`
### Negative

- **Storage Costs**: Snapshots consume S3 storage
- **Capture Latency**: ~5 ms of overhead per error
- **Masking Gaps**: Custom fields may need configuration
- **Data Volume**: High-error scenarios generate many snapshots
### Mitigations
| Challenge | Mitigation |
|---|---|
| Storage costs | Compression + short retention |
| Capture latency | Async publishing via Kafka |
| Masking gaps | Configurable sensitive fields |
| Data volume | Rate limiting, sampling for 4xx |
## References

- `mcp-gateway/src/features/error_snapshots/`
- ADR-023: Zero Blind Spot Observability
- CAB-397: Error Snapshot Feature
*Standard Marchemalo: a 40-year veteran architect understands in 30 seconds.*