
ADR-009: Error Snapshots — Time-Travel Debugging with PII Masking

Metadata

| Field  | Value      |
| ------ | ---------- |
| Status | Accepted   |
| Date   | 2026-02-06 |
| Linear | CAB-397    |

Context

When MCP tool invocations fail, debugging requires understanding the complete context:

- What was the request?
- What user/tenant initiated it?
- What LLM context existed?
- Were there retries?
- What was the cost impact?

Traditional logging captures fragments. Developers must correlate logs across services manually. This is especially problematic for:

- Intermittent failures
- Rate limit breaches
- Backend timeouts
- Schema validation errors

The Problem

"An AI agent failed mid-conversation. The user reports it 2 hours later. How do we reconstruct exactly what happened?"

Decision

Implement Error Snapshots — complete point-in-time captures of failed requests with automatic PII masking and cURL replay generation.

Architecture

```text
┌──────────────────────────────────────────────────────────────────┐
│ MCP Gateway                                                      │
│                                                                  │
│ ┌────────────┐    ┌────────────────┐    ┌──────────────────────┐ │
│ │ Tool       │───▶│ Error Handler  │───▶│ capture_mcp_error()  │ │
│ │ Invocation │    │                │    │ - Build context      │ │
│ │ (fails)    │    │ status >= 400  │    │ - Mask PII           │ │
│ └────────────┘    └────────────────┘    │ - Calculate cost     │ │
│                                         │ - Publish async      │ │
│                                         └──────────┬───────────┘ │
└────────────────────────────────────────────────────┼─────────────┘
                                                     │
                                                     ▼
┌──────────────────────────────────────────────────────────────────┐
│ Snapshot Publisher                                               │
│                                                                  │
│ ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐  │
│ │ Kafka Topic     │   │ MinIO/S3        │   │ OpenSearch      │  │
│ │ error-snapshots │   │ (compressed)    │   │ (indexed)       │  │
│ └─────────────────┘   └─────────────────┘   └─────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```

Snapshot Model

```python
@dataclass
class MCPErrorSnapshot:
    # --- Identification ---
    snapshot_id: str                        # UUID
    timestamp: datetime
    environment: str                        # dev, staging, prod

    # --- Error Details ---
    error_type: MCPErrorType                # Enum: TOOL_EXECUTION, RATE_LIMIT, AUTH, etc.
    error_message: str                      # Masked

    # --- Request Context ---
    request: RequestContext                 # Method, path, headers (masked), body
    response_status: int

    # --- User Context ---
    user: UserContext                       # tenant_id, user_id (hashed), roles

    # --- MCP Context ---
    mcp_server: MCPServerContext | None     # Server ID, protocol version
    tool_invocation: ToolInvocation | None  # Tool name, params (masked)
    llm_context: LLMContext | None          # Tokens, model, estimated cost

    # --- Retry Context ---
    retry_context: RetryContext | None      # Attempt count, backoff

    # --- Observability ---
    trace_id: str | None
    span_id: str | None
    conversation_id: str | None

    # --- Cost Impact ---
    total_cost_usd: float
    tokens_wasted: int

    # --- PII Tracking ---
    masked_fields: list[str]                # Fields that were masked
```

Error Types

```python
class MCPErrorType(Enum):
    TOOL_EXECUTION = "tool_execution"  # Tool failed during execution
    TOOL_NOT_FOUND = "tool_not_found"  # Tool doesn't exist
    VALIDATION = "validation"          # Input validation failed
    AUTH = "auth"                      # Authentication/authorization
    RATE_LIMIT = "rate_limit"          # Rate limit exceeded
    TIMEOUT = "timeout"                # Backend timeout
    UPSTREAM = "upstream"              # Backend returned error
    INTERNAL = "internal"              # MCP Gateway internal error
```

PII Masking Strategy

Automatic Masking

Three layers of masking are applied before storage:

```python
# 1. Headers
masked_headers = mask_headers(request.headers)
# Authorization: Bearer eyJ... → Authorization: [REDACTED]
# X-API-Key: sk-... → X-API-Key: [REDACTED]

# 2. Tool Parameters
masked_params = mask_tool_params(tool.input_params)
# {"password": "secret123"} → {"password": "[REDACTED]"}
# {"email": "user@example.com"} → {"email": "[REDACTED]"}

# 3. Error Messages
masked_message = mask_error_message(error_message)
# "Invalid token for user john@..." → "Invalid token for user [REDACTED]"
```
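A minimal sketch of what the three masking helpers might look like. The field lists mirror the sensitive-fields configuration, but the helper bodies, the email regex, and the `REDACTED` constant are assumptions, not the shipped implementation.

```python
import re

REDACTED = "[REDACTED]"
SENSITIVE_HEADERS = {"authorization", "x-api-key", "cookie", "x-client-secret"}
SENSITIVE_PARAMS = {"password", "secret", "token", "api_key", "email", "phone"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_headers(headers: dict[str, str]) -> dict[str, str]:
    """Replace values of sensitive headers, matched case-insensitively."""
    return {
        k: (REDACTED if k.lower() in SENSITIVE_HEADERS else v)
        for k, v in headers.items()
    }

def mask_tool_params(params: dict) -> dict:
    """Recursively redact sensitive keys in nested tool parameters."""
    masked: dict = {}
    for k, v in params.items():
        if k.lower() in SENSITIVE_PARAMS:
            masked[k] = REDACTED
        elif isinstance(v, dict):
            masked[k] = mask_tool_params(v)
        else:
            masked[k] = v
    return masked

def mask_error_message(message: str) -> str:
    """Scrub free-text error messages of embedded email addresses."""
    return EMAIL_RE.sub(REDACTED, message)

print(mask_headers({"Authorization": "Bearer eyJ...", "Accept": "*/*"}))
# {'Authorization': '[REDACTED]', 'Accept': '*/*'}
```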

Sensitive Fields Configuration

```python
settings = MCPSnapshotSettings(
    sensitive_headers=[
        "authorization",
        "x-api-key",
        "cookie",
        "x-client-secret",
    ],
    sensitive_params=[
        "password",
        "secret",
        "token",
        "api_key",
        "email",
        "phone",
        "ssn",
        "credit_card",
    ],
)
```

cURL Replay Generation

Each snapshot includes a cURL command to replay the request:

```python
import json

def generate_curl_command(snapshot: MCPErrorSnapshot) -> str:
    """Generate a cURL command for replay (secrets replaced with placeholders)."""
    cmd = f"curl -X {snapshot.request.method}"
    cmd += f" '{snapshot.request.url}'"

    for header, value in snapshot.request.headers.items():
        if header.lower() in settings.sensitive_headers:
            cmd += f" -H '{header}: ${{YOUR_{header.upper()}_HERE}}'"
        else:
            cmd += f" -H '{header}: {value}'"

    if snapshot.request.body:
        cmd += f" -d '{json.dumps(snapshot.request.body)}'"

    return cmd
```

Example output:

```shell
curl -X POST '${STOA_GATEWAY_URL}/tools/call' \
  -H 'Authorization: ${YOUR_AUTHORIZATION_HERE}' \
  -H 'Content-Type: application/json' \
  -d '{"tool": "acme:payment-api:create", "arguments": {...}}'
```

Storage Strategy

Partitioning

Snapshots are partitioned by date and tenant:

```text
s3://stoa-error-snapshots/
├── 2026/
│   └── 02/
│       └── 06/
│           ├── tenant-acme/
│           │   ├── snap_abc123.json.gz
│           │   └── snap_def456.json.gz
│           └── tenant-beta/
│               └── snap_xyz789.json.gz
```

Retention

| Environment | Retention | Rationale             |
| ----------- | --------- | --------------------- |
| Production  | 30 days   | Compliance, debugging |
| Staging     | 7 days    | Testing               |
| Development | 1 day     | Cost savings          |
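One way to enforce this table is with S3 lifecycle rules. A sketch, assuming snapshots for each environment live under an `{env}/` prefix (or in per-environment buckets); the helper name and the boto3 call in the closing comment are illustrative:

```python
# Per-environment retention, matching the table above.
RETENTION_DAYS = {"prod": 30, "staging": 7, "dev": 1}

def lifecycle_config(retention: dict[str, int]) -> dict:
    """Build an S3 lifecycle policy expiring snapshots per environment."""
    return {
        "Rules": [
            {
                "ID": f"expire-{env}-snapshots",
                "Filter": {"Prefix": f"{env}/"},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
            for env, days in retention.items()
        ]
    }

# Applied with boto3 (not executed here):
# s3.put_bucket_lifecycle_configuration(
#     Bucket="stoa-error-snapshots",
#     LifecycleConfiguration=lifecycle_config(RETENTION_DAYS),
# )
```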

Compression

Snapshots are compressed with gzip before storage:

```python
async def store_snapshot(snapshot: MCPErrorSnapshot) -> str:
    """Store a snapshot with gzip compression."""
    payload = snapshot.model_dump_json()
    compressed = gzip.compress(payload.encode())

    key = f"{snapshot.timestamp:%Y/%m/%d}/{snapshot.user.tenant_id}/{snapshot.snapshot_id}.json.gz"
    await s3.put_object(Bucket=bucket, Key=key, Body=compressed)

    return key
```

Capture Conditions

Snapshots are captured based on status and settings:

```python
if not settings.enabled:
    return None

if response_status < 400:
    return None  # Success — no snapshot

if 400 <= response_status < 500 and not settings.capture_on_4xx:
    return None  # Client error — optional

if response_status >= 500 and not settings.capture_on_5xx:
    return None  # Server error — usually captured
```

Exclusions

```python
settings = MCPSnapshotSettings(
    exclude_paths=[
        "/health",
        "/metrics",
        "/ready",
    ],
    capture_on_4xx=True,  # Capture client errors
    capture_on_5xx=True,  # Capture server errors
)
```

Cost Tracking

Each snapshot records the resources wasted by the failed request:

```python
if llm_context:
    total_cost = llm_context.estimated_cost_usd
    tokens_wasted = llm_context.tokens_input + llm_context.tokens_output

snapshot = MCPErrorSnapshot(
    total_cost_usd=total_cost,
    tokens_wasted=tokens_wasted,
    ...
)
```

This enables dashboards showing:

- Cost impact of errors per tenant
- Token waste trends
- Most expensive error types

Consequences

Positive

- Time-Travel Debugging — Reconstruct exact request state hours later
- PII-Safe — Automatic masking prevents compliance issues
- Cost Visibility — Track wasted tokens and API costs
- Replay Capability — cURL commands for reproduction
- Correlation — Links to traces via trace_id/span_id

Negative

- Storage Costs — Snapshots consume S3 storage
- Capture Latency — ~5ms overhead per error
- Masking Gaps — Custom fields may need configuration
- Data Volume — High-error scenarios generate many snapshots

Mitigations

| Challenge       | Mitigation                      |
| --------------- | ------------------------------- |
| Storage costs   | Compression + short retention   |
| Capture latency | Async publishing via Kafka      |
| Masking gaps    | Configurable sensitive fields   |
| Data volume     | Rate limiting, sampling for 4xx |
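The 4xx sampling mitigation can be implemented statelessly by hashing the snapshot ID; a sketch, with `should_capture` and the 10% default rate as illustrative choices:

```python
import hashlib

def should_capture(snapshot_id: str, status: int, sample_rate_4xx: float = 0.1) -> bool:
    """Always capture 5xx; deterministically sample 4xx by snapshot ID."""
    if status >= 500:
        return True
    # Hash-based sampling: stable per snapshot, no shared counter needed.
    bucket = int(hashlib.sha256(snapshot_id.encode()).hexdigest(), 16) % 100
    return bucket < sample_rate_4xx * 100

print(should_capture("snap_abc123", 503))  # True
```

Because the decision depends only on the snapshot ID, retries of the same failed request make the same capture decision, which keeps volume predictable without coordination.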
