Audit Trails When Things Go Wrong (Freelancer Part 3)
Three weeks after launch, a client emails: "Our data looks wrong. Something happened last Tuesday at 3am."
You have two possible answers:
- "Let me check the logs" – with an audit trail
- "I don't know, I'll have to investigate" – without one
This is Part 3. We'll build the audit capability that lets you answer the first way.
This is Part 3 of the Freelancer API Security Series. Part 1: Your APIs Are More Vulnerable Than You Think | Part 2: Rate Limiting Strategies That Actually Work
What an Audit Trail Actually Is
An audit trail is not just access logs. It's the structured record that answers:
- Who made a request (consumer identity, not just IP)
- What they requested (method, path, request body summary)
- When it happened (timestamp, time zone)
- What happened (response code, latency, error message)
- What changed (before/after for mutations)
Access logs give you WHO and WHEN. A proper audit trail gives you all five.
The difference matters when your client asks "who deleted our API key?" – access logs might tell you that a DELETE request happened at 3am. An audit trail tells you which user, from which session, with what credentials.
Layer 1: Gateway-Level Logging (Automatic)
STOA's gateway logs every proxied request automatically. No configuration needed. Every request through the gateway generates a structured audit event:
```json
{
  "event_type": "api_request",
  "timestamp": "2026-03-11T03:17:42.123Z",
  "request_id": "req_abc123def456",
  "consumer_id": "cons_xyz789",
  "consumer_name": "my-saas-client",
  "tenant_id": "tenant_abc",
  "api_id": "api_123",
  "method": "POST",
  "path": "/api/documents/export",
  "status_code": 200,
  "latency_ms": 342,
  "request_size_bytes": 1024,
  "response_size_bytes": 45680,
  "gateway_version": "0.1.0",
  "datacenter": "fra1"
}
```
Query it:
```bash
# Last 50 requests
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?limit=50" \
  -H "Authorization: Bearer $TOKEN" | jq '.logs[] | {timestamp, consumer_name, method, path, status_code, latency_ms}'

# All requests from a specific consumer
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?consumer_id=$CONSUMER_ID&limit=100" \
  -H "Authorization: Bearer $TOKEN" | jq .

# All 5xx errors in last 24h
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?status_min=500&hours=24" \
  -H "Authorization: Bearer $TOKEN" | jq '.logs | length'

# Export as CSV for analysis
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/export/csv?hours=168" \
  -H "Authorization: Bearer $TOKEN" > last-7-days.csv
```
This is your first-line audit capability – zero config, always on.
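The same queries can be scripted without curl and jq – handy for scheduled jobs. A minimal sketch in Python (stdlib only); the endpoint shape and the `STOA_API_URL` / `TOKEN` environment variables mirror the curl examples above:

```python
import json
import os
import urllib.parse
import urllib.request

def audit_url(base: str, tenant_id: str, **params) -> str:
    """Build the /v1/audit query URL used in the curl examples."""
    return f"{base}/v1/audit/{tenant_id}?" + urllib.parse.urlencode(params)

def fetch_audit_logs(tenant_id: str, **params) -> list[dict]:
    """GET the audit endpoint and return the parsed .logs array."""
    req = urllib.request.Request(
        audit_url(os.environ["STOA_API_URL"], tenant_id, **params),
        headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["logs"]
```

`fetch_audit_logs("tenant_abc", limit=50)` is then equivalent to the first curl command.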
Layer 2: Security Event Logging
Beyond request logs, STOA tracks security-relevant events separately:
```bash
# Security events: key rotations, failed auth, rate limit hits, policy changes
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/security?hours=24" \
  -H "Authorization: Bearer $TOKEN" | jq '.events[] | {type, timestamp, actor, target, outcome}'
```
Security events include:
| Event Type | What Triggers It | Why It Matters |
|---|---|---|
| `auth_failed` | Invalid or expired API key | Brute force attempts, leaked keys |
| `rate_limit_exceeded` | Consumer hits rate limit | Scraping, runaway jobs |
| `policy_applied` | Policy blocked a request | Payload too large, forbidden path |
| `key_rotated` | API key was rotated | Credential rotation audit |
| `key_revoked` | API key was revoked | Deprovisioning, compromise response |
| `consumer_suspended` | Consumer was suspended | Abuse response |
| `admin_action` | Admin changed configuration | Change audit trail |
Filter by type:
```bash
# Auth failures in last 24h – potential key abuse
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/security?event_type=auth_failed&hours=24" \
  -H "Authorization: Bearer $TOKEN" | jq '{
    total_failures: (.events | length),
    by_consumer: (.events | group_by(.target) | map({consumer: .[0].target, count: length}) | sort_by(.count) | reverse | .[0:5])
  }'
```
If you see 5,000 auth failures from a single consumer in 24 hours, that's a compromised key being tested. Revoke it immediately.
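That check can be automated. A small sketch, assuming the security endpoint returns events shaped like the jq output above (`event_type`, `target`); the threshold is illustrative and should match your traffic profile:

```python
from collections import Counter

def flag_suspect_keys(events: list[dict], threshold: int = 1000) -> list[str]:
    """Count auth_failed events per target consumer and return any
    consumer at or above the threshold -- a cheap tripwire for a
    leaked or brute-forced key."""
    failures = Counter(
        e["target"] for e in events if e["event_type"] == "auth_failed"
    )
    return [consumer for consumer, count in failures.items() if count >= threshold]
```

Run it against the last 24h of security events; any consumer it returns is a candidate for immediate key revocation.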
Layer 3: Application-Level Logging (What You Build)
The gateway logs the transport layer. You need to log the business layer. The two are complementary.
What the gateway logs: Who called what, when, with what result.
What your application should log: What business operation happened, what data changed, why.
A good application-level audit log:
```python
import logging
import json
from datetime import datetime, UTC

audit_logger = logging.getLogger("audit")

def log_audit_event(
    action: str,
    resource_type: str,
    resource_id: str,
    actor_id: str,
    changes: dict | None = None,
    metadata: dict | None = None,
):
    """Structured audit log entry for business events."""
    event = {
        "timestamp": datetime.now(UTC).isoformat(),
        "action": action,                # "created", "updated", "deleted", "accessed"
        "resource_type": resource_type,  # "document", "user", "invoice"
        "resource_id": resource_id,
        "actor_id": actor_id,            # From the API key / JWT claim
        "changes": changes,              # {"field": {"before": X, "after": Y}}
        "metadata": metadata,            # Any extra context
    }
    audit_logger.info(json.dumps(event))

# Usage in a route handler:
@app.delete("/api/documents/{doc_id}")
async def delete_document(doc_id: str, consumer: Consumer = Depends(get_consumer)):
    doc = await get_document(doc_id, consumer)
    # Log before deletion – you won't be able to log after
    log_audit_event(
        action="deleted",
        resource_type="document",
        resource_id=doc_id,
        actor_id=consumer.id,
        metadata={
            "document_name": doc.name,
            "document_size_bytes": doc.size,
            "reason": "user_requested"
        }
    )
    await db.delete(doc)
    return {"status": "deleted"}
```
Key rule for mutations: Log before you execute the operation, not after. If the operation fails, you still have the audit entry. If you log after and the service crashes, the event is lost.
Layer 4: Structured Query Patterns
Raw logs are useless without the ability to query them. Here are the patterns you'll actually use.
Pattern 1: Timeline Reconstruction
"What happened between 3am and 4am on Tuesday?"
```bash
# Gateway logs for the window
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?from=2026-03-11T03:00:00Z&to=2026-03-11T04:00:00Z&limit=500" \
  -H "Authorization: Bearer $TOKEN" | jq '.logs | group_by(.path) | map({
    path: .[0].path,
    total_requests: length,
    errors: (map(select(.status_code >= 400)) | length),
    consumers: ([.[].consumer_name] | unique)
  })'
```
This gives you a breakdown by endpoint: how many calls, how many errors, which consumers.
Pattern 2: Consumer Activity Profile
"What does 'user-12345' actually do in the API?"
```bash
CONSUMER_ID="cons_xyz789"
START="2026-03-01T00:00:00Z"
END="2026-03-15T23:59:59Z"

curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?consumer_id=$CONSUMER_ID&from=$START&to=$END&limit=1000" \
  -H "Authorization: Bearer $TOKEN" | jq '{
    total_requests: (.logs | length),
    methods: (.logs | group_by(.method) | map({method: .[0].method, count: length})),
    top_paths: (.logs | group_by(.path) | map({path: .[0].path, count: length}) | sort_by(.count) | reverse | .[0:10]),
    error_rate: ((.logs | map(select(.status_code >= 400)) | length) / (.logs | length) * 100),
    peak_hour: (.logs | group_by(.timestamp[0:13]) | map({hour: .[0].timestamp[0:13], count: length}) | sort_by(.count) | last)
  }'
```
This profile answers: is this consumer using the API as intended? Are they hitting paths they shouldn't? Is their error rate unusually high?
Pattern 3: Anomaly Detection
"Has anything unusual happened recently?"
```bash
# Find consumers with unusually high error rates
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?hours=24&limit=5000" \
  -H "Authorization: Bearer $TOKEN" | jq '.logs | group_by(.consumer_id) | map({
    consumer: .[0].consumer_name,
    total: length,
    errors: (map(select(.status_code >= 400)) | length),
    error_rate: ((map(select(.status_code >= 400)) | length) / length * 100)
  }) | map(select(.error_rate > 10)) | sort_by(.error_rate) | reverse'
```
A consumer with >10% error rate is either:
- Integrating incorrectly (innocent – reach out and help them)
- Testing attack patterns (not innocent – investigate)
The distinction is usually clear from the error types: 400 errors are usually bad integration; 401/403 errors in volume are usually probing.
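That rule of thumb is easy to encode. A rough triage sketch over the gateway log records (status codes only; treat the output as a hint to investigate, not a verdict):

```python
def classify_error_pattern(logs: list[dict]) -> str:
    """Mostly 400s -> likely a broken integration; 401/403 in volume
    -> likely probing; otherwise nothing notable."""
    bad_requests = sum(1 for r in logs if r["status_code"] == 400)
    auth_errors = sum(1 for r in logs if r["status_code"] in (401, 403))
    if auth_errors > bad_requests and auth_errors > 0:
        return "probing"
    if bad_requests > 0:
        return "bad_integration"
    return "nothing_notable"
```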
Pattern 4: Data Export for Compliance
Some clients or industries require audit trail exports (GDPR requests, SOC 2 audits, customer incident investigation):
```bash
# Export full 30-day audit for a specific consumer
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/export/json?consumer_id=$CONSUMER_ID&days=30" \
  -H "Authorization: Bearer $TOKEN" > audit-export-$(date +%Y%m%d).json

# CSV format for spreadsheet analysis
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/export/csv?consumer_id=$CONSUMER_ID&days=30" \
  -H "Authorization: Bearer $TOKEN" > audit-export-$(date +%Y%m%d).csv
```
Minimum Viable Incident Response
You're a solo developer. You don't have a SIEM, a SOC, or an incident response team. Here's what you DO have, and how to use it.
The 5-Minute Investigation Script
When something goes wrong, run this:
```bash
#!/bin/bash
# investigate.sh – quick API incident triage
# Usage: ./investigate.sh [consumer_id] [hours=24]
# Requires STOA_API_URL, TENANT_ID, and TOKEN in the environment.

CONSUMER_ID="${1:-}"
HOURS="${2:-24}"

echo "=== API Incident Triage (last ${HOURS}h) ==="
echo "Timestamp: $(date -u)"
echo ""

# 1. Recent error rate
echo "--- Error Rate ---"
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?hours=$HOURS&limit=5000" \
  -H "Authorization: Bearer $TOKEN" | jq '{
    total: (.logs | length),
    errors_4xx: (.logs | map(select(.status_code >= 400 and .status_code < 500)) | length),
    errors_5xx: (.logs | map(select(.status_code >= 500)) | length),
    rate_limited: (.logs | map(select(.status_code == 429)) | length)
  }'

# 2. Top consumers by volume
echo ""
echo "--- Top Consumers (${HOURS}h) ---"
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?hours=$HOURS&limit=5000" \
  -H "Authorization: Bearer $TOKEN" | jq '.logs | group_by(.consumer_id) | map({consumer: .[0].consumer_name, requests: length}) | sort_by(.requests) | reverse | .[0:5]'

# 3. Security events
echo ""
echo "--- Security Events (${HOURS}h) ---"
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/security?hours=$HOURS" \
  -H "Authorization: Bearer $TOKEN" | jq '.events | group_by(.event_type) | map({type: .[0].event_type, count: length})'

# 4. If a consumer is specified, show their activity
if [ -n "$CONSUMER_ID" ]; then
  echo ""
  echo "--- Consumer $CONSUMER_ID Activity ---"
  curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?consumer_id=$CONSUMER_ID&hours=$HOURS&limit=500" \
    -H "Authorization: Bearer $TOKEN" | jq '{
      total_requests: (.logs | length),
      error_rate: ((.logs | map(select(.status_code >= 400)) | length) / (.logs | length) * 100 | floor),
      top_paths: (.logs | group_by(.path) | map({path: .[0].path, count: length}) | sort_by(.count) | reverse | .[0:5])
    }'
fi
```
This 5-minute script answers the most common "what happened?" questions without deep investigation.
Incident Response Playbook
Scenario: Possible data breach
- Run `./investigate.sh "" 48` to get a last-48h overview
- Check for unusual consumers: large request volumes, high error rates, access to sensitive paths
- Export logs for the suspicious window: `GET /v1/audit/$TENANT_ID/export/json?from=...&to=...`
- Cross-reference with application logs (Layer 3)
- If confirmed breach: revoke affected keys immediately (`POST /v1/subscriptions/$ID/revoke`)
- Notify affected consumers (GDPR: 72h notification requirement in EU)
- Document the timeline from the audit trail for your notification
Scenario: Client reports data corruption
- Ask for the timestamp and affected resource ID
- Pull their request history: `GET /v1/audit/$TENANT_ID?hours=72&consumer_id=$THEIR_CONSUMER_ID`
- Look for the mutation event (DELETE, PUT, POST) near the reported timestamp
- Cross-reference with application audit logs (Layer 3) for the before/after values
- Answer: "At 03:17:42 UTC, a DELETE request was made from your API key for resource /documents/12345. Here is the audit record."
Scenario: Unexpected API cost spike
- Check daily volume: `GET /v1/quotas/$TENANT_ID/stats`
- Find the top consumers by volume over the spike period
- Check if any consumer had a runaway job (sustained high request rate with no pause)
- Apply a daily volume cap if not already configured (see Part 2)
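The runaway-job check can be done directly from the gateway logs. A sketch that buckets requests per consumer per hour (ISO 8601 timestamps, as in the events above); both thresholds are illustrative:

```python
from collections import defaultdict

def find_runaway_consumers(logs, min_per_hour=1000, busy_hours_needed=3):
    """Flag consumers with many busy hours: a human burst is one or
    two hot hours, a runaway job stays hot hour after hour."""
    buckets = defaultdict(lambda: defaultdict(int))
    for r in logs:
        # "2026-03-11T03:17:42.123Z"[:13] -> "2026-03-11T03"
        buckets[r["consumer_id"]][r["timestamp"][:13]] += 1
    return [
        consumer
        for consumer, hours in buckets.items()
        if sum(1 for n in hours.values() if n >= min_per_hour) >= busy_hours_needed
    ]
```

Feed it a JSON export of the spike period; anything it flags is a candidate for a daily volume cap or a conversation with the client.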
Building the Habit
An audit trail is only useful if you check it regularly. Build it into your weekly routine:
```bash
# Add to your weekly-stoa-review.sh (from the Week 1 runbook)
echo ""
echo "--- Security Events This Week ---"
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID/security?hours=168" \
  -H "Authorization: Bearer $TOKEN" | jq '{
    auth_failures: (.events | map(select(.event_type == "auth_failed")) | length),
    rate_limit_hits: (.events | map(select(.event_type == "rate_limit_exceeded")) | length),
    key_rotations: (.events | map(select(.event_type == "key_rotated")) | length),
    admin_actions: (.events | map(select(.event_type == "admin_action")) | length)
  }'

echo ""
echo "--- Error Rate Trend (by day) ---"
curl -s "${STOA_API_URL}/v1/audit/$TENANT_ID?hours=168&limit=10000" \
  -H "Authorization: Bearer $TOKEN" | jq '.logs | group_by(.timestamp[0:10]) | map({
    date: .[0].timestamp[0:10],
    total: length,
    errors: (map(select(.status_code >= 400)) | length),
    error_pct: ((map(select(.status_code >= 400)) | length) / length * 100 | floor)
  })'
```
Five minutes, once a week. Most problems are visible in the trend data before they become incidents.
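If you would rather compute the trend in Python than jq, a minimal equivalent of the daily grouping above (same gateway log shape):

```python
from collections import defaultdict

def daily_error_trend(logs: list[dict]) -> dict[str, dict]:
    """Group gateway events by calendar date and compute the error
    percentage per day, mirroring the jq query above."""
    days = defaultdict(lambda: [0, 0])  # date -> [total, errors]
    for r in logs:
        counts = days[r["timestamp"][:10]]
        counts[0] += 1
        if r["status_code"] >= 400:
            counts[1] += 1
    return {
        day: {"total": total, "errors": errors,
              "error_pct": round(errors / total * 100, 1)}
        for day, (total, errors) in sorted(days.items())
    }
```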
The Compliance Angle
If you're building for European clients, healthcare, or fintech, you may have compliance requirements around audit trails.
GDPR Article 30 requires a record of processing activities. An API audit trail that captures who accessed what personal data, when, and for what purpose is a significant part of your GDPR documentation.
DORA (Digital Operational Resilience Act) for financial services requires ICT incident logging and reporting. Your gateway audit trail provides the ICT event log.
SOC 2 Type II (if you're selling to enterprise) requires demonstrating audit controls. Your audit trail evidence (exported logs) is a direct artifact for the audit.
You don't need to configure anything extra for these. The audit trail described in this article is the artifact. The habit of checking it weekly, the ability to export it on demand, and the incident response playbook above are what turn the logs into compliance evidence.
FAQ

How long should I keep audit logs?
GDPR: data access logs should be kept for as long as you process that data (typically 12-24 months minimum). DORA: 5 years for ICT incident logs. Pragmatic recommendation: 90 days hot (queryable via API), 12 months cold (archived CSV exports).
Can I use STOA's audit trail as my sole compliance record?
It covers the gateway layer. For full compliance, combine it with application-level audit logs (Layer 3). The gateway tells you "who called what." Your app tells you "what changed as a result."
My client wants a self-service audit download. How do I do that?
Expose a secure admin endpoint in your app that calls GET /v1/audit/$TENANT_ID/export/csv?consumer_id={their_id} and returns the CSV. The client gets their own activity records; they don't access other consumers' data.
What's the difference between access logs and audit logs?
Access logs = raw request/response records (Apache/nginx format). Audit logs = structured business event records. STOA provides both. This article focuses on audit logs because they're what matters for security investigation and compliance.
Can I forward STOA's audit events to my existing SIEM?
STOA emits audit events to a Kafka topic (if Kafka is configured). You can consume from that topic and forward to Datadog, Splunk, OpenSearch, or any SIEM. For most freelancer setups, the API query approach in this article is sufficient without a full SIEM pipeline.
You've Completed the Series
Part 1 covered the threat landscape and the 80/20 of what to protect against.
Part 2 built the rate limiting strategy that stops the most common attacks.
Part 3 (this article) built the audit capability that lets you investigate when something slips through.
The combination – gateway security baseline + tiered rate limiting + structured audit trail – is what production API security looks like for solo developers and small teams. No enterprise security team required.
Next steps:
- Developer Portal Guide – let clients self-register and manage their own keys
- Authentication Guide – JWT, OAuth 2.0, and mTLS options for higher-security scenarios
- Observability Guide – full Prometheus + Grafana setup for uptime monitoring