
Benchmark Methodology

This document describes the full methodology behind the Gateway Arena, STOA's continuous comparative benchmark for API gateways. For results, see Performance Benchmarks. For the architectural decision, see ADR-049.

Design Principles​

The Gateway Arena is built on four principles:

  1. Fair comparison: All gateways hit the same local echo backend (nginx, static JSON, <1ms response). Benchmarks measure pure gateway overhead, not backend or network latency.
  2. Statistical rigor: Median-of-N scoring with 95% confidence intervals (CI95) computed from Student's t-distribution. No cherry-picked best runs.
  3. Reproducibility: All scripts are open source. Anyone can run the same benchmark on their own infrastructure.
  4. Open participation: Any gateway can be added to the Arena. Same scenarios, same scoring, same methodology.

Two-Layer Framework​

The Arena has two independent layers, each with its own CronJob, metrics, and Grafana dashboard.

| Property | Layer 0: Proxy Baseline | Layer 1: Enterprise AI Readiness |
|---|---|---|
| Measures | Raw HTTP proxy throughput | AI-native gateway capabilities |
| Scenarios | 7 scored + 1 warmup (health, sequential, 3 bursts, sustained, ramp) | 8 enterprise dimensions |
| Runs | 5 (1 discarded) → 4 valid | 3 (1 discarded) → 2 valid |
| Schedule | Every 30 minutes | Every hour |
| Score range | 0–100 | 0–100 (Enterprise Readiness Index) |
| CronJob | gateway-arena | gateway-arena-enterprise |
| Prometheus job | gateway_arena | gateway_arena_enterprise |

Layer 0: Proxy Baseline​

Scenarios​

Each scenario is a separate k6 invocation with its own load profile.

| # | Scenario | Executor | VUs | Iterations/Duration | Purpose |
|---|---|---|---|---|---|
| 0 | warmup | shared-iterations | 10 | 50 iterations, 15s max | JVM/runtime warm-up. Discarded from scoring. |
| 1 | health | shared-iterations | 1 | 1 iteration, 10s max | Health probe baseline |
| 2 | sequential | shared-iterations | 1 | 20 iterations, 30s max | Single-client P95 latency |
| 3 | burst_10 | shared-iterations | 10 | 10 iterations, 15s max | Light burst |
| 4 | burst_50 | ramping-vus | 0→50 | 5s ramp, 10s hold, 3s ramp-down | Medium burst capacity |
| 5 | burst_100 | ramping-vus | 0→100 | 5s ramp, 10s hold, 3s ramp-down | Heavy burst capacity |
| 6 | sustained | shared-iterations | 1 | 100 iterations, 60s max | Consistency under steady load |
| 7 | ramp_up | ramping-arrival-rate | 10→100 req/s | 60s (6 stages) | Throughput scaling |

Run Protocol​

For each gateway, the orchestrator (run-arena.sh) executes:

For run = 1 to 5:
    For scenario in [warmup, health, sequential, burst_10, burst_50, burst_100, sustained, ramp_up]:
        k6 run benchmark.js --env SCENARIO={scenario} → JSON summary

Run 1 is the warm-up run and is discarded from scoring. Runs 2–5 are scored.

Scoring Formula​

The composite score is a weighted sum of 7 dimension scores:

Score = 0.10 × Base
      + 0.20 × Burst50
      + 0.20 × Burst100
      + 0.15 × Availability
      + 0.10 × Error
      + 0.10 × Consistency
      + 0.15 × RampUp

Dimension: Base (Sequential Latency)​

Measures single-client P95 latency against a cap of 400ms.

Base = max(0, min(100, 100 × (1 − P95_sequential / 0.4)))

A P95 of 0ms scores 100. A P95 of 400ms scores 0.

Dimension: Burst50 / Burst100​

Same formula with different caps:

Burst50  = max(0, min(100, 100 × (1 − P95_burst50  / 2.5)))
Burst100 = max(0, min(100, 100 × (1 − P95_burst100 / 4.0)))

| Dimension | Latency Cap | P95 = 0 | P95 = Cap | P95 > Cap |
|---|---|---|---|---|
| Base | 400ms | 100 | 0 | 0 |
| Burst50 | 2.5s | 100 | 0 | 0 |
| Burst100 | 4.0s | 100 | 0 | 0 |
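The clamped linear mapping is identical for all three latency dimensions; only the cap changes. A minimal Python sketch of the formula (illustrative only, not the actual run-arena.py code):

```python
def latency_score(p95: float, cap: float) -> float:
    """Clamped linear score: 100 at P95 = 0, falling to 0 at the cap."""
    return max(0.0, min(100.0, 100.0 * (1.0 - p95 / cap)))

# Values from the table above (seconds):
base     = latency_score(0.0, 0.4)   # P95 = 0   -> 100.0
burst50  = latency_score(2.5, 2.5)   # P95 = cap -> 0.0
burst100 = latency_score(8.0, 4.0)   # P95 > cap -> clamped to 0.0
```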

Dimension: Availability & Error​

Both use the same formula, based on total successful checks across all runs:

Availability = Error = 100 × (total_ok / total_requests)

Where total_ok and total_requests are summed across all scenarios and all valid runs. If no requests were made, the score defaults to 50.

Dimension: Consistency​

Uses an IQR-based coefficient of variation from the sustained scenario, which is robust to outliers and bimodal latency distributions (unlike a standard-deviation-based CV).

IQR_CV = (P75 − P25) / P50
Consistency = max(0, min(100, 100 × (1 − IQR_CV)))

A perfectly consistent gateway (P75 = P25) scores 100. High variance scores lower.
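A sketch of the same computation, assuming P25/P50/P75 have already been extracted from the sustained-scenario summary:

```python
def consistency_score(p25: float, p50: float, p75: float) -> float:
    """Map the IQR-based coefficient of variation to a 0-100 score."""
    iqr_cv = (p75 - p25) / p50
    return max(0.0, min(100.0, 100.0 * (1.0 - iqr_cv)))

# The ratio is unit-free, so values may be in ms or seconds:
perfect = consistency_score(10, 10, 10)   # P75 == P25 -> 100.0
spread  = consistency_score(10, 20, 20)   # IQR_CV = 0.5 -> 50.0
```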

Dimension: Ramp-Up​

Measures throughput scaling under increasing load. The ramp_up scenario uses a ramping-arrival-rate executor that pushes from 10 to 100 requests/second over 60 seconds.

effective_rate = observed_rate × success_rate

If P99 exceeds 2 seconds, a penalty is applied:

if P99 > 2s:
    effective_rate = effective_rate × max(0.5, 1.0 − (P99 − 2.0) / 8.0)

The score is the effective rate capped at 100:

RampUp = min(100, effective_rate)
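Combining the three steps above into one sketch (a hypothetical helper, not the actual run-arena.py code):

```python
def rampup_score(observed_rate: float, success_rate: float, p99: float) -> float:
    """Effective req/s, penalized for high P99, capped at 100."""
    effective = observed_rate * success_rate
    if p99 > 2.0:
        # Linear penalty from 1.0 at P99 = 2s down to a 0.5 floor at P99 >= 6s
        effective = effective * max(0.5, 1.0 - (p99 - 2.0) / 8.0)
    return min(100.0, effective)

rampup_score(95.0, 1.0, 0.8)   # healthy P99: no penalty -> 95.0
rampup_score(95.0, 1.0, 4.0)   # penalty factor 0.75 -> 71.25
```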

Weight Summary​

| Weight | Dimension | What It Rewards |
|---|---|---|
| 0.20 | Burst50 | Medium-scale burst handling |
| 0.20 | Burst100 | Heavy-scale burst handling |
| 0.15 | Availability | Request success rate |
| 0.15 | RampUp | Throughput scaling |
| 0.10 | Base | Low single-client latency |
| 0.10 | Error | Error-free operation |
| 0.10 | Consistency | Predictable response times |

Burst handling is weighted highest (40% combined) because API gateways in production face bursty traffic patterns more than sustained load.
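The composite then reduces to a single weighted sum. A sketch with a hypothetical set of dimension scores:

```python
# Layer 0 composite weights (sum to 1.0)
WEIGHTS = {
    "base": 0.10, "burst50": 0.20, "burst100": 0.20,
    "availability": 0.15, "error": 0.10, "consistency": 0.10, "rampup": 0.15,
}

def composite_score(dims: dict) -> float:
    """Weighted sum of the seven dimension scores (each 0-100)."""
    return sum(w * dims[name] for name, w in WEIGHTS.items())

# A hypothetical gateway: perfect everywhere except burst handling
example = composite_score({
    "base": 100, "burst50": 80, "burst100": 60,
    "availability": 100, "error": 100, "consistency": 100, "rampup": 100,
})   # 10 + 16 + 12 + 15 + 10 + 10 + 15 = 88
```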


Layer 1: Enterprise AI Readiness​

8 Dimensions​

| # | Dimension | Weight | Scenario | Latency Cap |
|---|---|---|---|---|
| 1 | MCP Discovery | 0.15 | ent_mcp_discovery | 500ms |
| 2 | MCP Tool Execution | 0.20 | ent_mcp_toolcall | 500ms |
| 3 | Auth Chain | 0.15 | ent_auth_chain | 1s |
| 4 | Policy Engine | 0.15 | ent_policy_eval | 200ms |
| 5 | AI Guardrails | 0.10 | ent_guardrails | 1s |
| 6 | Rate Limiting | 0.10 | ent_quota_burst | 1s |
| 7 | Resilience | 0.10 | ent_resilience | 1s |
| 8 | Agent Governance | 0.05 | ent_governance | 2s |

Per-Dimension Scoring​

Each dimension score is a weighted blend of availability and latency:

availability_score = passes / (passes + fails) × 100
latency_score = max(0, min(100, 100 × (1 − P95 / cap)))
dimension_score = 0.6 × availability_score + 0.4 × latency_score

Availability is weighted higher (60%): an AI gateway that doesn't support a feature fails its checks and scores near 0 regardless of how fast it responds.
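As a sketch, with the pass/fail counts and P95 assumed to come from the k6 summary:

```python
def dimension_score(passes: int, fails: int, p95: float, cap: float) -> float:
    """60/40 blend of check availability and P95 latency for one dimension."""
    availability_score = 100.0 * passes / (passes + fails)
    latency_score = max(0.0, min(100.0, 100.0 * (1.0 - p95 / cap)))
    return 0.6 * availability_score + 0.4 * latency_score

# Hypothetical tool-call dimension (cap 500ms): every check passes,
# P95 sits at half the cap -> 0.6*100 + 0.4*50
blended = dimension_score(100, 0, 0.25, 0.5)
```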

Enterprise Readiness Index (ERI)​

The composite score is the weighted sum of all dimensions:

ERI = Σ (weight_i × dimension_score_i)

Gateways without MCP support score 0, not N/A. This is intentional: the absence of a capability is a concrete, honest measurement. Any gateway can implement MCP and re-run the benchmark.

MCP Protocol Support​

The benchmark supports two MCP protocol variants:

| Protocol | Config Value | Endpoint Pattern | Used By |
|---|---|---|---|
| STOA REST | "stoa" | GET /mcp/capabilities, POST /mcp/tools/list, POST /mcp/tools/call | STOA Gateway |
| Streamable HTTP | "streamable-http" | POST /mcp with JSON-RPC 2.0 envelope | Gravitee 4.8+ |

The k6 script reads MCP_PROTOCOL to switch between request formats. Both protocols are tested with the same scoring formula.
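The shape of the two request formats can be sketched as follows (illustrative only; paths follow the endpoint patterns in the table above, the Streamable HTTP body uses a standard JSON-RPC 2.0 envelope, and the exact payloads in benchmark-enterprise.js may differ):

```python
import json

def build_tools_list_request(protocol: str, mcp_base: str):
    """Return (method, url, body) for a tools/list probe under each protocol."""
    if protocol == "stoa":
        # STOA REST: one path per MCP operation, plain JSON body
        return ("POST", f"{mcp_base}/tools/list", json.dumps({}))
    if protocol == "streamable-http":
        # Streamable HTTP: single endpoint, JSON-RPC 2.0 envelope
        envelope = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
        return ("POST", mcp_base, json.dumps(envelope))
    raise ValueError(f"unknown MCP_PROTOCOL: {protocol}")
```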


Statistical Methods​

Median Selection​

All per-scenario metrics (P50, P95, P99, pass/fail counts) are computed as the median of valid runs. Median is preferred over mean because it is robust to a single outlier run caused by GC pauses, noisy neighbors, or transient network issues.

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    if len(s) % 2:
        return s[mid]                    # odd count: middle element
    return (s[mid - 1] + s[mid]) / 2     # even count: mean of the two middle values

For Layer 0: 5 runs, 1 discarded → 4 valid runs → median of 4. For Layer 1: 3 runs, 1 discarded → 2 valid runs → median of 2 (for two values, the median equals their mean).

CI95 Confidence Intervals​

Confidence intervals use Student's t-distribution (not z-scores) because sample sizes are small (n=2–4). The t-distribution has heavier tails, producing wider (more honest) intervals for small samples.

CI95 = mean ± t(α/2, df) × (stddev / √n)

where:
df = n − 1 (degrees of freedom)
t(α/2, df) (t-critical value for 95% confidence)
stddev = √(Σ(xi − mean)² / (n − 1)) (sample standard deviation)

t-Critical Values​

| df | t-critical | Typical n |
|---|---|---|
| 1 | 12.706 | 2 runs (Layer 1) |
| 2 | 4.303 | 3 runs |
| 3 | 3.182 | 4 runs (Layer 0) |
| 4 | 2.776 | 5 runs |
| ≥120 | 1.96 | Large samples (≈ z-score) |

CI95 bounds are clamped to [0, 100] for composite scores.
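A sketch of the whole CI95 computation, using the t-critical table above and the [0, 100] clamp (illustrative; the scorer's actual code may differ):

```python
import math

# t-critical values for 95% two-sided confidence, keyed by df = n - 1
T_CRIT = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}

def ci95(scores):
    """CI95 for a list of composite scores, clamped to [0, 100]."""
    n = len(scores)
    mean = sum(scores) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    margin = T_CRIT[n - 1] * stddev / math.sqrt(n)
    return (max(0.0, mean - margin), min(100.0, mean + margin))

lo, hi = ci95([84.0, 86.0, 85.0, 87.0])   # 4 valid Layer 0 runs: df = 3
```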

Why Not z-Scores?​

With n=4 runs, using z=1.96 instead of t=3.182 would produce confidence intervals roughly 40% narrower (1.96 / 3.182 ≈ 0.62), giving a false impression of precision. The t-distribution correctly accounts for the uncertainty inherent in small sample sizes.


Architecture​

Infrastructure​

K8s CronJob (OVH MKS, co-located):
run-arena.sh (orchestrator)
└── For each gateway (3 K8s + N VPS):
    └── For each run (5):
        ├── k6 run (warmup) → discard
        └── k6 run (7 scenarios) → JSON summaries
run-arena.py (scorer)
└── Reads JSON → median → composite score → CI95 → Prometheus text
└── curl POST → Pushgateway

VPS Sidecar (host cron, co-located):
Same Docker image + scripts
Benchmarks 1 local gateway → pushes to Pushgateway

Echo Backend​

All gateways proxy to a shared echo server: an nginx container returning a static JSON payload in <1ms:

{"status": "ok", "server": "echo-k8s"}

This ensures benchmarks measure gateway overhead only, not backend performance or network latency. The echo server runs on the same cluster/host as the gateway being tested.

Current Participants​

In-Cluster (K8s, OVH MKS)

| Gateway | Proxy Port | Health Endpoint | Backend |
|---|---|---|---|
| STOA | 8080 | /health | echo-backend:8888 |
| Kong DB-less | 8000 | :8001/status | echo-backend:8888 |
| Gravitee APIM | 8082 | :18082/_node/health | echo-backend:8888 |

VPS (Co-located Sidecars)​

| Gateway | Host | Proxy Port | Backend |
|---|---|---|---|
| STOA | 51.83.45.13 | 8080 | echo-local:8888 |
| Kong | 51.83.45.13 | 8000 | echo-local:8888 |
| Gravitee | 54.36.209.237 | 8082 | echo-local:8888 |

Prometheus Metrics​

Layer 0​

| Metric | Type | Labels | Description |
|---|---|---|---|
| gateway_arena_score | gauge | gateway | Composite score (0–100) |
| gateway_arena_score_stddev | gauge | gateway | Run-to-run standard deviation |
| gateway_arena_score_ci_lower | gauge | gateway | CI95 lower bound |
| gateway_arena_score_ci_upper | gauge | gateway | CI95 upper bound |
| gateway_arena_runs | gauge | gateway | Number of valid runs |
| gateway_arena_availability | gauge | gateway | Health check success rate (0–1) |
| gateway_arena_p50_seconds | gauge | gateway, scenario | Median latency per scenario |
| gateway_arena_p95_seconds | gauge | gateway, scenario | P95 latency per scenario |
| gateway_arena_p99_seconds | gauge | gateway, scenario | P99 latency per scenario |
| gateway_arena_ramp_rate | gauge | gateway | Peak sustained req/s |
| gateway_arena_requests_total | gauge | gateway, scenario, status | Request counts |

Each latency metric also has _ci_lower_seconds and _ci_upper_seconds variants with the same labels.

Layer 1​

| Metric | Type | Labels | Description |
|---|---|---|---|
| gateway_arena_enterprise_score | gauge | gateway | Enterprise Readiness Index (0–100) |
| gateway_arena_enterprise_dimension | gauge | gateway, dimension | Per-dimension score (0–100) |
| gateway_arena_enterprise_score_ci_lower | gauge | gateway | CI95 lower bound |
| gateway_arena_enterprise_score_ci_upper | gauge | gateway | CI95 upper bound |
| gateway_arena_enterprise_score_stddev | gauge | gateway | Run-to-run standard deviation |
| gateway_arena_enterprise_runs | gauge | gateway | Valid enterprise run count |
| gateway_arena_enterprise_latency_p95 | gauge | gateway, dimension | P95 latency per dimension |

Adding a New Gateway​

K8s (In-Cluster)​

  1. Deploy the gateway in the stoa-system namespace with a Service
  2. Configure a route to the echo backend: http://echo-backend.stoa-system.svc:8888
  3. Add an entry to the GATEWAYS JSON in k8s/arena/cronjob-prod.yaml:
{
  "name": "my-gateway",
  "health": "http://my-gw-svc.stoa-system.svc:PORT/health",
  "proxy": "http://my-gw-svc.stoa-system.svc:PORT/echo/get"
}
  4. For Layer 1, add MCP fields:
{
  "name": "my-gateway",
  "target": "http://my-gw-svc:PORT",
  "health": "http://my-gw-svc:PORT/health",
  "mcp_base": "http://my-gw-svc:PORT/mcp",
  "mcp_protocol": "stoa"
}
  5. Update the ConfigMap and run a manual benchmark:
kubectl create configmap gateway-arena-scripts \
--from-file=scripts/traffic/arena/benchmark.js \
--from-file=scripts/traffic/arena/run-arena.sh \
--from-file=scripts/traffic/arena/run-arena.py \
-n stoa-system --dry-run=client -o yaml | kubectl apply -f -

kubectl create job --from=cronjob/gateway-arena arena-test-$(date +%s) -n stoa-system

VPS (Co-located Sidecar)​

  1. Deploy the gateway on the VPS
  2. Deploy the echo container on the same Docker network
  3. Add VPS configuration to deploy/vps/bench/deploy.sh
  4. Run: ./deploy/vps/bench/deploy.sh

Verification​

After adding a gateway, verify metrics appear in Pushgateway:

curl -s https://pushgateway.gostoa.dev/metrics | grep 'gateway="my-gateway"'

Score Interpretation​

| Score | Rating | Typical Cause |
|---|---|---|
| > 95 | Excellent | Co-located gateway with minimal overhead (Rust, C, optimized nginx) |
| 80–95 | Good | Well-configured gateway, normal for production setups |
| 60–80 | Acceptable | Check resource constraints, network hops, or JVM tuning |
| < 60 | Investigate | Connection failures, high error rates, or misconfiguration |

Reading CI95 Bounds​

  • Narrow bounds (e.g., [82, 88]): stable, reproducible results
  • Wide bounds (e.g., [60, 95]): high run-to-run variance β€” investigate noisy neighbors, GC pauses, or cold-start effects
  • Bounds crossing a threshold (e.g., [78, 92] crossing 80): the gateway's true performance is ambiguous at that threshold β€” more runs would narrow the interval

Reproducibility​

All Arena scripts are in the scripts/traffic/arena/ directory of the stoa repository.

| File | Purpose |
|---|---|
| benchmark.js | k6 scenario definitions (Layer 0) |
| benchmark-enterprise.js | k6 enterprise scenarios (Layer 1) |
| run-arena.sh | Shell orchestrator (Layer 0) |
| run-arena-enterprise.sh | Shell orchestrator (Layer 1) |
| run-arena.py | Python scorer with CI95 (Layer 0) |
| run-arena-enterprise.py | Python scorer with CI95 (Layer 1) |
| Dockerfile | Arena image: k6 0.54.0 + jq + curl + bash + python3 |

Running Locally​

# Build the arena image
docker build -t arena-bench scripts/traffic/arena/

# Run Layer 0 against a local gateway
docker run --rm \
  -e GATEWAYS='[{"name":"local","health":"http://host.docker.internal:8080/health","proxy":"http://host.docker.internal:8080/echo/get"}]' \
  -e RUNS=3 \
  -e DISCARD_FIRST=1 \
  arena-bench /scripts/run-arena.sh

Running on Kubernetes​

# Layer 0 β€” one-off
kubectl create job --from=cronjob/gateway-arena arena-manual -n stoa-system
kubectl logs -n stoa-system -l job-name=arena-manual --follow

# Layer 1 β€” one-off
kubectl create job --from=cronjob/gateway-arena-enterprise arena-ent-manual -n stoa-system
kubectl logs -n stoa-system -l job-name=arena-ent-manual --follow

# Clean up
kubectl delete job arena-manual arena-ent-manual -n stoa-system

Limitations and Known Biases​

  • Network locality β€” K8s gateways run on the same cluster as k6 (co-located). VPS gateways run on the same host. Cross-network benchmarks are not comparable due to variable latency.
  • JVM warm-up β€” The warmup run mitigates cold-start bias for JVM-based gateways (Kong/OpenResty, Gravitee), but 50 iterations may not fully warm all code paths.
  • Small sample size β€” Layer 1 uses only 2 valid runs (3 total, 1 discarded). CI95 bounds with df=1 use t=12.706, producing very wide intervals. This is statistically correct but limits precision.
  • MCP scoring bias β€” Layer 1 inherently favors gateways with MCP support. This is by design: the benchmark measures AI-native capabilities. Layer 0 provides the complementary proxy baseline.
  • Echo backend assumes GET β€” All proxy scenarios use HTTP GET against the echo backend. POST/PUT patterns with request body parsing are not benchmarked.

Feature comparisons are based on tests run under identical conditions as of the date noted above. Gateway capabilities change frequently. We encourage readers to verify current performance with their own workloads. All trademarks belong to their respective owners. See trademarks.