
Benchmark Methodology

This document describes the full methodology behind the Gateway Arena, STOA's continuous comparative benchmark for API gateways. For results, see Performance Benchmarks. For the architectural decision, see ADR-049.

Design Principles​

The Gateway Arena is built on four principles:

  1. Fair comparison: All gateways hit the same local echo backend (nginx, static JSON, <1ms response). Benchmarks measure pure gateway overhead, not backend or network latency.
  2. Statistical rigor: Median-of-N scoring with 95% confidence intervals (CI95) computed from Student's t-distribution. No cherry-picked best runs.
  3. Reproducibility: All scripts are open source. Anyone can run the same benchmark on their own infrastructure.
  4. Open participation: Any gateway can be added to the Arena. Same scenarios, same scoring, same methodology.

Two-Layer Framework​

The Arena has two independent layers, each with its own CronJob, metrics, and Grafana dashboard.

| Property | Layer 0: Proxy Baseline | Layer 1: Enterprise AI Readiness |
|---|---|---|
| Measures | Raw HTTP proxy throughput | AI-native gateway capabilities |
| Scenarios | 7 scored + 1 warmup (health, sequential, 3 bursts, sustained, ramp) | 8 enterprise dimensions |
| Runs | 5 (1 discarded) → 4 valid | 3 (1 discarded) → 2 valid |
| Schedule | Every 30 minutes | Every hour |
| Score range | 0–100 | 0–100 (Enterprise Readiness Index) |
| CronJob | gateway-arena | gateway-arena-enterprise |
| Prometheus job | gateway_arena | gateway_arena_enterprise |

Layer 0: Proxy Baseline​

Scenarios​

Each scenario is a separate k6 invocation with its own load profile.

| # | Scenario | Executor | VUs | Iterations/Duration | Purpose |
|---|---|---|---|---|---|
| 0 | warmup | shared-iterations | 10 | 50 iterations, 15s max | JVM/runtime warm-up. Discarded from scoring. |
| 1 | health | shared-iterations | 1 | 1 iteration, 10s max | Health probe baseline |
| 2 | sequential | shared-iterations | 1 | 20 iterations, 30s max | Single-client P95 latency |
| 3 | burst_10 | shared-iterations | 10 | 10 iterations, 15s max | Light burst |
| 4 | burst_50 | ramping-vus | 0→50 | 5s ramp, 10s hold, 3s ramp-down | Medium burst capacity |
| 5 | burst_100 | ramping-vus | 0→100 | 5s ramp, 10s hold, 3s ramp-down | Heavy burst capacity |
| 6 | sustained | shared-iterations | 1 | 100 iterations, 60s max | Consistency under steady load |
| 7 | ramp_up | ramping-arrival-rate | 10→100 req/s | 60s (6 stages) | Throughput scaling |

Run Protocol​

For each gateway, the orchestrator (run-arena.sh) executes:

For run = 1 to 5:
    For scenario in [warmup, health, sequential, burst_10, burst_50, burst_100, sustained, ramp_up]:
        k6 run benchmark.js --env SCENARIO={scenario} → JSON summary

Run 1 is the warm-up run and is discarded from scoring. Runs 2–5 are scored.

Scoring Formula​

The composite score is a weighted sum of 7 dimension scores:

Score = 0.10 × Base
      + 0.20 × Burst50
      + 0.20 × Burst100
      + 0.15 × Availability
      + 0.10 × Error
      + 0.10 × Consistency
      + 0.15 × RampUp

Dimension: Base (Sequential Latency)​

Measures single-client P95 latency against a cap of 400ms.

Base = max(0, min(100, 100 × (1 − P95_sequential / 0.4)))

A P95 of 0ms scores 100. A P95 of 400ms scores 0.

Dimension: Burst50 / Burst100​

Same formula with different caps:

Burst50  = max(0, min(100, 100 × (1 − P95_burst50  / 2.5)))
Burst100 = max(0, min(100, 100 × (1 − P95_burst100 / 4.0)))

| Dimension | Latency Cap | P95 = 0 | P95 = Cap | P95 > Cap |
|---|---|---|---|---|
| Base | 400ms | 100 | 0 | 0 |
| Burst50 | 2.5s | 100 | 0 | 0 |
| Burst100 | 4.0s | 100 | 0 | 0 |
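The clamped linear mapping is identical for all three latency dimensions; only the cap changes. A minimal Python sketch of the formula (illustrative only, not the actual run-arena.py code):

```python
def latency_score(p95: float, cap: float) -> float:
    """Clamped linear score: 100 at P95 = 0, falling to 0 at the cap."""
    return max(0.0, min(100.0, 100.0 * (1.0 - p95 / cap)))

# Values from the table above (seconds):
base     = latency_score(0.0, 0.4)   # P95 = 0   -> 100.0
burst50  = latency_score(2.5, 2.5)   # P95 = cap -> 0.0
burst100 = latency_score(8.0, 4.0)   # P95 > cap -> clamped to 0.0
```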

Dimension: Availability & Error​

Both use the same formula, based on total successful checks across all runs:

Availability = Error = 100 × (total_ok / total_requests)

Where total_ok and total_requests are summed across all scenarios and all valid runs. If no requests were made, the score defaults to 50.

Dimension: Consistency​

Uses an IQR-based coefficient of variation from the sustained scenario, which is robust to outliers and bimodal latency distributions (unlike a standard-deviation-based CV).

IQR_CV = (P75 − P25) / P50
Consistency = max(0, min(100, 100 × (1 − IQR_CV)))

A perfectly consistent gateway (P75 = P25) scores 100. High variance scores lower.
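A sketch of the same computation, assuming P25/P50/P75 have already been extracted from the sustained-scenario summary:

```python
def consistency_score(p25: float, p50: float, p75: float) -> float:
    """Map the IQR-based coefficient of variation to a 0-100 score."""
    iqr_cv = (p75 - p25) / p50
    return max(0.0, min(100.0, 100.0 * (1.0 - iqr_cv)))

# The ratio is unit-free, so values may be in ms or seconds:
perfect = consistency_score(10, 10, 10)   # P75 == P25 -> 100.0
spread  = consistency_score(10, 20, 20)   # IQR_CV = 0.5 -> 50.0
```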

Dimension: Ramp-Up​

Measures throughput scaling under increasing load. The ramp_up scenario uses a ramping-arrival-rate executor that pushes from 10 to 100 requests/second over 60 seconds.

effective_rate = observed_rate × success_rate

If P99 exceeds 2 seconds, a penalty is applied:

if P99 > 2s:
    effective_rate = effective_rate × max(0.5, 1.0 − (P99 − 2.0) / 8.0)

The score is the effective rate capped at 100:

RampUp = min(100, effective_rate)
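Combining the three steps above into one sketch (a hypothetical helper, not the actual run-arena.py code):

```python
def rampup_score(observed_rate: float, success_rate: float, p99: float) -> float:
    """Effective req/s, penalized for high P99, capped at 100."""
    effective = observed_rate * success_rate
    if p99 > 2.0:
        # Linear penalty from 1.0 at P99 = 2s down to a 0.5 floor at P99 >= 6s
        effective = effective * max(0.5, 1.0 - (p99 - 2.0) / 8.0)
    return min(100.0, effective)

rampup_score(95.0, 1.0, 0.8)   # healthy P99: no penalty -> 95.0
rampup_score(95.0, 1.0, 4.0)   # penalty factor 0.75 -> 71.25
```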

Weight Summary​

| Weight | Dimension | What It Rewards |
|---|---|---|
| 0.20 | Burst50 | Medium-scale burst handling |
| 0.20 | Burst100 | Heavy-scale burst handling |
| 0.15 | Availability | Request success rate |
| 0.15 | RampUp | Throughput scaling |
| 0.10 | Base | Low single-client latency |
| 0.10 | Error | Error-free operation |
| 0.10 | Consistency | Predictable response times |

Burst handling is weighted highest (40% combined) because API gateways in production face bursty traffic patterns more than sustained load.
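The composite then reduces to a single weighted sum. A sketch with a hypothetical set of dimension scores:

```python
# Layer 0 composite weights (sum to 1.0)
WEIGHTS = {
    "base": 0.10, "burst50": 0.20, "burst100": 0.20,
    "availability": 0.15, "error": 0.10, "consistency": 0.10, "rampup": 0.15,
}

def composite_score(dims: dict) -> float:
    """Weighted sum of the seven dimension scores (each 0-100)."""
    return sum(w * dims[name] for name, w in WEIGHTS.items())

# A hypothetical gateway: perfect everywhere except burst handling
example = composite_score({
    "base": 100, "burst50": 80, "burst100": 60,
    "availability": 100, "error": 100, "consistency": 100, "rampup": 100,
})   # 10 + 16 + 12 + 15 + 10 + 10 + 15 = 88
```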


Layer 1: Enterprise AI Readiness​

8 Dimensions​

| # | Dimension | Weight | Scenario | Latency Cap |
|---|---|---|---|---|
| 1 | MCP Discovery | 0.15 | ent_mcp_discovery | 500ms |
| 2 | MCP Tool Execution | 0.20 | ent_mcp_toolcall | 500ms |
| 3 | Auth Chain | 0.15 | ent_auth_chain | 1s |
| 4 | Policy Engine | 0.15 | ent_policy_eval | 200ms |
| 5 | AI Guardrails | 0.10 | ent_guardrails | 1s |
| 6 | Rate Limiting | 0.10 | ent_quota_burst | 1s |
| 7 | Resilience | 0.10 | ent_resilience | 1s |
| 8 | Agent Governance | 0.05 | ent_governance | 2s |

Per-Dimension Scoring​

Each dimension score is a weighted blend of availability and latency:

availability_score = passes / (passes + fails) × 100
latency_score = max(0, min(100, 100 × (1 − P95 / cap)))
dimension_score = 0.6 × availability_score + 0.4 × latency_score

Availability is weighted higher (60%): an AI gateway that doesn't support a feature fails its checks and scores near 0 regardless of how fast it responds.
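As a sketch, with the pass/fail counts and P95 assumed to come from the k6 summary:

```python
def dimension_score(passes: int, fails: int, p95: float, cap: float) -> float:
    """60/40 blend of check availability and P95 latency for one dimension."""
    availability_score = 100.0 * passes / (passes + fails)
    latency_score = max(0.0, min(100.0, 100.0 * (1.0 - p95 / cap)))
    return 0.6 * availability_score + 0.4 * latency_score

# Hypothetical tool-call dimension (cap 500ms): every check passes,
# P95 sits at half the cap -> 0.6*100 + 0.4*50
blended = dimension_score(100, 0, 0.25, 0.5)
```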

Enterprise Readiness Index (ERI)​

The composite score is the weighted sum of all dimensions:

ERI = Σ (weight_i × dimension_score_i)

Gateways without MCP support score 0, not N/A. This is intentional: the absence of a capability is a concrete, honest measurement. Any gateway can implement MCP and re-run the benchmark.

MCP Protocol Support​

The benchmark supports two MCP protocol variants:

| Protocol | Config Value | Endpoint Pattern | Used By |
|---|---|---|---|
| STOA REST | "stoa" | GET /mcp/capabilities, POST /mcp/tools/list, POST /mcp/tools/call | STOA Gateway |
| Streamable HTTP | "streamable-http" | POST /mcp with JSON-RPC 2.0 envelope | Gravitee 4.8+ |

The k6 script reads MCP_PROTOCOL to switch between request formats. Both protocols are tested with the same scoring formula.
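The shape of the two request formats can be sketched as follows (illustrative only; paths follow the endpoint patterns in the table above, the Streamable HTTP body uses a standard JSON-RPC 2.0 envelope, and the exact payloads in benchmark-enterprise.js may differ):

```python
import json

def build_tools_list_request(protocol: str, mcp_base: str):
    """Return (method, url, body) for a tools/list probe under each protocol."""
    if protocol == "stoa":
        # STOA REST: one path per MCP operation, plain JSON body
        return ("POST", f"{mcp_base}/tools/list", json.dumps({}))
    if protocol == "streamable-http":
        # Streamable HTTP: single endpoint, JSON-RPC 2.0 envelope
        envelope = {"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}}
        return ("POST", mcp_base, json.dumps(envelope))
    raise ValueError(f"unknown MCP_PROTOCOL: {protocol}")
```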


Statistical Methods​

Median Selection​

All per-scenario metrics (P50, P95, P99, pass/fail counts) are computed as the median of valid runs. Median is preferred over mean because it is robust to a single outlier run caused by GC pauses, noisy neighbors, or transient network issues.

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    if len(s) % 2:
        return s[mid]                    # odd count: middle element
    return (s[mid - 1] + s[mid]) / 2     # even count: mean of the two middle values

For Layer 0: 5 runs, 1 discarded → 4 valid runs → median of 4. For Layer 1: 3 runs, 1 discarded → 2 valid runs → median of 2 (for two values, the median equals their mean).

CI95 Confidence Intervals​

Confidence intervals use Student's t-distribution (not z-scores) because sample sizes are small (n=2–4). The t-distribution has heavier tails, producing wider (more honest) intervals for small samples.

CI95 = mean ± t(α/2, df) × (stddev / √n)

where:
df = n − 1 (degrees of freedom)
t(α/2, df) (t-critical value for 95% confidence)
stddev = √(Σ(xi − mean)² / (n − 1)) (sample standard deviation)

t-Critical Values​

| df | t-critical | Typical n |
|---|---|---|
| 1 | 12.706 | 2 runs (Layer 1) |
| 2 | 4.303 | 3 runs |
| 3 | 3.182 | 4 runs (Layer 0) |
| 4 | 2.776 | 5 runs |
| ≥120 | 1.96 | Large samples (≈ z-score) |

CI95 bounds are clamped to [0, 100] for composite scores.
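A sketch of the whole CI95 computation, using the t-critical table above and the [0, 100] clamp (illustrative; the scorer's actual code may differ):

```python
import math

# t-critical values for 95% two-sided confidence, keyed by df = n - 1
T_CRIT = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}

def ci95(scores):
    """CI95 for a list of composite scores, clamped to [0, 100]."""
    n = len(scores)
    mean = sum(scores) / n
    stddev = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    margin = T_CRIT[n - 1] * stddev / math.sqrt(n)
    return (max(0.0, mean - margin), min(100.0, mean + margin))

lo, hi = ci95([84.0, 86.0, 85.0, 87.0])   # 4 valid Layer 0 runs: df = 3
```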

Why Not z-Scores?​

With n=4 runs, using z=1.96 instead of t=3.182 would produce confidence intervals roughly 40% narrower (1.96 / 3.182 ≈ 0.62), giving a false impression of precision. The t-distribution correctly accounts for the uncertainty inherent in small sample sizes.


Architecture​

Infrastructure​

K8s CronJob (OVH MKS, co-located):
run-arena.sh (orchestrator)
└── For each gateway (3 K8s + N VPS):
    └── For each run (5):
        ├── k6 run (warmup) → discard
        └── k6 run (7 scenarios) → JSON summaries
run-arena.py (scorer)
└── Reads JSON → median → composite score → CI95 → Prometheus text
└── curl POST → Pushgateway

VPS Sidecar (host cron, co-located):
Same Docker image + scripts
Benchmarks 1 local gateway → pushes to Pushgateway

Echo Backend​

All gateways proxy to a shared echo server: an nginx container returning a static JSON payload in <1ms:

{"status": "ok", "server": "echo-k8s"}

This ensures benchmarks measure gateway overhead only, not backend performance or network latency. The echo server runs on the same cluster/host as the gateway being tested.

Current Participants​

In-Cluster (K8s, OVH MKS)

| Gateway | Proxy Port | Health Endpoint | Backend |
|---|---|---|---|
| STOA | 8080 | /health | echo-backend:8888 |
| Kong DB-less | 8000 | :8001/status | echo-backend:8888 |
| Gravitee APIM | 8082 | :18082/_node/health | echo-backend:8888 |

VPS (Co-located Sidecars)​

| Gateway | Host | Proxy Port | Backend |
|---|---|---|---|
| STOA | 51.83.45.13 | 8080 | echo-local:8888 |
| Kong | 51.83.45.13 | 8000 | echo-local:8888 |
| Gravitee | 54.36.209.237 | 8082 | echo-local:8888 |

Prometheus Metrics​

Layer 0​

| Metric | Type | Labels | Description |
|---|---|---|---|
| gateway_arena_score | gauge | gateway | Composite score (0–100) |
| gateway_arena_score_stddev | gauge | gateway | Run-to-run standard deviation |
| gateway_arena_score_ci_lower | gauge | gateway | CI95 lower bound |
| gateway_arena_score_ci_upper | gauge | gateway | CI95 upper bound |
| gateway_arena_runs | gauge | gateway | Number of valid runs |
| gateway_arena_availability | gauge | gateway | Health check success rate (0–1) |
| gateway_arena_p50_seconds | gauge | gateway, scenario | Median latency per scenario |
| gateway_arena_p95_seconds | gauge | gateway, scenario | P95 latency per scenario |
| gateway_arena_p99_seconds | gauge | gateway, scenario | P99 latency per scenario |
| gateway_arena_ramp_rate | gauge | gateway | Peak sustained req/s |
| gateway_arena_requests_total | gauge | gateway, scenario, status | Request counts |

Each latency metric also has _ci_lower_seconds and _ci_upper_seconds variants with the same labels.

Layer 1​

| Metric | Type | Labels | Description |
|---|---|---|---|
| gateway_arena_enterprise_score | gauge | gateway | Enterprise Readiness Index (0–100) |
| gateway_arena_enterprise_dimension | gauge | gateway, dimension | Per-dimension score (0–100) |
| gateway_arena_enterprise_score_ci_lower | gauge | gateway | CI95 lower bound |
| gateway_arena_enterprise_score_ci_upper | gauge | gateway | CI95 upper bound |
| gateway_arena_enterprise_score_stddev | gauge | gateway | Run-to-run standard deviation |
| gateway_arena_enterprise_runs | gauge | gateway | Valid enterprise run count |
| gateway_arena_enterprise_latency_p95 | gauge | gateway, dimension | P95 latency per dimension |

Adding a New Gateway​

K8s (In-Cluster)​

  1. Deploy the gateway in the stoa-system namespace with a Service
  2. Configure a route to the echo backend: http://echo-backend.stoa-system.svc:8888
  3. Add an entry to the GATEWAYS JSON in k8s/arena/cronjob-prod.yaml:
{
  "name": "my-gateway",
  "health": "http://my-gw-svc.stoa-system.svc:PORT/health",
  "proxy": "http://my-gw-svc.stoa-system.svc:PORT/echo/get"
}
  4. For Layer 1, add MCP fields:
{
  "name": "my-gateway",
  "target": "http://my-gw-svc:PORT",
  "health": "http://my-gw-svc:PORT/health",
  "mcp_base": "http://my-gw-svc:PORT/mcp",
  "mcp_protocol": "stoa"
}
  5. Update the ConfigMap and run a manual benchmark:
kubectl create configmap gateway-arena-scripts \
--from-file=scripts/traffic/arena/benchmark.js \
--from-file=scripts/traffic/arena/run-arena.sh \
--from-file=scripts/traffic/arena/run-arena.py \
-n stoa-system --dry-run=client -o yaml | kubectl apply -f -

kubectl create job --from=cronjob/gateway-arena arena-test-$(date +%s) -n stoa-system

VPS (Co-located Sidecar)​

  1. Deploy the gateway on the VPS
  2. Deploy the echo container on the same Docker network
  3. Add VPS configuration to deploy/vps/bench/deploy.sh
  4. Run: ./deploy/vps/bench/deploy.sh

Verification​

After adding a gateway, verify metrics appear in Pushgateway:

curl -s https://pushgateway.gostoa.dev/metrics | grep 'gateway="my-gateway"'

Score Interpretation​

| Score | Rating | Typical Cause |
|---|---|---|
| > 95 | Excellent | Co-located gateway with minimal overhead (Rust, C, optimized nginx) |
| 80–95 | Good | Well-configured gateway, normal for production setups |
| 60–80 | Acceptable | Check resource constraints, network hops, or JVM tuning |
| < 60 | Investigate | Connection failures, high error rates, or misconfiguration |

Reading CI95 Bounds​

  • Narrow bounds (e.g., [82, 88]): stable, reproducible results
  • Wide bounds (e.g., [60, 95]): high run-to-run variance β€” investigate noisy neighbors, GC pauses, or cold-start effects
  • Bounds crossing a threshold (e.g., [78, 92] crossing 80): the gateway's true performance is ambiguous at that threshold β€” more runs would narrow the interval

Reproducibility​

All Arena scripts are in the scripts/traffic/arena/ directory of the stoa repository.

| File | Purpose |
|---|---|
| benchmark.js | k6 scenario definitions (Layer 0) |
| benchmark-enterprise.js | k6 enterprise scenarios (Layer 1) |
| run-arena.sh | Shell orchestrator (Layer 0) |
| run-arena-enterprise.sh | Shell orchestrator (Layer 1) |
| run-arena.py | Python scorer with CI95 (Layer 0) |
| run-arena-enterprise.py | Python scorer with CI95 (Layer 1) |
| Dockerfile | Arena image: k6 0.54.0 + jq + curl + bash + python3 |

Running Locally​

# Build the arena image
docker build -t arena-bench scripts/traffic/arena/

# Run Layer 0 against a local gateway
docker run --rm \
  -e GATEWAYS='[{"name":"local","health":"http://host.docker.internal:8080/health","proxy":"http://host.docker.internal:8080/echo/get"}]' \
  -e RUNS=3 \
  -e DISCARD_FIRST=1 \
  arena-bench /scripts/run-arena.sh

Running on Kubernetes​

# Layer 0 β€” one-off
kubectl create job --from=cronjob/gateway-arena arena-manual -n stoa-system
kubectl logs -n stoa-system -l job-name=arena-manual --follow

# Layer 1 β€” one-off
kubectl create job --from=cronjob/gateway-arena-enterprise arena-ent-manual -n stoa-system
kubectl logs -n stoa-system -l job-name=arena-ent-manual --follow

# Clean up
kubectl delete job arena-manual arena-ent-manual -n stoa-system

Limitations and Known Biases​

  • Network locality β€” K8s gateways run on the same cluster as k6 (co-located). VPS gateways run on the same host. Cross-network benchmarks are not comparable due to variable latency.
  • JVM warm-up β€” The warmup run mitigates cold-start bias for JVM-based gateways (Kong/OpenResty, Gravitee), but 50 iterations may not fully warm all code paths.
  • Small sample size β€” Layer 1 uses only 2 valid runs (3 total, 1 discarded). CI95 bounds with df=1 use t=12.706, producing very wide intervals. This is statistically correct but limits precision.
  • MCP scoring bias β€” Layer 1 inherently favors gateways with MCP support. This is by design: the benchmark measures AI-native capabilities. Layer 0 provides the complementary proxy baseline.
  • Echo backend assumes GET β€” All proxy scenarios use HTTP GET against the echo backend. POST/PUT patterns with request body parsing are not benchmarked.

Feature comparisons are based on tests run under identical conditions as of the date noted above. Gateway capabilities change frequently. We encourage readers to verify current performance with their own workloads. All trademarks belong to their respective owners. See trademarks.