A purpose-built, scenario-driven load generator for LLM gateways — any OpenAI- or Anthropic-compatible endpoint. Decoupled from the system under test (zero DB dependency). Designed to scale to tens of thousands of concurrent virtual users / requests from one host.
The goal is to simulate real traffic — a weighted blend of short lookups, multi-turn chats, and heavy long-context conversations — not to fire N identical requests. Realistic shape is what makes the numbers mean something. It models the things a generic HTTP tool ignores: multi-turn conversations (growing context), streaming with time-to-first-token (TTFT) and inter-token latency (ITL), token throughput, and response/prompt caching that silently distorts results.
Full design rationale: DESIGN.md.
It is a standard Go module — install the binary or run from source:
```sh
# install the CLI
go install github.com/alphaBitCore/loadtest/cmd/loadtest@latest
# or build / run from a checkout
go build -o loadtest ./cmd/loadtest
./loadtest -config profiles/realistic.json -vk <api-key> -out runs/
# or:
go run ./cmd/loadtest -config profiles/realistic.json -vk <api-key> -out runs/
```

Flags:
| flag | meaning |
|---|---|
| `-config` | profile path (required); see `profiles/` |
| `-vk` | sets `Authorization: Bearer <vk>` on every scenario |
| `-target` | override the target URL (so one profile hits any gateway) |
| `-model` | override the model string, e.g. for a gateway that needs a `provider/model` form |
| `-stages` | override stages, e.g. `1:10s,100:60s,1000:120s` |
| `-out` | output directory |
| `-compare` | regression-compare two `summary.json` files (no load run) |
| `-regress-pct` | regression threshold percent for `-compare` (default 10) |
| `-header` | extra request header `K: V`, repeatable; applied to all scenarios after `-vk` (e.g. Portkey routing headers) |
| `-version` | print build version and exit |
Profiles ship with a `localhost` target and a `REPLACE_WITH_VK` placeholder; never commit a real domain or key. Point `-target`/`-vk` at your own gateway.
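The `-stages` value is a comma-separated list of `LEVEL:DURATION` pairs. As a rough illustration of that format, here is a minimal parsing sketch. It is not the tool's actual parser: the `Stage` type, the `ParseStages` name, and the reading of the `@` prefix as "open-loop" are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Stage is a hypothetical representation of one step: a level held for a duration.
type Stage struct {
	Level    int           // concurrency; assumed to be a rate when OpenLoop is set
	Duration time.Duration // how long the level is held
	OpenLoop bool          // assumption: set when the level is written as "@N"
}

// ParseStages parses a spec such as "1:10s,100:60s,1000:120s".
func ParseStages(spec string) ([]Stage, error) {
	var stages []Stage
	for _, part := range strings.Split(spec, ",") {
		level, dur, ok := strings.Cut(strings.TrimSpace(part), ":")
		if !ok {
			return nil, fmt.Errorf("stage %q: want LEVEL:DURATION", part)
		}
		open := strings.HasPrefix(level, "@")
		n, err := strconv.Atoi(strings.TrimPrefix(level, "@"))
		if err != nil || n <= 0 {
			return nil, fmt.Errorf("stage %q: bad level", part)
		}
		d, err := time.ParseDuration(dur)
		if err != nil || d <= 0 {
			return nil, fmt.Errorf("stage %q: bad duration", part)
		}
		stages = append(stages, Stage{Level: n, Duration: d, OpenLoop: open})
	}
	return stages, nil
}

func main() {
	stages, err := ParseStages("1:10s,100:60s,1000:120s")
	if err != nil {
		panic(err)
	}
	for _, s := range stages {
		fmt.Printf("level=%d duration=%s open_loop=%v\n", s.Level, s.Duration, s.OpenLoop)
	}
}
```

Rejecting zero or negative levels and durations up front keeps a typo in the spec from silently producing an empty stage.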
The engine is importable — the layers are real packages, not just files:
```
github.com/alphaBitCore/loadtest/protocol      // wire-format adapters (self-registering)
github.com/alphaBitCore/loadtest/config        // declarative JSON profile + validation
github.com/alphaBitCore/loadtest/metrics       // lock-free HDR aggregator
github.com/alphaBitCore/loadtest/engine        // closed/open-loop runners
github.com/alphaBitCore/loadtest/report        // text/JSON/CSV/Prom + regression compare
github.com/alphaBitCore/loadtest/cmd/loadtest  // the CLI (thin wiring)
```
profiles/realistic.json is the flagship profile. It blends three conversation
tiers by weight, most of them multi-turn (a multi-turn VU feeds the assistant
reply back into the next turn, so context — and cost, and latency — grows like a
real session):
| tier | weight | turns | stream | shape |
|---|---|---|---|---|
| simple | 55% | 1 | no | short single-shot lookups (the bulk of real traffic) |
| medium | 30% | 4 | yes | a real multi-turn technical chat that builds context |
| complex | 15% | 3 | yes | large system prompt + multi-turn architecture/reasoning dialogue, long output |
Why tiers matter: a single prompt size hides real behaviour — small bodies make
extraction/scanning/cost look free, while large bodies and long outputs exercise
the long-context path, token throughput, and per-byte costs. think_time adds a
pause between turns to model a real user.
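One common way to blend tiers by weight is a cumulative-weight draw. The sketch below illustrates that technique with the 55/30/15 split from the table; the `Tier` type and the `Pick`/`pickAt` helpers are hypothetical names for this example and are not taken from the engine.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Tier is a hypothetical scenario descriptor: a name plus an integer weight.
type Tier struct {
	Name   string
	Weight int
}

// pickAt maps a point in [0, totalWeight) onto a tier by walking cumulative weights.
func pickAt(tiers []Tier, point int) Tier {
	for _, t := range tiers {
		if point < t.Weight {
			return t
		}
		point -= t.Weight
	}
	return tiers[len(tiers)-1] // only reached if point >= total weight
}

// Pick draws one tier at random, proportionally to its weight.
func Pick(tiers []Tier, rng *rand.Rand) Tier {
	total := 0
	for _, t := range tiers {
		total += t.Weight
	}
	return pickAt(tiers, rng.Intn(total))
}

func main() {
	tiers := []Tier{{"simple", 55}, {"medium", 30}, {"complex", 15}}
	rng := rand.New(rand.NewSource(1)) // fixed seed so a run is repeatable
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[Pick(tiers, rng).Name]++
	}
	fmt.Println(counts) // counts should land near the 55/30/15 split
}
```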
content.mode per scenario:
- `pool`: random prompt from a list (varied single-shots).
- `scripted`: a fixed, coherent dialogue (turn i uses `script[i]`).
- `sized`: generate ~`approx_input_tokens` of input (controls input size).
When comparing gateways (e.g. gateway A vs B vs C), run them in the same wall-clock window, not one after another.
The upstream provider is the dominant latency term and it drifts minute to minute. If you benchmark gateway A, then B, then C sequentially, each samples a different upstream window — so a slow upstream draw during A's turn looks like "A is slow" when it isn't. That sampling unfairness can manufacture a 2–3× "difference" that is pure provider noise.
Run all gateways co-located, simultaneously: they then sample identical upstream conditions concurrently, the shared upstream latency is common-mode, and it cancels out when you take the per-gateway delta. What's left — the TTFT delta between gateways — is ≈ the gateway's own overhead.
multi-service.sh does exactly this: it launches the same profile against every
gateway in the same window each round and prints a per-round delta table.
```sh
# edit the SERVICES array in the script, then:
GATEWAY_A_KEY=... GATEWAY_B_KEY=... ./multi-service.sh profiles/realistic.json 50:60s 3
```

```
gateway     ttft_p50  ttft_p95  lat_p95  rps   err
-------     --------  --------  -------  ----  -----
gateway-a   1107      1702      2084     38.0  0.00%
gateway-b   1204      1919      2231     38.0  0.00%

ttft_p50 delta vs fastest (= gateway overhead; upstream is shared this window):
gateway-a   +   0 ms
gateway-b   +  97 ms
```
Gateways under test are defined in the SERVICES array (name|target|model|bearer);
keys come from env vars, never hardcoded. Even simultaneous numbers are only fair
when the execution substrate matches — e.g. comparing a native process against a
gateway running inside a Docker VM is not apples-to-apples.
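The "delta vs fastest" column is simple arithmetic: subtract the lowest TTFT p50 in the round from every gateway's TTFT p50. A minimal sketch using the two sample rows above follows; the `Result` type and `DeltaVsFastest` function are illustrative names, not part of `multi-service.sh`.

```go
package main

import "fmt"

// Result is a hypothetical per-gateway summary for one round.
type Result struct {
	Name    string
	TTFTp50 float64 // milliseconds
}

// DeltaVsFastest returns each gateway's TTFT p50 minus the lowest TTFT p50 in the round.
// Because all gateways ran in the same window, the shared upstream latency cancels out.
func DeltaVsFastest(round []Result) map[string]float64 {
	deltas := make(map[string]float64, len(round))
	if len(round) == 0 {
		return deltas
	}
	fastest := round[0].TTFTp50
	for _, r := range round[1:] {
		if r.TTFTp50 < fastest {
			fastest = r.TTFTp50
		}
	}
	for _, r := range round {
		deltas[r.Name] = r.TTFTp50 - fastest
	}
	return deltas
}

func main() {
	round := []Result{{"gateway-a", 1107}, {"gateway-b", 1204}}
	deltas := DeltaVsFastest(round)
	for _, r := range round {
		fmt.Printf("%-10s +%4.0f ms\n", r.Name, deltas[r.Name])
	}
}
```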
A gateway-overhead benchmark must isolate the raw forwarding path. Two features distort it, so turn them off on every gateway under test:
- Response cache. A cache hit returns without calling the provider, which collapses latency and inflates throughput; even on a miss the cache lookup adds work. `cache_mode: bust` (default) prefixes every conversation with a unique UUID so the provider/gateway cache always misses, but for a clean baseline also disable the response cache so the lookup cost and any accidental hits are out of the picture.
- Compliance / PII hooks. Request and response hooks add per-request work (content extraction, scanning, rewrite) whose cost scales with body size. Enable them only when you are specifically measuring hook cost.
Measure the raw path first; then measure each feature as a delta on top of that baseline.
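To make the cache-busting idea concrete, here is a minimal sketch of prefixing a prompt with a per-conversation UUID. The bracketed prefix format and the `newUUID`/`BustPrompt` names are assumptions for illustration; the tool's real prefix format is not shown in this document.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random (version 4) UUID string built from crypto/rand.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// BustPrompt prefixes the prompt with the conversation UUID (hypothetical format)
// so that no two conversations share the same leading bytes and a cache keyed on
// the request body cannot hit.
func BustPrompt(convUUID, prompt string) string {
	return "[" + convUUID + "] " + prompt
}

func main() {
	id, err := newUUID()
	if err != nil {
		panic(err)
	}
	fmt.Println(BustPrompt(id, "What is the capital of France?"))
}
```

Putting the unique token at the very start of the conversation also defeats prefix-based prompt caches, which match on the leading portion of the input.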
See profiles/realistic.json (the flagship mixed workload),
profiles/ai-gateway.json (a minimal example), and profiles/cross-ingress.json
(two ingress protocols of one gateway at once). Schema is documented in
DESIGN.md §Config. With no scenarios block, the top-level defaults form a
single implicit scenario.
For gateway-overhead comparisons against a mock upstream, prompt SIZE decides what you measure, so the perf profiles come in three sized tiers. Run all three and report each separately — a single size hides half the story (a large body makes every gateway JSON-parse-bound and masks routing-core differences; a tiny body exposes them). Sizes follow the industry-standard shapes so numbers are comparable to published benchmarks (NVIDIA/vLLM 128:128, LLMPerf/Anyscale mean-550 chat).
| Profile (non-SSE / SSE) | input/output tokens | Standard | Measures |
|---|---|---|---|
| `perf-128-nonstream.json` / `perf-128-stream.json` | 128 / 128 | NVIDIA, vLLM | Routing-core overhead: fixed per-request cost (auth/quota/routing/audit); finds the real RPS ceiling (tens of thousands) |
| `perf-550-nonstream.json` / `perf-550-stream.json` | 550 / 150 | LLMPerf, Anyscale | Realistic chat: the recommended headline number, balanced |
| `perf-hotpath-nonstream.json` / `perf-hotpath-stream.json` | ~12.5k / 64 | (long-context/RAG) | Large-body path: JSON-parse/forward bound; body handling dominates |
Each tier ships in both forms: *-nonstream (non-SSE — reports throughput/RPS) and
*-stream (SSE — reports TTFT, the first-byte overhead SSE clients feel). Run both.
All three are mode: sized on a mock upstream with cache_mode: bust. At the 128
tier RPS may still be climbing at concurrency 800 — switch to an open-loop sweep
(--stages '@20000:60s,@40000:60s,...') to find the knee. Always confirm the
gateway box CPU is the bottleneck (saturated) — if RPS plateaus while gateway CPU
is idle, you are measuring the mock / load generator / network, not the gateway.
- `results-<ts>.jsonl`: one line per turn, written as it completes (crash-safe): `conv_uuid, scenario, turn, stream, latency_ms, ttft_ms, dns/conn/tls/transfer, status, prompt_tokens, completion_tokens, warmup, err`.
- `report-<ts>.txt`: per-stage × per-scenario table (RPS, ok%, latency p50…p99.9, TTFT p50/p95, ITL, output tokens/s), aggregate, generator-health section, and threshold pass/fail.
- `summary-<ts>.json`: machine-readable (consumed by `multi-service.sh` and `-compare`).
- `report-<ts>.csv`: one row per stage for spreadsheets/notebooks.
- `metrics-<ts>.prom`: Prometheus text exposition (textfile collector / Pushgateway).
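Because the results file is line-delimited JSON, it can be post-processed with a few lines of code. The sketch below decodes a subset of the listed fields and drops warmup rows. The Go struct, the JSON value types (numeric latencies, integer status, boolean warmup), and the two sample lines are assumptions made for this example; check them against a real results file before relying on them.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Turn mirrors a subset of the per-turn fields. The JSON types are assumed.
type Turn struct {
	ConvUUID  string  `json:"conv_uuid"`
	Scenario  string  `json:"scenario"`
	Turn      int     `json:"turn"`
	LatencyMS float64 `json:"latency_ms"`
	TTFTMS    float64 `json:"ttft_ms"`
	Status    int     `json:"status"`
	Warmup    bool    `json:"warmup"`
	Err       string  `json:"err"`
}

// ParseTurns decodes JSONL text, skipping blank lines and warmup rows.
func ParseTurns(jsonl string) ([]Turn, error) {
	var turns []Turn
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var t Turn
		if err := json.Unmarshal([]byte(line), &t); err != nil {
			return nil, fmt.Errorf("bad line: %w", err)
		}
		if !t.Warmup {
			turns = append(turns, t)
		}
	}
	return turns, sc.Err()
}

func main() {
	// Illustrative rows only; not output from a real run.
	sample := `{"conv_uuid":"c1","scenario":"simple","turn":0,"latency_ms":812.5,"status":200,"warmup":true}
{"conv_uuid":"c2","scenario":"medium","turn":1,"latency_ms":1530,"ttft_ms":410,"status":200,"warmup":false}`
	turns, err := ParseTurns(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(turns), turns[0].Scenario, turns[0].TTFTMS)
}
```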
- Per-stage percentiles are the meaningful view for a step test (each stage is one concurrency level). A flat p50 across rising concurrency means the gateway adds no concurrency penalty.
- TTFT is the chat-UX metric (time to first token); output tokens/s is the throughput metric; ITL catches mid-stream stalls TTFT can't see.
- Generator health tells you whether the load tester itself was the bottleneck: it lists the host FD limit, any JSONL sink overflow, and generator-side errors (`gen_port_exhaustion`, `gen_fd_exhaustion`). If those appear, the numbers reflect the harness, not the server; scale out / raise limits and re-run.
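To show why ITL catches stalls that TTFT cannot, here is a small sketch that derives both from token arrival times. It takes arrival offsets in milliseconds since the request was sent; the `StreamTiming` function and that input shape are assumptions for this example, not the tool's internal representation.

```go
package main

import "fmt"

// StreamTiming derives TTFT plus the mean and maximum inter-token latency (ITL)
// from token arrival offsets, given in milliseconds since the request was sent.
func StreamTiming(arrivalsMS []float64) (ttft, meanITL, maxITL float64) {
	if len(arrivalsMS) == 0 {
		return 0, 0, 0
	}
	ttft = arrivalsMS[0]
	sum := 0.0
	for i := 1; i < len(arrivalsMS); i++ {
		gap := arrivalsMS[i] - arrivalsMS[i-1]
		sum += gap
		if gap > maxITL {
			maxITL = gap
		}
	}
	if n := len(arrivalsMS) - 1; n > 0 {
		meanITL = sum / float64(n)
	}
	return ttft, meanITL, maxITL
}

func main() {
	// First token arrives quickly, then the stream stalls for 900 ms mid-way.
	arrivals := []float64{200, 230, 260, 1160, 1190}
	ttft, mean, worst := StreamTiming(arrivals)
	fmt.Printf("ttft=%.0fms mean_itl=%.1fms max_itl=%.0fms\n", ttft, mean, worst)
}
```

In this made-up stream the TTFT looks healthy, while the maximum gap exposes the mid-stream stall.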
- The tool raises its own `RLIMIT_NOFILE` (soft→hard) at startup. If it warns that the limit is still too low, raise `ulimit -n` on the generator host.
- N concurrent connections need ~N+ FDs and ephemeral ports. Watch the generator-health section for `gen_port_exhaustion`.
- The in-memory metrics aggregator is lock-free (atomic HDR histograms, atomic counters, a fixed error-class array, and a `sync.Map` of status-code counters) so request completions never contend on a mutex at high concurrency.
- For a single very large run, one host may not be enough: run multiple generator hosts against the same target and merge the `summary-*.json` files.
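The counter part of such an aggregator can be sketched with the standard library alone. The example below combines an atomic request counter with a `sync.Map` of per-status atomic counters. It is a simplified illustration, not the `metrics` package itself: the `Counters` type and its methods are hypothetical, and HDR histograms are left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Counters is a hypothetical aggregate that takes no explicit mutex on the hot path.
type Counters struct {
	requests atomic.Int64
	status   sync.Map // int status code -> *atomic.Int64
}

// Record registers one completed request with the given status code.
func (c *Counters) Record(status int) {
	c.requests.Add(1)
	v, ok := c.status.Load(status)
	if !ok {
		// First sighting of this code: LoadOrStore keeps whichever counter won the race.
		v, _ = c.status.LoadOrStore(status, new(atomic.Int64))
	}
	v.(*atomic.Int64).Add(1)
}

// Count returns how many requests finished with the given status code.
func (c *Counters) Count(status int) int64 {
	if v, ok := c.status.Load(status); ok {
		return v.(*atomic.Int64).Load()
	}
	return 0
}

// Total returns the number of recorded requests.
func (c *Counters) Total() int64 { return c.requests.Load() }

func main() {
	var c Counters
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				if i%10 == 0 {
					c.Record(429)
				} else {
					c.Record(200)
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.Total(), c.Count(200), c.Count(429))
}
```

Trying `Load` before `LoadOrStore` avoids allocating a fresh counter on every call once a status code has been seen, which is the common case under load.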
loadtest -compare old.json,new.json [-regress-pct 10] diffs two runs by stage,
flags RPS drops / tail blow-ups, and exits non-zero on a regression — drop it into
CI to fail a build that slows the gateway down. No load is generated.
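The threshold logic behind such a comparison can be sketched as a percent-change check per stage. The example below is an assumption-laden illustration: the `StageSummary` fields, the choice of p99 as the tail metric, and the `Regressed` function are hypothetical and do not describe the actual `summary.json` schema or the tool's comparison rules.

```go
package main

import "fmt"

// StageSummary is a hypothetical slice of one stage from a summary file.
type StageSummary struct {
	RPS   float64
	P99MS float64
}

// Regressed reports whether cur is worse than base by more than pct percent,
// either through an RPS drop or through a rise in p99 latency.
func Regressed(base, cur StageSummary, pct float64) (bool, string) {
	if base.RPS > 0 {
		if drop := (base.RPS - cur.RPS) / base.RPS * 100; drop > pct {
			return true, fmt.Sprintf("RPS dropped %.1f%%", drop)
		}
	}
	if base.P99MS > 0 {
		if rise := (cur.P99MS - base.P99MS) / base.P99MS * 100; rise > pct {
			return true, fmt.Sprintf("p99 rose %.1f%%", rise)
		}
	}
	return false, ""
}

func main() {
	base := StageSummary{RPS: 1000, P99MS: 400}
	cur := StageSummary{RPS: 980, P99MS: 520}
	if bad, why := Regressed(base, cur, 10); bad {
		fmt.Println("regression:", why)
		// A CI wrapper would exit non-zero at this point.
	}
}
```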
Each turn carries its conversation UUID (in the prompt and the x-request-id
header). If your gateway logs each request to a database, join.sh is a template
that correlates the client results.jsonl against the server-side rows by that
UUID — putting client-observed latency next to server-measured latency, upstream
time, tokens, and cache status. That is the way to tell whether a client-observed
delta is real gateway overhead or something outside the handler (accept/TLS/
scheduling, upstream variance). Entirely optional — the load tester itself has no
DB dependency; edit join.sh to match your schema.
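The correlation itself is an ordinary keyed join. The sketch below shows the idea in memory, assuming one server row per request ID; a multi-turn conversation reuses its UUID, so a real join would need an extra key such as the turn index. All type and field names here are hypothetical and do not reflect `join.sh` or any particular gateway schema.

```go
package main

import "fmt"

// ClientRow is a hypothetical row taken from the client results file.
type ClientRow struct {
	ConvUUID  string
	LatencyMS float64
}

// ServerRow is a hypothetical row from the gateway's request log.
type ServerRow struct {
	RequestID  string
	HandlerMS  float64 // time spent inside the gateway handler, upstream included
	UpstreamMS float64 // time spent waiting on the provider
}

// Joined places client-observed and server-measured timings side by side.
type Joined struct {
	ConvUUID   string
	OverheadMS float64 // handler time not spent waiting on upstream
	OutsideMS  float64 // client latency the handler never saw (accept/TLS/scheduling)
}

// Join matches client rows to server rows by UUID and skips unmatched rows.
func Join(client []ClientRow, server []ServerRow) []Joined {
	byID := make(map[string]ServerRow, len(server))
	for _, s := range server {
		byID[s.RequestID] = s
	}
	var out []Joined
	for _, c := range client {
		s, ok := byID[c.ConvUUID]
		if !ok {
			continue
		}
		out = append(out, Joined{
			ConvUUID:   c.ConvUUID,
			OverheadMS: s.HandlerMS - s.UpstreamMS,
			OutsideMS:  c.LatencyMS - s.HandlerMS,
		})
	}
	return out
}

func main() {
	client := []ClientRow{{"c1", 1250}, {"c2", 900}}
	server := []ServerRow{{"c1", 1200, 1150}}
	for _, j := range Join(client, server) {
		fmt.Printf("%s overhead=%.0fms outside=%.0fms\n", j.ConvUUID, j.OverheadMS, j.OutsideMS)
	}
}
```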