loadtest — LLM gateway stress tester

A purpose-built, scenario-driven load generator for LLM gateways — any OpenAI- or Anthropic-compatible endpoint. Decoupled from the system under test (zero DB dependency). Designed to scale to tens of thousands of concurrent virtual users / requests from one host.

The goal is to simulate real traffic — a weighted blend of short lookups, multi-turn chats, and heavy long-context conversations — not to fire N identical requests. Realistic shape is what makes the numbers mean something. It models the things a generic HTTP tool ignores: multi-turn conversations (growing context), streaming with time-to-first-token (TTFT) and inter-token latency (ITL), token throughput, and response/prompt caching that silently distorts results.

Full design rationale: DESIGN.md.

Install & run

It is a standard Go module — install the binary or run from source:

# install the CLI
go install github.com/alphaBitCore/loadtest/cmd/loadtest@latest

# or build / run from a checkout
go build -o loadtest ./cmd/loadtest
./loadtest -config profiles/realistic.json -vk <api-key> -out runs/
# or:
go run ./cmd/loadtest -config profiles/realistic.json -vk <api-key> -out runs/

Flags:

flag	meaning
`-config`	profile path (required) — see `profiles/`
`-vk`	sets `Authorization: Bearer <vk>` on every scenario
`-target`	override the target URL (so one profile hits any gateway)
`-model`	override the model string — e.g. a gateway that needs a `provider/model` form
`-stages`	override stages, e.g. `1:10s,100:60s,1000:120s`
`-out`	output directory
`-compare`	regression-compare two `summary.json` files (no load run)
`-regress-pct`	regression threshold percent for `-compare` (default 10)
`-header`	extra request header `K: V`, repeatable — applied to all scenarios after `-vk` (e.g. Portkey routing headers)
`-version`	print build version and exit
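
To make the `-stages` format concrete, here is a minimal sketch of how a spec such as `1:10s,100:60s,1000:120s` could be parsed. The `Stage` type and `parseStages` function are hypothetical names, not the tool's API. Treating an `@` prefix as an open-loop arrival rate is an assumption inferred from the open-loop sweep example later in this README.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Stage is one step of a load ramp (hypothetical type for illustration).
// Level is a concurrency for closed-loop stages, or an arrival rate when
// OpenLoop is true (assumed meaning of the "@" prefix).
type Stage struct {
	Level    int
	OpenLoop bool
	Duration time.Duration
}

// parseStages parses a comma-separated list of LEVEL:DURATION entries.
func parseStages(spec string) ([]Stage, error) {
	var stages []Stage
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		level, dur, ok := strings.Cut(part, ":")
		if !ok {
			return nil, fmt.Errorf("stage %q: want LEVEL:DURATION", part)
		}
		open := strings.HasPrefix(level, "@")
		level = strings.TrimPrefix(level, "@")
		n, err := strconv.Atoi(level)
		if err != nil || n <= 0 {
			return nil, fmt.Errorf("stage %q: bad level", part)
		}
		d, err := time.ParseDuration(dur)
		if err != nil || d <= 0 {
			return nil, fmt.Errorf("stage %q: bad duration", part)
		}
		stages = append(stages, Stage{Level: n, OpenLoop: open, Duration: d})
	}
	return stages, nil
}

func main() {
	stages, err := parseStages("1:10s,100:60s,1000:120s")
	if err != nil {
		panic(err)
	}
	for _, s := range stages {
		fmt.Printf("level=%d open=%v for %s\n", s.Level, s.OpenLoop, s.Duration)
	}
}
```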

Profiles ship with a localhost target and a REPLACE_WITH_VK placeholder — never commit a real domain or key. Point -target/-vk at your own gateway.

Using it as a library

The engine is importable — the layers are real packages, not just files:

github.com/alphaBitCore/loadtest/protocol  // wire-format adapters (self-registering)
github.com/alphaBitCore/loadtest/config    // declarative JSON profile + validation
github.com/alphaBitCore/loadtest/metrics   // lock-free HDR aggregator
github.com/alphaBitCore/loadtest/engine    // closed/open-loop runners
github.com/alphaBitCore/loadtest/report    // text/JSON/CSV/Prom + regression compare
github.com/alphaBitCore/loadtest/cmd/loadtest // the CLI (thin wiring)

Realistic workloads (simple / medium / complex, multi-turn)

profiles/realistic.json is the flagship profile. It blends three conversation tiers by weight, most of them multi-turn (a multi-turn VU feeds the assistant reply back into the next turn, so context — and cost, and latency — grows like a real session):

tier	weight	turns	stream	shape
simple	55%	1	no	short single-shot lookups (the bulk of real traffic)
medium	30%	4	yes	a real multi-turn technical chat that builds context
complex	15%	3	yes	large system prompt + multi-turn architecture/reasoning dialogue, long output

Why tiers matter: a single prompt size hides real behaviour — small bodies make extraction/scanning/cost look free, while large bodies and long outputs exercise the long-context path, token throughput, and per-byte costs. think_time adds a pause between turns to model a real user.
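
The 55/30/15 blend from the table above can be pictured as a weighted draw per new conversation. The sketch below is one common way to do that (walking cumulative weights). `Tier` and `pickTier` are hypothetical names, and the tool's real scheduler may work differently.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Tier is a scenario name with its integer weight (illustrative type).
type Tier struct {
	Name   string
	Weight int
}

// pickTier maps a draw r in [0, sum of weights) onto a tier by walking
// the cumulative weights in order.
func pickTier(tiers []Tier, r int) string {
	for _, t := range tiers {
		if r < t.Weight {
			return t.Name
		}
		r -= t.Weight
	}
	// Only reached if r is out of range; fall back to the last tier.
	return tiers[len(tiers)-1].Name
}

func main() {
	tiers := []Tier{{"simple", 55}, {"medium", 30}, {"complex", 15}}
	rng := rand.New(rand.NewSource(1)) // fixed seed so the sketch is repeatable
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickTier(tiers, rng.Intn(100))]++
	}
	// Counts should land near 5500 / 3000 / 1500.
	fmt.Println(counts)
}
```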

content.mode per scenario:

pool — random prompt from a list (varied single-shots).
scripted — a fixed, coherent dialogue (turn i uses script[i]).
sized — generate ~approx_input_tokens of input (controls input size).
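
As a rough illustration of the three modes listed above, the sketch below picks the prompt for a given turn. The `Content` struct and its field names are hypothetical, not the profile schema (see DESIGN.md for that). The 4-characters-per-token ratio used for `sized` is a common rule of thumb for English text, assumed here rather than taken from the tool.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Content is an illustrative stand-in for a scenario's content settings.
type Content struct {
	Mode              string
	Pool              []string
	Script            []string
	ApproxInputTokens int
}

// charsPerToken is an assumed heuristic, not a tokenizer.
const charsPerToken = 4

// promptFor returns the prompt text for one turn under the given mode.
func promptFor(c Content, turn int, rng *rand.Rand) (string, error) {
	switch c.Mode {
	case "pool": // random prompt from a list
		if len(c.Pool) == 0 {
			return "", fmt.Errorf("pool mode needs at least one prompt")
		}
		return c.Pool[rng.Intn(len(c.Pool))], nil
	case "scripted": // turn i uses script[i]
		if turn < 0 || turn >= len(c.Script) {
			return "", fmt.Errorf("no script line for turn %d", turn)
		}
		return c.Script[turn], nil
	case "sized": // filler text of roughly the requested token count
		if c.ApproxInputTokens <= 0 {
			return "", fmt.Errorf("sized mode needs a positive token count")
		}
		const filler = "lorem "
		n := c.ApproxInputTokens * charsPerToken
		return strings.Repeat(filler, n/len(filler)+1)[:n], nil
	default:
		return "", fmt.Errorf("unknown content mode %q", c.Mode)
	}
}

func main() {
	rng := rand.New(rand.NewSource(1))
	sized, _ := promptFor(Content{Mode: "sized", ApproxInputTokens: 128}, 0, rng)
	fmt.Println(len(sized), "characters of filler")
}
```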

Why test multiple services simultaneously

When comparing gateways (e.g. gateway A vs B vs C), run them in the same wall-clock window, not one after another.

The upstream provider is the dominant latency term and it drifts minute to minute. If you benchmark gateway A, then B, then C sequentially, each samples a different upstream window — so a slow upstream draw during A's turn looks like "A is slow" when it isn't. That sampling unfairness can manufacture a 2–3× "difference" that is pure provider noise.

Run all gateways co-located and simultaneously: they then sample identical upstream conditions, so the shared upstream latency is common-mode and cancels out when you take the per-gateway delta. What's left, the TTFT delta between gateways, approximates the difference in the gateways' own overhead (each gateway's extra cost relative to the fastest one).

multi-service.sh does exactly this: it launches the same profile against every gateway in the same window each round and prints a per-round delta table.

# edit the SERVICES array in the script, then:
GATEWAY_A_KEY=... GATEWAY_B_KEY=... ./multi-service.sh profiles/realistic.json 50:60s 3

gateway     ttft_p50  ttft_p95  lat_p95  rps   err
-------     --------  --------  -------  ----  -----
gateway-a   1107      1702      2084     38.0  0.00%
gateway-b   1204      1919      2231     38.0  0.00%

ttft_p50 delta vs fastest (= gateway overhead; upstream is shared this window):
  gateway-a    +    0 ms
  gateway-b    +   97 ms

Gateways under test are defined in the SERVICES array (name|target|model|bearer); keys come from env vars, never hardcoded. Even simultaneous numbers are only fair when the execution substrate matches — e.g. comparing a native process against a gateway running inside a Docker VM is not apples-to-apples.
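
The delta table is simple arithmetic: subtract the fastest gateway's ttft_p50 from each gateway's ttft_p50 for the same round. A minimal sketch using the sample numbers above (`Result` and `deltasVsFastest` are hypothetical names, not part of multi-service.sh):

```go
package main

import "fmt"

// Result holds one gateway's ttft_p50 (ms) for a single round.
type Result struct {
	Name    string
	TTFTp50 int
}

// deltasVsFastest returns each gateway's ttft_p50 minus the round's minimum.
// Because all gateways ran in the same window, the shared upstream latency
// is present in every value and drops out of the difference.
func deltasVsFastest(results []Result) map[string]int {
	out := make(map[string]int, len(results))
	if len(results) == 0 {
		return out
	}
	fastest := results[0].TTFTp50
	for _, r := range results[1:] {
		if r.TTFTp50 < fastest {
			fastest = r.TTFTp50
		}
	}
	for _, r := range results {
		out[r.Name] = r.TTFTp50 - fastest
	}
	return out
}

func main() {
	round := []Result{{"gateway-a", 1107}, {"gateway-b", 1204}}
	d := deltasVsFastest(round)
	for _, r := range round {
		fmt.Printf("%-10s +%5d ms\n", r.Name, d[r.Name])
	}
}
```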

Benchmark hygiene — disable hooks and cache

A gateway-overhead benchmark must isolate the raw forwarding path. Two features distort it, so turn them off on every gateway under test:

Response cache. A cache hit returns without calling the provider, which collapses latency and inflates throughput; even on a miss the cache lookup adds work. cache_mode: bust (default) prefixes every conversation with a unique UUID so the provider/gateway cache always misses — but for a clean baseline also disable the response cache so the lookup cost and any accidental hits are out of the picture.
Compliance / PII hooks. Request and response hooks add per-request work (content extraction, scanning, rewrite) whose cost scales with body size. Enable them only when you are specifically measuring hook cost.

Measure the raw path first; then measure each feature as a delta on top of that baseline.
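
To illustrate the cache-busting idea behind cache_mode: bust, the sketch below generates a random UUID per conversation and prepends it to the prompt so no two conversations share a cache key. The bracketed prefix format and the function names are assumptions for illustration. The tool's actual prefix layout is not specified here.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random version-4 UUID in canonical text form.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// bustCache prefixes a prompt with the conversation UUID so an otherwise
// identical prompt never matches a cached entry (assumed prefix format).
func bustCache(convUUID, prompt string) string {
	return "[" + convUUID + "] " + prompt
}

func main() {
	id, err := newUUID()
	if err != nil {
		panic(err)
	}
	fmt.Println(bustCache(id, "What is the capital of France?"))
}
```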

Config

See profiles/realistic.json (the flagship mixed workload), profiles/ai-gateway.json (a minimal example), and profiles/cross-ingress.json (two ingress protocols of one gateway at once). Schema is documented in DESIGN.md §Config. With no scenarios block, the top-level defaults form a single implicit scenario.

Gateway perf tiers (head-to-head overhead) — pick by prompt size

For gateway-overhead comparisons against a mock upstream, prompt SIZE decides what you measure, so the perf profiles come in three sized tiers. Run all three and report each separately — a single size hides half the story (a large body makes every gateway JSON-parse-bound and masks routing-core differences; a tiny body exposes them). Sizes follow the industry-standard shapes so numbers are comparable to published benchmarks (NVIDIA/vLLM 128:128, LLMPerf/Anyscale mean-550 chat).

Profile (non-SSE / SSE)	input/output tokens	Standard	Measures
`perf-128-nonstream.json` / `perf-128-stream.json`	128 / 128	NVIDIA, vLLM	Routing-core overhead — fixed per-request cost (auth/quota/routing/audit); finds the real RPS ceiling (tens of thousands)
`perf-550-nonstream.json` / `perf-550-stream.json`	550 / 150	LLMPerf, Anyscale	Realistic chat — the recommended headline number, balanced
`perf-hotpath-nonstream.json` / `perf-hotpath-stream.json`	~12.5k / 64	(long-context/RAG)	Large-body path — JSON-parse/forward bound; body handling dominates

Each tier ships in both forms: *-nonstream (non-SSE — reports throughput/RPS) and *-stream (SSE — reports TTFT, the first-byte overhead SSE clients feel). Run both.

All three use content.mode: sized on a mock upstream with cache_mode: bust. At the 128 tier, RPS may still be climbing at concurrency 800; switch to an open-loop sweep (-stages '@20000:60s,@40000:60s,...') to find the knee. Always confirm that gateway CPU is the saturated bottleneck: if RPS plateaus while gateway CPU is idle, you are measuring the mock, the load generator, or the network, not the gateway.

Outputs (in `-out`)

results-<ts>.jsonl — one line per turn, written as it completes (crash-safe): conv_uuid, scenario, turn, stream, latency_ms, ttft_ms, dns/conn/tls/transfer, status, prompt_tokens, completion_tokens, warmup, err.
report-<ts>.txt — per-stage × per-scenario table (RPS, ok%, latency p50…p99.9, TTFT p50/p95, ITL, output tokens/s), aggregate, generator-health section, and threshold pass/fail.
summary-<ts>.json — machine-readable (consumed by multi-service.sh and -compare).
report-<ts>.csv — one row per stage for spreadsheets/notebooks.
metrics-<ts>.prom — Prometheus text exposition (textfile collector / Pushgateway).
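
Because the JSONL file is one record per turn, ad-hoc analysis is straightforward. The sketch below decodes a few of the fields listed above and computes a median TTFT over measured streaming turns. The JSON types chosen for each field (for example `err` as a string, `ttft_ms` as a number) are assumptions. Check a real results file before relying on them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Turn covers a subset of the per-turn fields; types are assumed.
type Turn struct {
	ConvUUID  string  `json:"conv_uuid"`
	Scenario  string  `json:"scenario"`
	Turn      int     `json:"turn"`
	Stream    bool    `json:"stream"`
	LatencyMS float64 `json:"latency_ms"`
	TTFTMS    float64 `json:"ttft_ms"`
	Status    int     `json:"status"`
	Warmup    bool    `json:"warmup"`
	Err       string  `json:"err"`
}

// parseTurns decodes JSONL text, skipping blank lines.
func parseTurns(jsonl string) ([]Turn, error) {
	var turns []Turn
	for _, line := range strings.Split(jsonl, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		var t Turn
		if err := json.Unmarshal([]byte(line), &t); err != nil {
			return nil, fmt.Errorf("bad line %q: %w", line, err)
		}
		turns = append(turns, t)
	}
	return turns, nil
}

// medianTTFT returns the (upper) median TTFT of streaming turns that are
// neither warmup nor errored. ok is false when no turn qualifies.
func medianTTFT(turns []Turn) (median float64, ok bool) {
	var v []float64
	for _, t := range turns {
		if t.Stream && !t.Warmup && t.Err == "" {
			v = append(v, t.TTFTMS)
		}
	}
	if len(v) == 0 {
		return 0, false
	}
	sort.Float64s(v)
	return v[len(v)/2], true
}

func main() {
	sample := `{"conv_uuid":"c1","scenario":"medium","turn":0,"stream":true,"ttft_ms":120,"status":200}
{"conv_uuid":"c1","scenario":"medium","turn":1,"stream":true,"ttft_ms":180,"status":200}
{"conv_uuid":"c2","scenario":"medium","turn":0,"stream":true,"ttft_ms":150,"status":200}`
	turns, err := parseTurns(sample)
	if err != nil {
		panic(err)
	}
	m, _ := medianTTFT(turns)
	fmt.Printf("median ttft: %.0f ms over %d turns\n", m, len(turns))
}
```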

Reading the report

Per-stage percentiles are the meaningful view for a step test (each stage is one concurrency level). A flat p50 across rising concurrency means the gateway adds no concurrency penalty.
TTFT is the chat-UX metric (time to first token); output tokens/s is the throughput metric; ITL catches mid-stream stalls TTFT can't see.
Generator health tells you whether the load tester itself was the bottleneck: it lists the host FD limit, any JSONL sink overflow, and generator-side errors (gen_port_exhaustion, gen_fd_exhaustion). If those appear, the numbers reflect the harness, not the server — scale out / raise limits and re-run.
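
To see why ITL catches stalls that TTFT misses, consider a stream whose first token arrives quickly but which pauses mid-response. The sketch below derives TTFT, mean ITL, and worst-case ITL from chunk arrival times. `streamTimings` is a hypothetical helper, and the tool's own ITL definition may differ in detail (for example per token rather than per chunk).

```go
package main

import "fmt"

// streamTimings takes chunk arrival times in milliseconds since the request
// was sent. It returns time to first token, the mean gap between consecutive
// chunks, and the largest single gap. ok is false for an empty stream.
func streamTimings(arrivalsMS []float64) (ttft, meanITL, maxITL float64, ok bool) {
	if len(arrivalsMS) == 0 {
		return 0, 0, 0, false
	}
	ttft = arrivalsMS[0]
	for i := 1; i < len(arrivalsMS); i++ {
		if gap := arrivalsMS[i] - arrivalsMS[i-1]; gap > maxITL {
			maxITL = gap
		}
	}
	if n := len(arrivalsMS) - 1; n > 0 {
		meanITL = (arrivalsMS[n] - ttft) / float64(n)
	}
	return ttft, meanITL, maxITL, true
}

func main() {
	// First token at 400 ms, then a one-second stall before the fourth chunk.
	ttft, mean, worst, _ := streamTimings([]float64{400, 420, 440, 1440, 1460})
	fmt.Printf("ttft=%.0fms mean_itl=%.0fms max_itl=%.0fms\n", ttft, mean, worst)
	// prints: ttft=400ms mean_itl=265ms max_itl=1000ms
}
```

The TTFT of 400 ms looks healthy on its own; only the ITL figures expose the stall.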

Scaling to tens of thousands of concurrency

The tool raises its own RLIMIT_NOFILE (soft→hard) at startup. If it warns that the limit is still too low, raise ulimit -n on the generator host.
N concurrent connections need at least N file descriptors (plus some overhead) and N ephemeral ports on the generator host. Watch the generator-health section for gen_port_exhaustion and gen_fd_exhaustion.
The in-memory metrics aggregator is lock-free (atomic HDR histograms, atomic counters, a fixed error-class array, and a sync.Map of status-code counters) so request completions never contend on a mutex at high concurrency.
For a single very large run, one host may not be enough — run multiple generator hosts against the same target and merge the summary-*.json files.
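
The mutex-free counting pattern described above can be sketched with the standard library alone. This is an illustration of the technique (atomic counters plus a `sync.Map` keyed by status code), not the tool's `metrics` package. `StatusCounts` and its methods are hypothetical names, and HDR histograms are omitted.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// StatusCounts tallies completions per HTTP status without a mutex in
// the caller's path. The zero value is ready to use; do not copy it.
type StatusCounts struct {
	total  atomic.Int64
	byCode sync.Map // int -> *atomic.Int64
}

// Record counts one completed request with the given status code.
func (s *StatusCounts) Record(code int) {
	s.total.Add(1)
	c, ok := s.byCode.Load(code)
	if !ok {
		// First sighting of this code: LoadOrStore keeps exactly one
		// counter even if several goroutines race here.
		c, _ = s.byCode.LoadOrStore(code, new(atomic.Int64))
	}
	c.(*atomic.Int64).Add(1)
}

// Count returns the tally for one status code.
func (s *StatusCounts) Count(code int) int64 {
	if c, ok := s.byCode.Load(code); ok {
		return c.(*atomic.Int64).Load()
	}
	return 0
}

// Total returns the number of recorded completions.
func (s *StatusCounts) Total() int64 { return s.total.Load() }

// recordConcurrently simulates many virtual users finishing at once.
func recordConcurrently(s *StatusCounts, workers, perWorker, code int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				s.Record(code)
			}
		}()
	}
	wg.Wait()
}

func main() {
	var s StatusCounts
	recordConcurrently(&s, 8, 1000, 200)
	recordConcurrently(&s, 4, 1000, 429)
	fmt.Println("200:", s.Count(200), "429:", s.Count(429), "total:", s.Total())
}
```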

Regression gate (CI)

loadtest -compare old.json,new.json [-regress-pct 10] diffs two runs by stage, flags RPS drops / tail blow-ups, and exits non-zero on a regression — drop it into CI to fail a build that slows the gateway down. No load is generated.
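
The kind of check such a gate performs can be sketched as a per-stage percentage comparison. The `StageSummary` fields, the choice of p99 as the tail metric, and matching stages by position are all assumptions for illustration. The real summary.json schema and comparison rules live in the `report` package.

```go
package main

import "fmt"

// StageSummary is a simplified, hypothetical view of one stage's results.
type StageSummary struct {
	Stage string
	RPS   float64
	P99MS float64
}

// findRegressions compares stages pairwise (by position) and reports any
// RPS drop or p99 increase larger than pct percent of the baseline.
func findRegressions(base, cand []StageSummary, pct float64) []string {
	var out []string
	for i := 0; i < len(base) && i < len(cand); i++ {
		b, c := base[i], cand[i]
		if b.RPS > 0 {
			if drop := (b.RPS - c.RPS) / b.RPS * 100; drop > pct {
				out = append(out, fmt.Sprintf("%s: RPS down %.1f%%", b.Stage, drop))
			}
		}
		if b.P99MS > 0 {
			if rise := (c.P99MS - b.P99MS) / b.P99MS * 100; rise > pct {
				out = append(out, fmt.Sprintf("%s: p99 up %.1f%%", b.Stage, rise))
			}
		}
	}
	return out
}

func main() {
	base := []StageSummary{{"c=100", 100, 200}, {"c=1000", 900, 450}}
	cand := []StageSummary{{"c=100", 98, 205}, {"c=1000", 760, 470}}
	for _, f := range findRegressions(base, cand, 10) {
		fmt.Println("REGRESSION", f)
	}
	// A CI gate would exit non-zero here when any finding is reported.
}
```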

Server-side correlation (optional)

Each turn carries its conversation UUID (in the prompt and the x-request-id header). If your gateway logs each request to a database, join.sh is a template that correlates the client results-<ts>.jsonl against the server-side rows by that UUID, putting client-observed latency next to server-measured latency, upstream time, tokens, and cache status. That is how to tell whether a client-observed delta is real gateway overhead or something outside the handler (accept, TLS, scheduling, upstream variance). This step is entirely optional: the load tester itself has no DB dependency. Edit join.sh to match your schema.
