
AlphaBitCore/nexus-loadtest

loadtest — LLM gateway stress tester

License: Apache-2.0

A purpose-built, scenario-driven load generator for LLM gateways — any OpenAI- or Anthropic-compatible endpoint. Decoupled from the system under test (zero DB dependency). Designed to scale to tens of thousands of concurrent virtual users / requests from one host.

The goal is to simulate real traffic — a weighted blend of short lookups, multi-turn chats, and heavy long-context conversations — not to fire N identical requests. Realistic shape is what makes the numbers mean something. It models the things a generic HTTP tool ignores: multi-turn conversations (growing context), streaming with time-to-first-token (TTFT) and inter-token latency (ITL), token throughput, and response/prompt caching that silently distorts results.

Full design rationale: DESIGN.md.

Install & run

It is a standard Go module — install the binary or run from source:

# install the CLI
go install github.com/alphaBitCore/loadtest/cmd/loadtest@latest

# or build / run from a checkout
go build -o loadtest ./cmd/loadtest
./loadtest -config profiles/realistic.json -vk <api-key> -out runs/
# or:
go run ./cmd/loadtest -config profiles/realistic.json -vk <api-key> -out runs/

Flags:

flag          meaning
----          -------
-config       profile path (required); see profiles/
-vk           sets Authorization: Bearer <vk> on every scenario
-target       override the target URL (so one profile hits any gateway)
-model        override the model string, e.g. for a gateway that needs a provider/model form
-stages       override stages, e.g. 1:10s,100:60s,1000:120s
-out          output directory
-compare      regression-compare two summary.json files (no load run)
-regress-pct  regression threshold percent for -compare (default 10)
-header       extra request header K: V, repeatable; applied to all scenarios after -vk (e.g. Portkey routing headers)
-version      print build version and exit

Profiles ship with a localhost target and a REPLACE_WITH_VK placeholder — never commit a real domain or key. Point -target/-vk at your own gateway.
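
To make the -stages format concrete, here is a minimal Go sketch that parses a value such as 1:10s,100:60s,1000:120s into a list of steps. The Stage type and ParseStages function are hypothetical names for illustration, not the tool's own parser. Reading a leading @ as an open-loop (target rate) stage is an assumption based on the open-loop sweep example later in this README.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Stage is one step of a load test. Hypothetical type, not the tool's own.
type Stage struct {
	Level    int           // concurrency, or target rate when OpenLoop is true
	Duration time.Duration // how long the stage runs
	OpenLoop bool          // assumption: set when the entry starts with '@'
}

// ParseStages parses a spec such as "1:10s,100:60s,1000:120s".
func ParseStages(spec string) ([]Stage, error) {
	var stages []Stage
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		open := strings.HasPrefix(part, "@")
		part = strings.TrimPrefix(part, "@")
		level, dur, ok := strings.Cut(part, ":")
		if !ok {
			return nil, fmt.Errorf("stage %q: want LEVEL:DURATION", part)
		}
		n, err := strconv.Atoi(level)
		if err != nil || n <= 0 {
			return nil, fmt.Errorf("stage %q: bad level", part)
		}
		d, err := time.ParseDuration(dur)
		if err != nil || d <= 0 {
			return nil, fmt.Errorf("stage %q: bad duration", part)
		}
		stages = append(stages, Stage{Level: n, Duration: d, OpenLoop: open})
	}
	return stages, nil
}

func main() {
	stages, err := ParseStages("1:10s,100:60s,1000:120s")
	if err != nil {
		panic(err)
	}
	for _, s := range stages {
		fmt.Println(s.Level, s.Duration, s.OpenLoop)
	}
}
```

Each entry is one step of a step test, which is why the report is best read per stage.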

Using it as a library

The engine is importable — the layers are real packages, not just files:

github.com/alphaBitCore/loadtest/protocol  // wire-format adapters (self-registering)
github.com/alphaBitCore/loadtest/config    // declarative JSON profile + validation
github.com/alphaBitCore/loadtest/metrics   // lock-free HDR aggregator
github.com/alphaBitCore/loadtest/engine    // closed/open-loop runners
github.com/alphaBitCore/loadtest/report    // text/JSON/CSV/Prom + regression compare
github.com/alphaBitCore/loadtest/cmd/loadtest // the CLI (thin wiring)

Realistic workloads (simple / medium / complex, multi-turn)

profiles/realistic.json is the flagship profile. It blends three conversation tiers by weight, two of them multi-turn. A multi-turn virtual user (VU) feeds the assistant reply back into the next turn, so context (and with it cost and latency) grows like a real session:

tier     weight  turns  stream  shape
----     ------  -----  ------  -----
simple   55%     1      no      short single-shot lookups (the bulk of real traffic)
medium   30%     4      yes     a real multi-turn technical chat that builds context
complex  15%     3      yes     large system prompt + multi-turn architecture/reasoning dialogue, long output

Why tiers matter: a single prompt size hides real behaviour — small bodies make extraction/scanning/cost look free, while large bodies and long outputs exercise the long-context path, token throughput, and per-byte costs. think_time adds a pause between turns to model a real user.
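
The weighted blend can be pictured as a cumulative-weight pick per new conversation. The sketch below is an illustration under assumptions: the Tier type and pick function are hypothetical names, and the engine's real selection logic may differ. The weights and turn counts are the ones from the tier table.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Tier is a hypothetical stand-in for one scenario in the profile.
type Tier struct {
	Name   string
	Weight int // relative weight; the three tiers sum to 100
	Turns  int
}

var tiers = []Tier{
	{"simple", 55, 1},
	{"medium", 30, 4},
	{"complex", 15, 3},
}

// pick maps a roll in [0, total weight) to a tier by walking cumulative weights.
func pick(tiers []Tier, roll int) Tier {
	for _, t := range tiers {
		if roll < t.Weight {
			return t
		}
		roll -= t.Weight
	}
	return tiers[len(tiers)-1]
}

func main() {
	total := 0
	for _, t := range tiers {
		total += t.Weight
	}
	counts := map[string]int{}
	rng := rand.New(rand.NewSource(1))
	for i := 0; i < 10000; i++ {
		counts[pick(tiers, rng.Intn(total)).Name]++
	}
	fmt.Println(counts) // shares should land near 55/30/15 percent
}
```

Note that the weights apply to conversations, not requests: a medium conversation issues four turns, so its share of total requests is larger than 30%.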

content.mode per scenario:

  • pool — random prompt from a list (varied single-shots).
  • scripted — a fixed, coherent dialogue (turn i uses script[i]).
  • sized — generate ~approx_input_tokens of input (controls input size).
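
The three modes above could be sketched as follows. This is a hedged illustration: the Content struct, its field names, and the "one filler word is roughly one token" rule for sized mode are all assumptions, not the profile schema (which is documented in DESIGN.md).

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Content is a hypothetical per-scenario content block.
type Content struct {
	Mode              string   // "pool", "scripted", or "sized"
	Pool              []string // candidates for pool mode
	Script            []string // one entry per turn for scripted mode
	ApproxInputTokens int      // target input size for sized mode
}

// Prompt returns the user message for the given zero-based turn.
func (c Content) Prompt(turn int, rng *rand.Rand) (string, error) {
	switch c.Mode {
	case "pool":
		if len(c.Pool) == 0 {
			return "", fmt.Errorf("pool mode: empty pool")
		}
		return c.Pool[rng.Intn(len(c.Pool))], nil
	case "scripted":
		if turn < 0 || turn >= len(c.Script) {
			return "", fmt.Errorf("scripted mode: no script entry for turn %d", turn)
		}
		return c.Script[turn], nil
	case "sized":
		if c.ApproxInputTokens <= 0 {
			return "", fmt.Errorf("sized mode: need a positive token target")
		}
		// Assumption: one short filler word is roughly one token.
		return strings.TrimSpace(strings.Repeat("lorem ", c.ApproxInputTokens)), nil
	}
	return "", fmt.Errorf("unknown mode %q", c.Mode)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	chat := Content{Mode: "scripted", Script: []string{"What is a mutex?", "Show an example in Go."}}
	for turn := range chat.Script {
		p, _ := chat.Prompt(turn, rng)
		fmt.Println(turn, p)
	}
	big := Content{Mode: "sized", ApproxInputTokens: 128}
	p, _ := big.Prompt(0, rng)
	fmt.Println(len(strings.Fields(p)), "filler words")
}
```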

Why test multiple services simultaneously

When comparing gateways (e.g. gateway A vs B vs C), run them in the same wall-clock window, not one after another.

The upstream provider is the dominant latency term and it drifts minute to minute. If you benchmark gateway A, then B, then C sequentially, each samples a different upstream window — so a slow upstream draw during A's turn looks like "A is slow" when it isn't. That sampling unfairness can manufacture a 2–3× "difference" that is pure provider noise.

Run all gateways co-located, simultaneously: they then sample identical upstream conditions concurrently, the shared upstream latency is common-mode, and it cancels out when you take the per-gateway delta. What's left — the TTFT delta between gateways — is ≈ the gateway's own overhead.

multi-service.sh does exactly this: it launches the same profile against every gateway in the same window each round and prints a per-round delta table.

# edit the SERVICES array in the script, then:
GATEWAY_A_KEY=... GATEWAY_B_KEY=... ./multi-service.sh profiles/realistic.json 50:60s 3

gateway     ttft_p50  ttft_p95  lat_p95  rps   err
-------     --------  --------  -------  ----  -----
gateway-a   1107      1702      2084     38.0  0.00%
gateway-b   1204      1919      2231     38.0  0.00%

ttft_p50 delta vs fastest (= gateway overhead; upstream is shared this window):
  gateway-a    +    0 ms
  gateway-b    +   97 ms

Gateways under test are defined in the SERVICES array (name|target|model|bearer); keys come from env vars, never hardcoded. Even simultaneous numbers are only fair when the execution substrate matches — e.g. comparing a native process against a gateway running inside a Docker VM is not apples-to-apples.
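
The delta table is simple arithmetic: subtract the fastest gateway's ttft_p50 from every gateway's ttft_p50 for the same round. A minimal Go sketch, using the two values from the sample output above; the Result and Delta types and the DeltasVsFastest function are hypothetical names, not part of multi-service.sh.

```go
package main

import (
	"fmt"
	"sort"
)

// Result is one gateway's number for a single round (hypothetical type).
type Result struct {
	Gateway string
	TTFTp50 float64 // milliseconds
}

// Delta is a gateway's overhead relative to the fastest gateway in the round.
type Delta struct {
	Gateway    string
	OverheadMS float64
}

// DeltasVsFastest subtracts the round's lowest ttft_p50 from every gateway.
// The shared upstream latency is common to all gateways in the same window,
// so it drops out of the difference.
func DeltasVsFastest(round []Result) []Delta {
	if len(round) == 0 {
		return nil
	}
	fastest := round[0].TTFTp50
	for _, r := range round[1:] {
		if r.TTFTp50 < fastest {
			fastest = r.TTFTp50
		}
	}
	out := make([]Delta, 0, len(round))
	for _, r := range round {
		out = append(out, Delta{r.Gateway, r.TTFTp50 - fastest})
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].OverheadMS < out[j].OverheadMS })
	return out
}

func main() {
	round := []Result{{"gateway-a", 1107}, {"gateway-b", 1204}}
	for _, d := range DeltasVsFastest(round) {
		fmt.Printf("%-10s +%5.0f ms\n", d.Gateway, d.OverheadMS)
	}
}
```

Comparing deltas across several rounds is what separates a stable overhead from one-off noise.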

Benchmark hygiene — disable hooks and cache

A gateway-overhead benchmark must isolate the raw forwarding path. Two features distort it, so turn them off on every gateway under test:

  • Response cache. A cache hit returns without calling the provider, which collapses latency and inflates throughput; even on a miss the cache lookup adds work. cache_mode: bust (default) prefixes every conversation with a unique UUID so the provider/gateway cache always misses — but for a clean baseline also disable the response cache so the lookup cost and any accidental hits are out of the picture.
  • Compliance / PII hooks. Request and response hooks add per-request work (content extraction, scanning, rewrite) whose cost scales with body size. Enable them only when you are specifically measuring hook cost.

Measure the raw path first; then measure each feature as a delta on top of that baseline.
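
As an illustration of cache busting, the sketch below generates a random UUID per conversation and prefixes the first prompt with it, so no two conversations share a cacheable prefix. The newUUID and bust functions and the bracketed prefix format are assumptions for illustration; the tool's actual prefix format is not specified here.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUID returns a random (version 4) UUID string.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// bust prefixes a prompt with the conversation UUID.
// The "[uuid] " format is hypothetical.
func bust(convUUID, prompt string) string {
	return "[" + convUUID + "] " + prompt
}

func main() {
	for i := 0; i < 2; i++ {
		id, err := newUUID()
		if err != nil {
			panic(err)
		}
		// Same prompt text, different bytes on the wire each time.
		fmt.Println(bust(id, "What is the capital of France?"))
	}
}
```

Because the prefix changes the start of the prompt, both exact-match response caches and prefix-based prompt caches see a new key.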

Config

See profiles/realistic.json (the flagship mixed workload), profiles/ai-gateway.json (a minimal example), and profiles/cross-ingress.json (two ingress protocols of one gateway at once). Schema is documented in DESIGN.md §Config. With no scenarios block, the top-level defaults form a single implicit scenario.

Gateway perf tiers (head-to-head overhead) — pick by prompt size

For gateway-overhead comparisons against a mock upstream, prompt SIZE decides what you measure, so the perf profiles come in three sized tiers. Run all three and report each separately — a single size hides half the story (a large body makes every gateway JSON-parse-bound and masks routing-core differences; a tiny body exposes them). Sizes follow the industry-standard shapes so numbers are comparable to published benchmarks (NVIDIA/vLLM 128:128, LLMPerf/Anyscale mean-550 chat).

Profile (non-SSE / SSE)                                 input/output tokens  Standard            Measures
-----------------------                                 -------------------  --------            --------
perf-128-nonstream.json / perf-128-stream.json          128 / 128            NVIDIA, vLLM        Routing-core overhead: fixed per-request cost (auth/quota/routing/audit); finds the real RPS ceiling (tens of thousands)
perf-550-nonstream.json / perf-550-stream.json          550 / 150            LLMPerf, Anyscale   Realistic chat: the recommended headline number, balanced
perf-hotpath-nonstream.json / perf-hotpath-stream.json  ~12.5k / 64          (long-context/RAG)  Large-body path: JSON-parse/forward bound; body handling dominates

Each tier ships in both forms: *-nonstream (non-SSE — reports throughput/RPS) and *-stream (SSE — reports TTFT, the first-byte overhead SSE clients feel). Run both.

All three are mode: sized on a mock upstream with cache_mode: bust. At the 128 tier RPS may still be climbing at concurrency 800; in that case switch to an open-loop sweep (-stages '@20000:60s,@40000:60s,...') to find the knee. Always confirm that the gateway box CPU is the bottleneck (saturated): if RPS plateaus while gateway CPU is idle, you are measuring the mock, the load generator, or the network, not the gateway.
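
One way to read an open-loop sweep is to look for the first step where achieved RPS stops tracking the offered rate. The sketch below is a heuristic under assumptions, not the tool's method: the Point type, the Knee function, and the 95% tracking ratio are all hypothetical, and the numbers in main are made-up inputs, not measurements.

```go
package main

import "fmt"

// Point is one step of an open-loop sweep (hypothetical type).
type Point struct {
	Offered  float64 // requested arrival rate, req/s
	Achieved float64 // completed requests per second
}

// Knee returns the index of the first step whose achieved RPS falls below
// ratio*offered, or -1 if every step keeps up with the offered rate.
func Knee(sweep []Point, ratio float64) int {
	for i, p := range sweep {
		if p.Achieved < ratio*p.Offered {
			return i
		}
	}
	return -1
}

func main() {
	// Illustrative inputs only.
	sweep := []Point{
		{Offered: 20000, Achieved: 19950},
		{Offered: 40000, Achieved: 39700},
		{Offered: 60000, Achieved: 47000},
	}
	if i := Knee(sweep, 0.95); i >= 0 {
		fmt.Printf("knee at offered rate %.0f req/s\n", sweep[i].Offered)
	} else {
		fmt.Println("no knee found; extend the sweep")
	}
}
```

A knee found this way only describes the gateway if the gateway CPU was saturated at that step.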

Outputs (in -out)

  • results-<ts>.jsonl — one line per turn, written as it completes (crash-safe): conv_uuid, scenario, turn, stream, latency_ms, ttft_ms, dns/conn/tls/transfer, status, prompt_tokens, completion_tokens, warmup, err.
  • report-<ts>.txt — per-stage × per-scenario table (RPS, ok%, latency p50…p99.9, TTFT p50/p95, ITL, output tokens/s), aggregate, generator-health section, and threshold pass/fail.
  • summary-<ts>.json — machine-readable (consumed by multi-service.sh and -compare).
  • report-<ts>.csv — one row per stage for spreadsheets/notebooks.
  • metrics-<ts>.prom — Prometheus text exposition (textfile collector / Pushgateway).

Reading the report

  • Per-stage percentiles are the meaningful view for a step test (each stage is one concurrency level). A flat p50 across rising concurrency means the gateway adds no concurrency penalty.
  • TTFT is the chat-UX metric (time to first token); output tokens/s is the throughput metric; ITL catches mid-stream stalls TTFT can't see.
  • Generator health tells you whether the load tester itself was the bottleneck: it lists the host FD limit, any JSONL sink overflow, and generator-side errors (gen_port_exhaustion, gen_fd_exhaustion). If those appear, the numbers reflect the harness, not the server — scale out / raise limits and re-run.
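
To see why ITL catches stalls that TTFT misses, consider token arrival times measured from the moment the request was sent. TTFT is the first arrival; ITL is the gap between consecutive arrivals. The sketch below uses hypothetical function names and made-up arrival times to show a mid-stream stall that leaves TTFT untouched.

```go
package main

import "fmt"

// TTFTAndITL takes token arrival offsets in milliseconds since the request
// was sent and returns time-to-first-token plus the inter-token gaps.
func TTFTAndITL(arrivalsMS []int64) (ttft int64, itl []int64) {
	if len(arrivalsMS) == 0 {
		return 0, nil
	}
	ttft = arrivalsMS[0]
	for i := 1; i < len(arrivalsMS); i++ {
		itl = append(itl, arrivalsMS[i]-arrivalsMS[i-1])
	}
	return ttft, itl
}

// MaxGap returns the largest inter-token gap, or 0 for an empty slice.
func MaxGap(itl []int64) int64 {
	var m int64
	for _, g := range itl {
		if g > m {
			m = g
		}
	}
	return m
}

func main() {
	// Illustrative stream: steady 30 ms cadence with one 1 s stall.
	arrivals := []int64{400, 430, 460, 1460, 1490}
	ttft, itl := TTFTAndITL(arrivals)
	fmt.Println("ttft_ms:", ttft)
	fmt.Println("itl_ms:", itl)
	fmt.Println("max_itl_ms:", MaxGap(itl))
}
```

Here TTFT looks healthy at 400 ms, while the maximum ITL of 1000 ms exposes the stall.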

Scaling to tens of thousands of concurrency

  • The tool raises its own RLIMIT_NOFILE (soft→hard) at startup. If it warns that the limit is still too low, raise ulimit -n on the generator host.
  • N concurrent connections need roughly N file descriptors (plus a few for the process itself) and N ephemeral ports on the generator host. Watch the generator-health section for gen_port_exhaustion.
  • The in-memory metrics aggregator is lock-free (atomic HDR histograms, atomic counters, a fixed error-class array, and a sync.Map of status-code counters) so request completions never contend on a mutex at high concurrency.
  • For a single very large run, one host may not be enough — run multiple generator hosts against the same target and merge the summary-*.json files.
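
Raising the soft file-descriptor limit to the hard limit can be done with the standard syscall package on Unix-like systems. This is a minimal sketch of the technique, not the tool's own startup code; RaiseNoFile is a hypothetical name, and the code does not build on Windows. On some platforms the hard limit is reported as unlimited and the set call may fail, which is why the error is returned rather than ignored.

```go
package main

import (
	"fmt"
	"syscall"
)

// RaiseNoFile lifts the soft RLIMIT_NOFILE to the hard limit and reports
// the soft limit before and after. Unix-only.
func RaiseNoFile() (before, after uint64, err error) {
	var lim syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	before = uint64(lim.Cur)
	lim.Cur = lim.Max // soft -> hard
	if err = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return before, before, err
	}
	// Read the limit back instead of trusting the value we asked for.
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return before, before, err
	}
	return before, uint64(lim.Cur), nil
}

func main() {
	before, after, err := RaiseNoFile()
	if err != nil {
		fmt.Println("could not raise RLIMIT_NOFILE:", err)
		return
	}
	fmt.Printf("RLIMIT_NOFILE soft limit: %d -> %d\n", before, after)
}
```

If the resulting limit is still below the planned concurrency, the hard limit itself has to be raised outside the process (for example with ulimit -n in the launching shell).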

Regression gate (CI)

loadtest -compare old.json,new.json [-regress-pct 10] diffs two runs by stage, flags RPS drops / tail blow-ups, and exits non-zero on a regression — drop it into CI to fail a build that slows the gateway down. No load is generated.
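
The gate boils down to a per-stage percentage comparison against a threshold. The sketch below illustrates that idea under assumptions: the StageSummary fields, the choice of RPS and p99 as the compared metrics, and the Regressions function are hypothetical and do not reflect the actual summary.json schema. The numbers in main are made-up inputs.

```go
package main

import "fmt"

// StageSummary is a hypothetical per-stage slice of a run summary.
type StageSummary struct {
	Stage string
	RPS   float64
	P99MS float64
}

// Regressions lists stages where RPS dropped or p99 latency rose by more
// than pct percent relative to the old run.
func Regressions(old, cur []StageSummary, pct float64) []string {
	base := make(map[string]StageSummary, len(old))
	for _, s := range old {
		base[s.Stage] = s
	}
	var out []string
	for _, s := range cur {
		b, ok := base[s.Stage]
		if !ok {
			continue // stage not present in the old run; nothing to compare
		}
		if b.RPS > 0 && (b.RPS-s.RPS)/b.RPS*100 > pct {
			out = append(out, fmt.Sprintf("stage %s: RPS %.0f -> %.0f", s.Stage, b.RPS, s.RPS))
		}
		if b.P99MS > 0 && (s.P99MS-b.P99MS)/b.P99MS*100 > pct {
			out = append(out, fmt.Sprintf("stage %s: p99 %.0f ms -> %.0f ms", s.Stage, b.P99MS, s.P99MS))
		}
	}
	return out
}

func main() {
	old := []StageSummary{{"100", 1000, 200}}
	cur := []StageSummary{{"100", 850, 205}}
	for _, r := range Regressions(old, cur, 10) {
		fmt.Println("REGRESSION:", r)
	}
	// A CI gate would exit non-zero here when the list is not empty.
}
```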

Server-side correlation (optional)

Each turn carries its conversation UUID (in the prompt and the x-request-id header). If your gateway logs each request to a database, join.sh is a template that correlates the client results.jsonl against the server-side rows by that UUID, putting client-observed latency next to server-measured latency, upstream time, tokens, and cache status. That is the way to tell whether a client-observed delta is real gateway overhead or something outside the handler (accept/TLS/scheduling, upstream variance). This step is entirely optional: the load tester itself has no DB dependency, so edit join.sh to match your schema.
