Skip to content

tscore: Highway-accelerated ASCII to_lower and base64 (runtime SIMD dispatch)#13320

Open
phongn wants to merge 12 commits into
apache:masterfrom
phongn:highway-tolower-base64
Open

tscore: Highway-accelerated ASCII to_lower and base64 (runtime SIMD dispatch)#13320
phongn wants to merge 12 commits into
apache:masterfrom
phongn:highway-tolower-base64

Conversation

@phongn

@phongn phongn commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds SIMD-accelerated implementations of ASCII lowercasing (ts::ascii::tolower_copy / tolower_inplace) and base64 encode/decode (ats_base64_encode / ats_base64_decode), built on Google Highway and selected at runtime by CPU capability. Both are gated behind a new build option that defaults to OFF — without it the scalar paths are used and there is no behavior change to existing builds.

This combines the previously separate to_lower and base64 SIMD efforts into one series and folds in the review fixes applied on our internal branch.

ASCII to_lower

  • New ts::ascii::tolower_copy(dst, src, n) / tolower_inplace(buf, n) in include/tscore/ink_ascii_tolower.h. Folds AZaz; all other bytes (including 0x80–0xFF) pass through unchanged; no UTF-8 folding; in-place (dst == src) supported.
  • Highway runtime-dispatched kernel in ink_ascii_tolower_dispatch.cc (one source compiled for SSE4/AVX2/AVX-512/NEON via foreach_target; the best target for the live CPU is chosen once and cached). When the option is off, a portable scalar loop is used.
  • Migrated the hand-rolled tolower loops to the new API at the relevant call sites — URL cache-key fast path (URL.cc), HPACK.cc, QPACK.cc, UrlRewrite.cc — with behavioral tests added alongside each (test_URL, test_RemapRules, test_HpackIndexingTable).

base64

  • Highway runtime-dispatched SIMD encode/decode (ink_base64_dispatch.{cc,h}), using the vectorized base64 algorithms from simdutf re-expressed in Highway (Muła/Lemire; aqrit's combined standard/URL-safe classifier).
  • Scalar primitives extracted to ink_base64_scalar.h, shared by the scalar path and the SIMD path's tail so the two cannot drift. Decode fuses validation into the SIMD loop and hands the remainder (including truncation at the first non-alphabet byte) to the scalar tail, so SIMD output is byte-for-byte identical to scalar — including in-place decode and mixed standard/URL-safe alphabets.
  • Fixes a latent out-of-bounds read in scalar ats_base64_decode: when the decodable prefix length was not a multiple of four, the old loop ran one iteration past the prefix (over-reading the input, and reading inBuffer[-2]). Decode now processes only whole 4-character groups plus an explicit 2/3-character tail. The decoded length and bytes are unchanged for every well-defined input.

Build / wiring

  • ENABLE_HIGHWAY_DISPATCH (default OFF) gates the SIMD paths via TS_HAS_HIGHWAY_DISPATCH; EXTERNAL_HWY selects an external Highway over the vendored copy.
  • New branch-highway CMake preset builds with the option on, turning the unit tests into real SIMD-vs-scalar parity checks.
  • NOTICE updated to attribute simdutf and Google Highway.

Performance

Measured on an Intel Xeon Gold 6338 (Ice Lake-SP, AVX-512), Release build (-O3), Highway dispatching to its AVX-512 target. Baselines are the scalar paths these replace. The public APIs keep the scalar path below the SIMD thresholds (encode 24 B, decode 32 chars) to avoid dispatch overhead on tiny inputs, which is why the smallest sizes show little gain.

ASCII tolower — ns per call, vs the byte-at-a-time ink_tolower loop:

bytes scalar (ns) Highway (ns) speedup
8 5.9 7.9 0.7×
16 12.6 5.0 2.5×
32 21.8 4.5 4.9×
64 41.2 5.6 7.3×
256 175 12.0 14.6×
1024 676 32.5 20.8×

base64 decode — GB/s on input chars:

chars scalar Highway speedup
64 1.1 5.2 4.9×
128 1.1 6.8 6.4×
512 1.1 6.9 6.4×
64 KB 1.2 8.0 6.9×

base64 encode — GB/s on input bytes:

bytes scalar Highway speedup
96 1.2 3.6 3.1×
200 1.4 5.7 4.2×
512 1.4 6.9 5.1×
64 KB 1.3 7.5 6.0×

Testing

  • Unit tests for both features (test_ink_ascii_tolower.cc, test_ink_base64.cc) compare the public path against an independent scalar reference across sizes, alphabets, truncation, in-place, and buffer-bound cases; with ENABLE_HIGHWAY_DISPATCH=ON they become SIMD-vs-scalar parity tests.
  • tests/fuzzing/fuzz_base64.cc: libFuzzer target that decodes untrusted input and cross-checks both paths under sanitizers.
  • tools/benchmark/benchmark_ascii_tolower.cc reproduces the tolower numbers above.
  • Builds and unit tests pass with the option both ON and OFF.

Notes

  • Depends on the vendored Google Highway copy (Add in a vendor copy of Google Highway #13228).
  • CI currently exercises only the scalar paths; add a job that configures the branch-highway preset to get parity coverage of the SIMD kernels.

🤖 Generated with Claude Code

phongn and others added 12 commits June 24, 2026 16:48
The bulk ASCII tolower loop used to canonicalize the scheme and host
portions of a URL before hashing into the cache key runs at ~1.5 GB/s
scalar (one byte and one ParseRules table lookup per iteration). The
work is trivially data-parallel and there is no per-byte branching, so
a SIMD kernel that lowercases a whole register at once gives a
straightforward speedup once the input is long enough to amortize the
vector setup.

Add a header-only helper ts::memcpy_tolower under
include/tscore/ink_memcpy_tolower.h with a compile-time-selected
cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2
on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to
narrower drain loops, so the worst-case scalar tail is always <16
bytes. Selection is purely compile-time; runtime ifunc dispatch is
left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional
"+0x20 where upper" into a single op, and a masked load/store handles
1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's
copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole
AVX-512BW block is gated at n >= 64 because the masked load/store has
~7 ns of fixed setup that loses to the narrower paths for short
inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly:
bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF)
pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with
this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds
that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW)
get the wider bodies automatically. Sub-16-byte inputs (e.g. short
HTTP schemes like "http") use the scalar tail and see no perf change.

Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

  size   scalar   SSE2     AVX2     AVX-512BW
  ----   ------   ----     ----     ---------
  16 B   10.4     2.15     1.75     1.98
  32 B   15.4     2.90     2.24     2.31
  64 B   28.0     4.43     2.85     2.61
  256 B  113      13.87    7.57     6.20
  1024 B 425      50.47    24.23    17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across
sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-
value sweep that guards against any future widening of the case-fold
range, alongside scalar-vs-SIMD throughput numbers and a config-print
case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move ts::memcpy_tolower correctness coverage out of the
  ENABLE_BENCHMARKS-gated benchmark and into a new
  src/tscore/unit_tests/test_ink_memcpy_tolower.cc so ctest exercises
  the scalar and SIMD paths in every build. Covers boundary sizes
  bracketing each SIMD body width, the exhaustive 0..255 byte-value
  sweep, and the in-place (dst == src) form (Copilot).

- Fix the implementation-note comment on ts::memcpy_tolower to
  describe the actual AVX-512BW control flow (gated main loop +
  masked-tail load/store + early return), and document that in-place
  (dst == src) is supported on every path (Copilot).

- Add a Catch::Benchmark::keep_memory barrier in
  benchmark_memcpy_tolower so the compiler can no longer DCE the
  inlined stores past the first observed byte (Copilot).

- Migrate the in-place tolower loop in
  src/proxy/http3/QPACK.cc::_encode_header to ts::memcpy_tolower,
  demonstrating the in-place contract (bryancall).

- Add Tony Finch's copytolower64.c attribution to NOTICE
  (masaori335).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
memcpy_tolower carried two warts: the "memcpy" prefix implied
non-overlapping by convention with libc memcpy (we explicitly support
the in-place case), and the unqualified name didn't surface the
ASCII-only semantics. Rename the helper to ts::ascii::tolower_copy
and add a thin ts::ascii::tolower_inplace(buf, n) wrapper so call
sites that operate on a single buffer read naturally instead of
passing the same pointer twice.

Rename the header to include/tscore/ink_ascii_tolower.h, the unit
test to src/tscore/unit_tests/test_ink_ascii_tolower.cc, and the
benchmark to tools/benchmark/benchmark_ascii_tolower.cc to match.
Update the two existing call sites (URL.cc fast-path scheme/host and
QPACK::_encode_header in-place name lowercasing) accordingly. No
behavior change: the helper bodies are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate two more byte-at-a-time ASCII tolower loops to
ts::ascii::tolower_copy. Both call sites use a separate destination
buffer, so the copy form is the right fit:

- hpack_encode_header_block(): lower-cases each MIMEField name before
  encoding to match the HTTP/2 lowercase-header-name requirement.

- UrlRewrite::_mappingLookup(): lower-cases the incoming request host
  into a stack buffer before the table lookup, so the lookup is
  case-insensitive against the lower-cased keys built at config-load
  time. The previous code used libc tolower(int) on signed char values,
  which is technically UB for bytes >= 0x80; the new call avoids that.

The existing unit tests in test_URL, test_HpackIndexingTable, and
test_RemapRules executed the tolower paths but only with inputs that
were already lower-case, so they would have missed a "skip the
lowercasing" regression. Add focused behavioral coverage:

- test_URL.cc: four extra get_hash_test_cases that hash a request with
  uppercase/mixed-case scheme or host and require an equal hash to the
  lower-case form. Includes a 49-byte uppercase host that crosses both
  the 16- and 32-byte SIMD bodies.

- test_RemapRules.cc: a new SCENARIO that builds a UrlRewrite from a
  map for a lower-case host and requires that uppercase, mixed-case,
  and long-uppercase request hosts all match.

- test_HpackIndexingTable.cc: a new TEST_CASE that encodes a long
  mixed-case field name with hpack_encode_header_block and requires
  the encoded byte stream to be identical to encoding the same field
  with an already-lower-case name.

QPACK already exercises the in-place path through its Encoding test
and the helper's own ts::ascii::tolower_inplace unit test covers
in-place semantics exhaustively; an additional focused QPACK test
would need the external .qif fixture infrastructure, which is out of
scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an optional runtime-dispatched SIMD path for ats_base64_encode /
ats_base64_decode, gated by the ENABLE_HIGHWAY_DISPATCH CMake option
(off by default; the scalar path is unchanged when off). The scalar
primitives move to ink_base64_scalar.h so the SIMD kernel's scalar
tail and the scalar path share one definition and cannot drift.

The kernels in ink_base64_dispatch.cc use Highway's portable SIMD ops
(foreach_target + HWY_DYNAMIC_DISPATCH), so one source compiles for
SSE4/AVX2/AVX-512 and the best target supported by the live CPU is
chosen at runtime. The algorithms and lookup tables derive from the
simdutf library (Mula/Lemire vectorized base64; aqrit's combined
standard/URL-safe classifier), re-expressed in Highway; see NOTICE.

Decode fuses validation into the SIMD loop (consuming only fully-valid
16-byte blocks and finishing the remainder, including any non-alphabet
truncation, on the scalar tail), so output is byte-for-byte identical
to the scalar decoder, including in-place use and mixed standard/URL
alphabets. encode reuses the scalar encoder for the padded tail.

Tests: unit_tests/test_ink_base64.cc cross-checks the public path
against the scalar reference across sizes that straddle the SIMD
thresholds, both alphabets, in-place decode, truncation at every
position, and undersized output buffers; with the option on these
become SIMD-vs-scalar parity checks. tests/fuzzing/fuzz_base64.cc adds
a libFuzzer target that decodes untrusted input and cross-checks both
paths under sanitizers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant