tscore: Highway-accelerated ASCII to_lower and base64 (runtime SIMD dispatch) by phongn · Pull Request #13320 · apache/trafficserver

phongn · 2026-06-24T17:35:57Z

Summary

Adds SIMD-accelerated implementations of ASCII lowercasing (ts::ascii::tolower_copy / tolower_inplace) and base64 encode/decode (ats_base64_encode / ats_base64_decode), built on Google Highway and selected at runtime by CPU capability. Both are gated behind a new build option that defaults to OFF — without it the scalar paths are used and there is no behavior change to existing builds.

This combines the previously separate to_lower and base64 SIMD efforts into one series and folds in the review fixes applied on our internal branch.

ASCII to_lower

New ts::ascii::tolower_copy(dst, src, n) / tolower_inplace(buf, n) in include/tscore/ink_ascii_tolower.h. Folds A–Z → a–z; all other bytes (including 0x80–0xFF) pass through unchanged; no UTF-8 folding; in-place (dst == src) supported.
Highway runtime-dispatched kernel in ink_ascii_tolower_dispatch.cc (one source compiled for SSE4/AVX2/AVX-512/NEON via foreach_target; the best target for the live CPU is chosen once and cached). When the option is off, a portable scalar loop is used.
Migrated the hand-rolled tolower loops to the new API at the relevant call sites — URL cache-key fast path (URL.cc), HPACK.cc, QPACK.cc, UrlRewrite.cc — with behavioral tests added alongside each (test_URL, test_RemapRules, test_HpackIndexingTable).
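
For reviewers who want the contract in one place, here is a minimal scalar sketch of the semantics described above. It is not the code in ink_ascii_tolower.h; the `sketch` namespace and `demo()` are illustrative names, and only the function names and argument order are taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch { // hypothetical namespace, not the PR's ts::ascii

// Copy n bytes from src to dst, folding ASCII 'A'..'Z' to 'a'..'z'.
// Every other byte, including 0x80..0xFF, is copied unchanged.
// dst == src is allowed because each byte is read before it is written.
inline void tolower_copy(char *dst, const char *src, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    if (c >= 'A' && c <= 'Z') {
      c = static_cast<unsigned char>(c + ('a' - 'A'));
    }
    dst[i] = static_cast<char>(c);
  }
}

// Single-buffer form, so call sites do not pass the same pointer twice.
inline void tolower_inplace(char *buf, std::size_t n)
{
  tolower_copy(buf, buf, n);
}

// Small self-check covering the in-place form and the no-UTF-8-folding rule.
inline bool demo()
{
  char host[] = "WWW.Example.COM";
  tolower_inplace(host, sizeof(host) - 1);
  bool ascii_ok = std::strcmp(host, "www.example.com") == 0;

  // UTF-8 "É" is 0xC3 0x89; both bytes must pass through untouched.
  const char in[] = {'\xC3', '\x89', 'Z'};
  char out[3];
  tolower_copy(out, in, 3);
  return ascii_ok && out[0] == '\xC3' && out[1] == '\x89' && out[2] == 'z';
}

} // namespace sketch
```

Any SIMD kernel has to reproduce exactly this byte mapping, which is what makes an exhaustive 0..255 sweep a complete check of the fold range.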

base64

Highway runtime-dispatched SIMD encode/decode (ink_base64_dispatch.{cc,h}), using the vectorized base64 algorithms from simdutf re-expressed in Highway (Muła/Lemire; aqrit's combined standard/URL-safe classifier).
Scalar primitives extracted to ink_base64_scalar.h, shared by the scalar path and the SIMD path's tail so the two cannot drift. Decode fuses validation into the SIMD loop and hands the remainder (including truncation at the first non-alphabet byte) to the scalar tail, so SIMD output is byte-for-byte identical to scalar — including in-place decode and mixed standard/URL-safe alphabets.
Fixes a latent out-of-bounds read in scalar ats_base64_decode: when the decodable prefix length was not a multiple of four, the old loop ran one iteration past the prefix (over-reading the input, and reading inBuffer[-2]). Decode now processes only whole 4-character groups plus an explicit 2/3-character tail. The decoded length and bytes are unchanged for every well-defined input.
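
The decode shape described above (find the decodable prefix, consume whole 4-character groups, then handle a 2- or 3-character tail explicitly) can be sketched as follows. This is an illustration, not the patched ats_base64_decode: the `std::string` interface, the helper names, and the choice to ignore a single leftover character are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch { // hypothetical namespace

// 6-bit value of one character, or -1 if it is in neither the standard
// nor the URL-safe alphabet ('=' padding counts as non-alphabet here).
inline int b64_value(unsigned char c)
{
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+' || c == '-') return 62;
  if (c == '/' || c == '_') return 63;
  return -1;
}

inline std::string decode(const std::string &in)
{
  auto val = [&in](std::size_t i) {
    return static_cast<std::uint32_t>(b64_value(static_cast<unsigned char>(in[i])));
  };

  // Decodable prefix: truncate at the first non-alphabet byte.
  std::size_t len = 0;
  while (len < in.size() && b64_value(static_cast<unsigned char>(in[len])) >= 0) {
    ++len;
  }

  std::string out;
  std::size_t i = 0;
  // Whole groups only: the loop never indexes at or past len.
  for (; i + 4 <= len; i += 4) {
    std::uint32_t v = (val(i) << 18) | (val(i + 1) << 12) | (val(i + 2) << 6) | val(i + 3);
    out.push_back(static_cast<char>(v >> 16));
    out.push_back(static_cast<char>((v >> 8) & 0xFF));
    out.push_back(static_cast<char>(v & 0xFF));
  }

  // Explicit tail: 2 chars yield 1 byte, 3 chars yield 2 bytes.
  // A single leftover char carries only 6 bits and is ignored (assumption).
  std::size_t rem = len - i;
  if (rem >= 2) {
    std::uint32_t v = (val(i) << 18) | (val(i + 1) << 12);
    if (rem == 3) {
      v |= val(i + 2) << 6;
    }
    out.push_back(static_cast<char>(v >> 16));
    if (rem == 3) {
      out.push_back(static_cast<char>((v >> 8) & 0xFF));
    }
  }
  return out;
}

} // namespace sketch
```

Because every index is bounded by the prefix length, there is no iteration that can read past the prefix or before the start of the buffer.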

Build / wiring

ENABLE_HIGHWAY_DISPATCH (default OFF) gates the SIMD paths via TS_HAS_HIGHWAY_DISPATCH; EXTERNAL_HWY selects an external Highway over the vendored copy.
New branch-highway CMake preset builds with the option on, turning the unit tests into real SIMD-vs-scalar parity checks.
NOTICE updated to attribute simdutf and Google Highway.

Performance

Measured on an Intel Xeon Gold 6338 (Ice Lake-SP, AVX-512), Release build (-O3), Highway dispatching to its AVX-512 target. Baselines are the scalar paths these replace. The public APIs keep the scalar path below the SIMD thresholds (encode 24 B, decode 32 chars) to avoid dispatch overhead on tiny inputs, which is why the smallest sizes show little gain.
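
To make the threshold behavior concrete, here is a sketch of the gating pattern: inputs below the threshold never reach the dispatcher, and the kernel choice is made once and cached. The two threshold values come from the paragraph above; everything else (the `Path` enum, `probe_cpu_once`, the stand-in kernels, and treating an input of exactly 24 bytes as SIMD-eligible) is hypothetical and does not use Highway.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch { // hypothetical namespace

enum class Path { Scalar, Simd };

constexpr std::size_t kEncodeSimdMinBytes = 24; // threshold quoted in the PR text
constexpr std::size_t kDecodeSimdMinChars = 32; // threshold quoted in the PR text

using Kernel = Path (*)(std::size_t);

// Stand-in kernels: they only report which path ran.
inline Path scalar_kernel(std::size_t) { return Path::Scalar; }
inline Path simd_kernel(std::size_t) { return Path::Simd; }

// Counts how often the (pretend) CPU probe runs.
inline int &probe_count()
{
  static int count = 0;
  return count;
}

// Stand-in for runtime target selection; a real build would query the CPU.
inline Kernel probe_cpu_once()
{
  ++probe_count();
  return &simd_kernel;
}

inline Path run(std::size_t n, std::size_t threshold)
{
  if (n < threshold) {
    return scalar_kernel(n); // tiny input: skip dispatch entirely
  }
  // Function-local static: initialized once, thread-safe since C++11.
  static const Kernel best = probe_cpu_once();
  return best(n);
}

inline Path encode_path(std::size_t n) { return run(n, kEncodeSimdMinBytes); }
inline Path decode_path(std::size_t n) { return run(n, kDecodeSimdMinChars); }

} // namespace sketch
```

With this shape, a 16-byte encode costs one compare on top of the scalar loop, which is consistent with the small-size rows in the tables showing little or no gain.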

ASCII tolower — ns per call, vs the byte-at-a-time ink_tolower loop:

bytes	scalar (ns)	Highway (ns)	speedup
8	5.9	7.9	0.7×
16	12.6	5.0	2.5×
32	21.8	4.5	4.9×
64	41.2	5.6	7.3×
256	175	12.0	14.6×
1024	676	32.5	20.8×

base64 decode — GB/s on input chars:

chars	scalar	Highway	speedup
64	1.1	5.2	4.9×
128	1.1	6.8	6.4×
512	1.1	6.9	6.4×
64 KB	1.2	8.0	6.9×

base64 encode — GB/s on input bytes:

bytes	scalar	Highway	speedup
96	1.2	3.6	3.1×
200	1.4	5.7	4.2×
512	1.4	6.9	5.1×
64 KB	1.3	7.5	6.0×

Testing

Unit tests for both features (test_ink_ascii_tolower.cc, test_ink_base64.cc) compare the public path against an independent scalar reference across sizes, alphabets, truncation, in-place, and buffer-bound cases; with ENABLE_HIGHWAY_DISPATCH=ON they become SIMD-vs-scalar parity tests.
tests/fuzzing/fuzz_base64.cc: libFuzzer target that decodes untrusted input and cross-checks both paths under sanitizers.
tools/benchmark/benchmark_ascii_tolower.cc reproduces the tolower numbers above.
Builds and unit tests pass with the option both ON and OFF.
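
The parity idea can be sketched without Highway. Below, a portable 8-bytes-at-a-time SWAR loop stands in for the SIMD kernel (a swapped-in technique, not what the PR ships) and is checked against an independent table-driven reference across sizes 0..257, including the in-place form. All names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace sketch { // hypothetical namespace

// Independent reference: a 256-entry lookup table.
inline std::array<unsigned char, 256> make_table()
{
  std::array<unsigned char, 256> t{};
  for (int i = 0; i < 256; ++i) {
    t[static_cast<std::size_t>(i)] = static_cast<unsigned char>((i >= 'A' && i <= 'Z') ? i + 32 : i);
  }
  return t;
}

inline void lower_reference(unsigned char *dst, const unsigned char *src, std::size_t n)
{
  static const std::array<unsigned char, 256> table = make_table();
  for (std::size_t i = 0; i < n; ++i) {
    dst[i] = table[src[i]];
  }
}

// Stand-in "wide" kernel: folds 8 bytes per step inside a uint64_t.
inline void lower_swar(unsigned char *dst, const unsigned char *src, std::size_t n)
{
  constexpr std::uint64_t kHigh = 0x8080808080808080ULL;
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    std::uint64_t x;
    std::memcpy(&x, src + i, 8);                       // alignment-safe load
    std::uint64_t low7 = x & ~kHigh;                   // no carries between lanes
    std::uint64_t ge_A = low7 + 0x3F3F3F3F3F3F3F3FULL; // lane high bit set if low7 >= 'A'
    std::uint64_t gt_Z = low7 + 0x2525252525252525ULL; // lane high bit set if low7 > 'Z'
    std::uint64_t upper = ge_A & ~gt_Z & ~x & kHigh;   // ~x excludes bytes >= 0x80
    x |= upper >> 2;                                   // 0x80 >> 2 == 0x20
    std::memcpy(dst + i, &x, 8);
  }
  for (; i < n; ++i) { // scalar tail, always fewer than 8 bytes
    unsigned char c = src[i];
    dst[i] = static_cast<unsigned char>((c >= 'A' && c <= 'Z') ? c + 32 : c);
  }
}

// Parity over sizes 0..257; for n >= 256 the fill pattern hits every byte value.
inline bool parity_holds()
{
  for (std::size_t n = 0; n <= 257; ++n) {
    std::vector<unsigned char> src(n), want(n), got(n);
    for (std::size_t i = 0; i < n; ++i) {
      src[i] = static_cast<unsigned char>(i * 37 + n);
    }
    lower_reference(want.data(), src.data(), n);
    lower_swar(got.data(), src.data(), n);
    if (got != want) return false;
    lower_swar(src.data(), src.data(), n); // in-place (dst == src)
    if (src != want) return false;
  }
  return true;
}

} // namespace sketch
```

The useful property is that the reference and the kernel share no code, so a mistake in the fold range cannot hide in both.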

Notes

Depends on the vendored Google Highway copy (Add in a vendor copy of Google Highway #13228).
CI currently exercises only the scalar paths; add a job that configures the branch-highway preset to get parity coverage of the SIMD kernels.

🤖 Generated with Claude Code

The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup.

Add a header-only helper ts::memcpy_tolower under include/tscore/ink_memcpy_tolower.h with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change.

Measured throughput on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

size	scalar	SSE2	AVX2	AVX-512BW
16 B	10.4	2.15	1.75	1.98
32 B	15.4	2.90	2.24	2.31
64 B	28.0	4.43	2.85	2.61
256 B	113	13.87	7.57	6.20
1024 B	425	50.47	24.23	17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Move ts::memcpy_tolower correctness coverage out of the ENABLE_BENCHMARKS-gated benchmark and into a new src/tscore/unit_tests/test_ink_memcpy_tolower.cc so ctest exercises the scalar and SIMD paths in every build. Covers boundary sizes bracketing each SIMD body width, the exhaustive 0..255 byte-value sweep, and the in-place (dst == src) form (Copilot).
- Fix the implementation-note comment on ts::memcpy_tolower to describe the actual AVX-512BW control flow (gated main loop + masked-tail load/store + early return), and document that in-place (dst == src) is supported on every path (Copilot).
- Add a Catch::Benchmark::keep_memory barrier in benchmark_memcpy_tolower so the compiler can no longer DCE the inlined stores past the first observed byte (Copilot).
- Migrate the in-place tolower loop in src/proxy/http3/QPACK.cc::_encode_header to ts::memcpy_tolower, demonstrating the in-place contract (bryancall).
- Add Tony Finch's copytolower64.c attribution to NOTICE (masaori335).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

memcpy_tolower carried two warts: the "memcpy" prefix implied non-overlapping by convention with libc memcpy (we explicitly support the in-place case), and the unqualified name didn't surface the ASCII-only semantics.

Rename the helper to ts::ascii::tolower_copy and add a thin ts::ascii::tolower_inplace(buf, n) wrapper so call sites that operate on a single buffer read naturally instead of passing the same pointer twice. Rename the header to include/tscore/ink_ascii_tolower.h, the unit test to src/tscore/unit_tests/test_ink_ascii_tolower.cc, and the benchmark to tools/benchmark/benchmark_ascii_tolower.cc to match. Update the two existing call sites (URL.cc fast-path scheme/host and QPACK::_encode_header in-place name lowercasing) accordingly.

No behavior change: the helper bodies are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrate two more byte-at-a-time ASCII tolower loops to ts::ascii::tolower_copy. Both call sites use a separate destination buffer, so the copy form is the right fit:

- hpack_encode_header_block(): lower-cases each MIMEField name before encoding to match the HTTP/2 lowercase-header-name requirement.
- UrlRewrite::_mappingLookup(): lower-cases the incoming request host into a stack buffer before the table lookup, so the lookup is case-insensitive against the lower-cased keys built at config-load time. The previous code used libc tolower(int) on signed char values, which is technically UB for bytes >= 0x80; the new call avoids that.

The existing unit tests in test_URL, test_HpackIndexingTable, and test_RemapRules executed the tolower paths but only with inputs that were already lower-case, so they would have missed a "skip the lowercasing" regression. Add focused behavioral coverage:

- test_URL.cc: four extra get_hash_test_cases that hash a request with uppercase/mixed-case scheme or host and require an equal hash to the lower-case form. Includes a 49-byte uppercase host that crosses both the 16- and 32-byte SIMD bodies.
- test_RemapRules.cc: a new SCENARIO that builds a UrlRewrite from a map for a lower-case host and requires that uppercase, mixed-case, and long-uppercase request hosts all match.
- test_HpackIndexingTable.cc: a new TEST_CASE that encodes a long mixed-case field name with hpack_encode_header_block and requires the encoded byte stream to be identical to encoding the same field with an already-lower-case name.

QPACK already exercises the in-place path through its Encoding test and the helper's own ts::ascii::tolower_inplace unit test covers in-place semantics exhaustively; an additional focused QPACK test would need the external .qif fixture infrastructure, which is out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
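
On the tolower(int) point: the C library requires the argument to be representable as unsigned char (or equal to EOF), so passing a plain char that holds a byte >= 0x80 is undefined where char is signed. A short sketch of the two well-defined alternatives follows; the function names are illustrative and not from the PR.

```cpp
#include <cassert>
#include <cctype>

namespace sketch { // hypothetical namespace

// Well-defined libc usage: convert to unsigned char before the call.
// The result still depends on the current C locale.
inline char lower_libc_safe(char c)
{
  return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Locale-independent ASCII fold: only 'A'..'Z' change, and no
// out-of-range value is ever handed to the C library.
inline char lower_ascii(char c)
{
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

} // namespace sketch
```

For host names and header names the second form is the safer fit, since the result cannot change if a plugin or embedding application switches the process locale.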

Add an optional runtime-dispatched SIMD path for ats_base64_encode / ats_base64_decode, gated by the ENABLE_HIGHWAY_DISPATCH CMake option (off by default; the scalar path is unchanged when off).

The scalar primitives move to ink_base64_scalar.h so the SIMD kernel's scalar tail and the scalar path share one definition and cannot drift. The kernels in ink_base64_dispatch.cc use Highway's portable SIMD ops (foreach_target + HWY_DYNAMIC_DISPATCH), so one source compiles for SSE4/AVX2/AVX-512 and the best target supported by the live CPU is chosen at runtime. The algorithms and lookup tables derive from the simdutf library (Mula/Lemire vectorized base64; aqrit's combined standard/URL-safe classifier), re-expressed in Highway; see NOTICE.

Decode fuses validation into the SIMD loop (consuming only fully-valid 16-byte blocks and finishing the remainder, including any non-alphabet truncation, on the scalar tail), so output is byte-for-byte identical to the scalar decoder, including in-place use and mixed standard/URL alphabets. encode reuses the scalar encoder for the padded tail.

Tests: unit_tests/test_ink_base64.cc cross-checks the public path against the scalar reference across sizes that straddle the SIMD thresholds, both alphabets, in-place decode, truncation at every position, and undersized output buffers; with the option on these become SIMD-vs-scalar parity checks. tests/fuzzing/fuzz_base64.cc adds a libFuzzer target that decodes untrusted input and cross-checks both paths under sanitizers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

phongn and others added 12 commits June 24, 2026 16:48

Implement ascii::to_lower using Google Highway's SIMD library

2df3844

Integrate internal or external highway dispatch

8f158c5

Fix merge error

779ce45

Use old API for backport

59b3602

fix another merge error

96cf854

Additional Claude Fable remediations

9e52929

fix wrong namespace

f430a25

phongn wants to merge 12 commits into apache:master from phongn:highway-tolower-base64
