tscore: Highway-accelerated ASCII to_lower and base64 (runtime SIMD dispatch) #13320
Open
phongn wants to merge 12 commits into
The bulk ASCII tolower loop used to canonicalize the scheme and host portions of a URL before hashing into the cache key runs at ~1.5 GB/s scalar (one byte and one ParseRules table lookup per iteration). The work is trivially data-parallel and there is no per-byte branching, so a SIMD kernel that lowercases a whole register at once gives a straightforward speedup once the input is long enough to amortize the vector setup.

Add a header-only helper ts::memcpy_tolower under include/tscore/ink_memcpy_tolower.h with a compile-time-selected cascade of SIMD bodies: 64-byte AVX-512BW, 32-byte AVX2, 16-byte SSE2 on x86_64, plus 16-byte NEON on ARMv8. Wider bodies fall through to narrower drain loops, so the worst-case scalar tail is always <16 bytes. Selection is purely compile-time; runtime ifunc dispatch is left for a follow-up.

The AVX-512BW body uses _mm512_mask_add_epi8 to fuse the conditional "+0x20 where upper" into a single op, and a masked load/store handles 1..63 leftover bytes in a single SIMD pass (inspired by Tony Finch's copytolower64.c, https://dotat.at/cgi/git/vectolower.git/). The whole AVX-512BW block is gated at n >= 64 because the masked load/store has ~7 ns of fixed setup that loses to the narrower paths for short inputs; below 64 bytes we fall through to the AVX2 + SSE2 cascade.

Semantics match the existing ParseRules::ink_tolower table exactly: bytes in 'A'..'Z' map to 'a'..'z', all others (including 0x80..0xFF) pass through unchanged.

Replace the static inline memcpy_tolower in src/proxy/hdrs/URL.cc with this helper. Baseline x86_64 builds use the 16-byte SSE2 path; builds that opt into a wider -march (x86-64-v3 = AVX2, x86-64-v4 = AVX-512BW) get the wider bodies automatically. Sub-16-byte inputs (e.g. short HTTP schemes like "http") use the scalar tail and see no perf change.
Measured time per call on a 2.0 GHz Ice Lake Xeon Gold 6338, mean ns:

    size      scalar   SSE2     AVX2     AVX-512BW
    ----      ------   ----     ----     ---------
    16 B      10.4     2.15     1.75     1.98
    32 B      15.4     2.90     2.24     2.31
    64 B      28.0     4.43     2.85     2.61
    256 B     113      13.87    7.57     6.20
    1024 B    425      50.47    24.23    17.49

Speedup vs scalar at 1024 B: SSE2 8.4x, AVX2 17.5x, AVX-512BW 24.3x.

A new microbenchmark under tools/benchmark covers correctness across sizes 0..257 (bracketing each SIMD body size) plus an exhaustive byte-value sweep that guards against any future widening of the case-fold range, alongside scalar-vs-SIMD throughput numbers and a config-print case that emits the selected ISA path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
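To make the "no per-byte branching, whole register at once" idea concrete without depending on any particular ISA, here is a minimal portable sketch that folds eight bytes per step using ordinary 64-bit arithmetic (SWAR). It is not the code in this PR: the `sketch` namespace and the `fold_byte` / `fold_word` names are hypothetical, and the real helper uses SSE2/AVX2/AVX-512BW/NEON bodies instead. The semantics are the ones stated above: 'A'..'Z' map to 'a'..'z' and every other byte, including 0x80..0xFF, passes through unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace sketch { // hypothetical namespace, not part of tscore

// Scalar reference: 'A'..'Z' -> 'a'..'z', every other byte unchanged.
inline unsigned char fold_byte(unsigned char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<unsigned char>(c + 0x20) : c;
}

// Fold eight bytes at once. Each byte lane is handled independently and
// no carry crosses a lane boundary, because only the low 7 bits are summed.
inline std::uint64_t fold_word(std::uint64_t w) {
  constexpr std::uint64_t ones = 0x0101010101010101ULL;
  const std::uint64_t low7  = w & (ones * 0x7f);
  const std::uint64_t ge_A  = low7 + ones * (0x80 - 'A'); // lane top bit set if low7 >= 'A'
  const std::uint64_t gt_Z  = low7 + ones * (0x7f - 'Z'); // lane top bit set if low7 >  'Z'
  const std::uint64_t ascii = ~w & (ones * 0x80);         // lane top bit set if byte < 0x80
  const std::uint64_t upper = (ge_A ^ gt_Z) & ascii;      // 0x80 in each uppercase lane
  return w | (upper >> 2);                                // 0x80 >> 2 == 0x20
}

// Copying fold; dst == src (in-place) is fine because each word is
// loaded into a local before the store.
inline void tolower_copy(char *dst, const char *src, std::size_t n) {
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    std::uint64_t w;
    std::memcpy(&w, src + i, 8);
    w = fold_word(w);
    std::memcpy(dst + i, &w, 8);
  }
  for (; i < n; ++i) { // scalar tail, always fewer than 8 bytes
    dst[i] = static_cast<char>(fold_byte(static_cast<unsigned char>(src[i])));
  }
}

} // namespace sketch
```

The wide-body-plus-scalar-tail shape is the same one the commit describes, only with an 8-byte body instead of 16, 32, or 64 bytes.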
- Move ts::memcpy_tolower correctness coverage out of the ENABLE_BENCHMARKS-gated benchmark and into a new src/tscore/unit_tests/test_ink_memcpy_tolower.cc so ctest exercises the scalar and SIMD paths in every build. Covers boundary sizes bracketing each SIMD body width, the exhaustive 0..255 byte-value sweep, and the in-place (dst == src) form (Copilot).
- Fix the implementation-note comment on ts::memcpy_tolower to describe the actual AVX-512BW control flow (gated main loop + masked-tail load/store + early return), and document that in-place (dst == src) is supported on every path (Copilot).
- Add a Catch::Benchmark::keep_memory barrier in benchmark_memcpy_tolower so the compiler can no longer DCE the inlined stores past the first observed byte (Copilot).
- Migrate the in-place tolower loop in src/proxy/http3/QPACK.cc::_encode_header to ts::memcpy_tolower, demonstrating the in-place contract (bryancall).
- Add Tony Finch's copytolower64.c attribution to NOTICE (masaori335).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
memcpy_tolower carried two warts: the "memcpy" prefix implied non-overlapping by convention with libc memcpy (we explicitly support the in-place case), and the unqualified name didn't surface the ASCII-only semantics.

Rename the helper to ts::ascii::tolower_copy and add a thin ts::ascii::tolower_inplace(buf, n) wrapper so call sites that operate on a single buffer read naturally instead of passing the same pointer twice. Rename the header to include/tscore/ink_ascii_tolower.h, the unit test to src/tscore/unit_tests/test_ink_ascii_tolower.cc, and the benchmark to tools/benchmark/benchmark_ascii_tolower.cc to match. Update the two existing call sites (URL.cc fast-path scheme/host and QPACK::_encode_header in-place name lowercasing) accordingly.

No behavior change: the helper bodies are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate two more byte-at-a-time ASCII tolower loops to ts::ascii::tolower_copy. Both call sites use a separate destination buffer, so the copy form is the right fit:

- hpack_encode_header_block(): lower-cases each MIMEField name before encoding to match the HTTP/2 lowercase-header-name requirement.
- UrlRewrite::_mappingLookup(): lower-cases the incoming request host into a stack buffer before the table lookup, so the lookup is case-insensitive against the lower-cased keys built at config-load time. The previous code used libc tolower(int) on signed char values, which is technically UB for bytes >= 0x80; the new call avoids that.

The existing unit tests in test_URL, test_HpackIndexingTable, and test_RemapRules executed the tolower paths but only with inputs that were already lower-case, so they would have missed a "skip the lowercasing" regression. Add focused behavioral coverage:

- test_URL.cc: four extra get_hash_test_cases that hash a request with uppercase/mixed-case scheme or host and require an equal hash to the lower-case form. Includes a 49-byte uppercase host that crosses both the 16- and 32-byte SIMD bodies.
- test_RemapRules.cc: a new SCENARIO that builds a UrlRewrite from a map for a lower-case host and requires that uppercase, mixed-case, and long-uppercase request hosts all match.
- test_HpackIndexingTable.cc: a new TEST_CASE that encodes a long mixed-case field name with hpack_encode_header_block and requires the encoded byte stream to be identical to encoding the same field with an already-lower-case name.

QPACK already exercises the in-place path through its Encoding test and the helper's own ts::ascii::tolower_inplace unit test covers in-place semantics exhaustively; an additional focused QPACK test would need the external .qif fixture infrastructure, which is out of scope here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an optional runtime-dispatched SIMD path for ats_base64_encode / ats_base64_decode, gated by the ENABLE_HIGHWAY_DISPATCH CMake option (off by default; the scalar path is unchanged when off).

The scalar primitives move to ink_base64_scalar.h so the SIMD kernel's scalar tail and the scalar path share one definition and cannot drift. The kernels in ink_base64_dispatch.cc use Highway's portable SIMD ops (foreach_target + HWY_DYNAMIC_DISPATCH), so one source compiles for SSE4/AVX2/AVX-512 and the best target supported by the live CPU is chosen at runtime. The algorithms and lookup tables derive from the simdutf library (Mula/Lemire vectorized base64; aqrit's combined standard/URL-safe classifier), re-expressed in Highway; see NOTICE.

Decode fuses validation into the SIMD loop (consuming only fully-valid 16-byte blocks and finishing the remainder, including any non-alphabet truncation, on the scalar tail), so output is byte-for-byte identical to the scalar decoder, including in-place use and mixed standard/URL alphabets. Encode reuses the scalar encoder for the padded tail.

Tests: unit_tests/test_ink_base64.cc cross-checks the public path against the scalar reference across sizes that straddle the SIMD thresholds, both alphabets, in-place decode, truncation at every position, and undersized output buffers; with the option on these become SIMD-vs-scalar parity checks. tests/fuzzing/fuzz_base64.cc adds a libFuzzer target that decodes untrusted input and cross-checks both paths under sanitizers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
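As a rough illustration of the scalar-tail behavior a SIMD decoder has to reproduce, the sketch below decodes the longest alphabet-only prefix, accepts both the standard and URL-safe alphabets in the same input, and processes whole 4-character groups followed by an explicit 2- or 3-character tail. It is a stand-alone approximation, not the contents of ink_base64_scalar.h: the `sketch` namespace, the `sextet` and `decode` names, the `std::string` return type, and the choice to drop a lone trailing character are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

namespace sketch { // hypothetical namespace, not part of tscore

// 6-bit value of a base64 character, or -1 if it is in neither the
// standard ('+', '/') nor the URL-safe ('-', '_') alphabet.
inline int sextet(unsigned char c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+' || c == '-') return 62;
  if (c == '/' || c == '_') return 63;
  return -1;
}

inline std::string decode(std::string_view in) {
  // Decodable prefix: stop at the first non-alphabet byte ('=' included).
  std::size_t len = 0;
  while (len < in.size() && sextet(static_cast<unsigned char>(in[len])) >= 0) {
    ++len;
  }
  auto val = [&](std::size_t i) {
    return static_cast<std::uint32_t>(sextet(static_cast<unsigned char>(in[i])));
  };
  auto put = [](std::string &s, std::uint32_t byte) {
    s.push_back(static_cast<char>(static_cast<unsigned char>(byte & 0xff)));
  };

  std::string out;
  out.reserve(len / 4 * 3 + 2);
  std::size_t i = 0;
  for (; i + 4 <= len; i += 4) { // whole groups only: never reads past the prefix
    const std::uint32_t v = val(i) << 18 | val(i + 1) << 12 | val(i + 2) << 6 | val(i + 3);
    put(out, v >> 16);
    put(out, v >> 8);
    put(out, v);
  }
  const std::size_t rem = len - i; // 0..3 characters left
  if (rem >= 2) {                  // explicit 2/3-character tail
    std::uint32_t v = val(i) << 18 | val(i + 1) << 12;
    put(out, v >> 16);
    if (rem == 3) {
      v |= val(i + 2) << 6;
      put(out, v >> 8);
    }
  }
  // rem == 1 carries only 6 bits; this sketch drops it (an assumption).
  return out;
}

} // namespace sketch
```

A vectorized decoder that consumes only fully valid blocks and hands everything else to a routine of this shape inherits its truncation and mixed-alphabet behavior for free, which is the property the parity tests check.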
Summary
Adds SIMD-accelerated implementations of ASCII lowercasing (ts::ascii::tolower_copy / tolower_inplace) and base64 encode/decode (ats_base64_encode / ats_base64_decode), built on Google Highway and selected at runtime by CPU capability. Both are gated behind a new build option that defaults to OFF; without it the scalar paths are used and there is no behavior change to existing builds.

This combines the previously separate to_lower and base64 SIMD efforts into one series and folds in the review fixes applied on our internal branch.
ASCII to_lower
- ts::ascii::tolower_copy(dst, src, n) / tolower_inplace(buf, n) in include/tscore/ink_ascii_tolower.h. Folds A–Z → a–z; all other bytes (including 0x80–0xFF) pass through unchanged; no UTF-8 folding; in-place (dst == src) supported.
- ink_ascii_tolower_dispatch.cc: one source compiled for SSE4/AVX2/AVX-512/NEON via foreach_target; the best target for the live CPU is chosen once and cached. When the option is off, a portable scalar loop is used.
- Call sites: URL.cc, HPACK.cc, QPACK.cc, UrlRewrite.cc, with behavioral tests added alongside each (test_URL, test_RemapRules, test_HpackIndexingTable).

base64
- ink_base64_dispatch.{cc,h}: the vectorized base64 algorithms from simdutf, re-expressed in Highway (Muła/Lemire; aqrit's combined standard/URL-safe classifier).
- ink_base64_scalar.h: shared by the scalar path and the SIMD path's tail so the two cannot drift. Decode fuses validation into the SIMD loop and hands the remainder (including truncation at the first non-alphabet byte) to the scalar tail, so SIMD output is byte-for-byte identical to scalar, including in-place decode and mixed standard/URL-safe alphabets.
- ats_base64_decode: when the decodable prefix length was not a multiple of four, the old loop ran one iteration past the prefix (over-reading the input, and reading inBuffer[-2]). Decode now processes only whole 4-character groups plus an explicit 2/3-character tail. The decoded length and bytes are unchanged for every well-defined input.

Build / wiring
- ENABLE_HIGHWAY_DISPATCH (default OFF) gates the SIMD paths via TS_HAS_HIGHWAY_DISPATCH; EXTERNAL_HWY selects an external Highway over the vendored copy.
- branch-highway CMake preset builds with the option on, turning the unit tests into real SIMD-vs-scalar parity checks.
- NOTICE updated to attribute simdutf and Google Highway.

Performance
Measured on an Intel Xeon Gold 6338 (Ice Lake-SP, AVX-512), Release build (-O3), Highway dispatching to its AVX-512 target. Baselines are the scalar paths these replace. The public APIs keep the scalar path below the SIMD thresholds (encode 24 B, decode 32 chars) to avoid dispatch overhead on tiny inputs, which is why the smallest sizes show little gain.

ASCII tolower, ns per call, vs the byte-at-a-time ink_tolower loop:

base64 decode, GB/s on input chars:

base64 encode, GB/s on input bytes:
Testing
- Unit tests (test_ink_ascii_tolower.cc, test_ink_base64.cc) compare the public path against an independent scalar reference across sizes, alphabets, truncation, in-place, and buffer-bound cases; with ENABLE_HIGHWAY_DISPATCH=ON they become SIMD-vs-scalar parity tests.
- tests/fuzzing/fuzz_base64.cc: libFuzzer target that decodes untrusted input and cross-checks both paths under sanitizers.
- tools/benchmark/benchmark_ascii_tolower.cc reproduces the tolower numbers above.

Notes
branch-highway preset to get parity coverage of the SIMD kernels.

🤖 Generated with Claude Code