eth/protocols/wit, consensus/bor: WIT2 — BP-signed witness announcements with transitive relay and pre-import serving#2208
lucca30 wants to merge 15 commits into
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
test |
Code Review
3 issues found. Checked for bugs and CLAUDE.md compliance.

1. Performance: redundant witness encoding
File:
On a 50 MiB witness this adds ~100–300 ms of redundant CPU work per verified fetch, which is meaningful given WIT2's goal of eliminating per-hop latency.
Suggested fix: Have

2. Performance: unconditional encode+hash before signed-announcement check
File:
Every witness broadcast, including from WIT1 peers, pays the full encode+hash cost (~150–450 ms on 50 MiB witnesses) even when the result is never used.
Suggested fix: Check

3. Bug: peer dropped on local EncodeRLP failure
File:
When This is inconsistent with the pattern in
Suggested fix: Change m.handleWitnessFetchFailureExt(hash, "", fmt.Errorf("witness encode failed: %w", err), false)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #2208 +/- ##
===========================================
+ Coverage 53.67% 53.96% +0.28%
===========================================
Files 897 901 +4
Lines 160369 161206 +837
===========================================
+ Hits 86086 86994 +908
+ Misses 68946 68864 -82
- Partials 5337 5348 +11
... and 24 files with indirect coverage changes
Review responses
All 3 review issues addressed in 12368a3:
The CI status
|
Three high-severity issues from the Codex adversarial review on PR #2208, each with a TDD regression test added first, then fixed:

1. Header-race: signed announce arriving before header was silently downgraded. handleSignedWitnessAnnouncements called peer.AddKnownAnnounce unconditionally before the verification gate, leaving a peer marked announce-known even on bad-signature / header-unknown rejection paths. That suppressed our own re-relay back to that peer if a valid version of the same hash arrived from someone else, killing the natural recovery path. Fix: gate AddKnownAnnounce on acceptSignedAnnouncement success so the announce-known bit only reflects verified delivery. Test: TestHandleSignedWitnessAnnouncementsBadSigDoesNotMarkAnnounceKnown.

2. pendingWitnessBodies TTL didn't actually evict. get() observed expiry and returned false but left the entry in the map; gcLocked only ran from put(), so a node that stopped receiving witnesses retained up to capacity (10) ~50MB blobs indefinitely. Fix: when get() observes an expired entry, upgrade to write lock and delete it (re-checking under the write lock to avoid clobbering a concurrent put). Test: TestPendingWitnessBodyCacheGetEvictsExpired.

3. Honest body-server dropped on bad producer commitment. verifyAgainstSignedHash dropped the byte-server on every signed-hash mismatch, but the announcement only proves that *some* BP signed *some* hash, not that the hash matches the canonical witness. A faulty or malicious scheduled producer that signed a bogus hash would weaponise this to disconnect every honest peer serving the real witness. Fix: reject the bytes (don't cache for serving) and back off the request without dropping the byte-server. TODO comment left for follow-up signer-quarantine work, which needs (signer, relayer) provenance the manager doesn't currently have. Test: TestProcessWitnessResponseDoesNotDropOnByteMismatch (replaces the previous TestProcessWitnessResponseDropsOnHashMismatch, whose policy this commit reverses).
Adversarial-review findings: addressed via TDD
Codex flagged 3 high-severity issues in an adversarial pass. For each I wrote a failing regression test first, then implemented the smallest correct fix, then confirmed all 919 tests across
What's still out of scope (filed as TODOs at the call site, not in this PR)
|
Adversarial review: all 4 items now closed
Pushed b3cb00e: deferred-announce queue closes the cosend race.

A small in-memory cache (capacity 256, TTL 30s) holds signed announcements whose producer-binding could not be checked at receive time because the matching block header wasn't local yet. handler.Start subscribes to ChainHeadEvent; on each new block, drainDeferredAnnouncesFor(blockHash) re-runs verification, caches the announcement on success, credits the original sender as announce-known, and relays. Mirrors the existing pendingWitnessBodyCache lifecycle.

isScheduledProducer was also tightened: header presence is now checked first regardless of consensus engine. The previous non-bor early-return skipped the header check entirely, which was incorrect on its own: an announce we can't tie to a local block is unverifiable here.

920/920 tests pass across eth, eth/fetcher, eth/protocols/wit, consensus/bor, core/stateless. go build ./... clean.

Still deferred (not adversarial-review-flagged): signer-quarantine ban-list (needs signer/relayer provenance plumbed to the witness manager; TODO at the call site); byte-budgeted cap on pendingWitnessBodies (count cap of 10 with working TTL is bounded; future-proofing item).
|
Code review
No issues found. Checked for bugs and CLAUDE.md compliance.
…ncements with transitive relay and pre-import serving

Adds WIT2 (protocol version 3): block producers sign a chunked-parallel commitment over each witness, peers verify the signature and relay the announcement at network-RTT speed without execution, and any peer holding the body can serve it pre-import from an in-memory cache. Byte-correctness is verified by requesters against the BP-signed WitnessHash, attaching tampering blame to the server; content-correctness (state-root) failures attach to the BP. Removes the per-hop ~500 ms execution gate that today serialises witness propagation through stateless validators.

Witness commitment uses 1 MiB chunked-parallel keccak (keccak256 of the concatenation of per-chunk hashes), measured at ~13.5 ms wall-clock for 50 MiB witnesses on 8 cores vs ~88 ms single-shot. Wire format and signature shape are unchanged from a single-keccak commitment; only the function mapping bytes to the 32-byte commitment changes.

Producer-side signing reuses the engine SignerFn via consensus/bor.SignBytes with a dedicated mimetype (application/x-bor-wit2-announce) and a domain-separated digest tag, replay-resistant at both the digest and signer-call levels. Receivers verify ecrecover against the scheduled producer for the announced block; announces for blocks whose header is not yet locally available are deferred (no strike) so the block-cosend race does not punish honest relayers.

Pre-import serving cache (capacity 10) is fed from the paged-fetch path the moment byte-correctness check passes, before chain write. Cache entries are gated on a BP-signed WitnessHash being on file: relayers never cache unverified bytes, and WIT1 fallback paths skip the cache entirely. handleGetWitness consults the cache before chain storage.

Wire: new protocol version WIT2 = 3, new message SignedNewWitnessHashesMsg = 0x06 with up to 64 announcements per packet. WitnessMetadataResponse extended with WitnessHash. WIT1 peers continue using NewWitnessHashes; mixed mesh tolerated.

Rate-limits: 200 ms per-(blockHash, peer) relay rate-limit, 30 s announce TTL, per-peer token bucket (burst 256, refill 64/s), strike disconnect at 5 invalid signed announces per minute. Conflicting WitnessHash for the same BlockHash is rejected via signedWitnessCache.putIfNewer.

Operator note: validators running Clef as their signer must whitelist the mimetype application/x-bor-wit2-announce; without it the producer falls back to unsigned WIT1 announces.
- eth/handler_wit2.go: remove unused errInvalidSigner, contextBackground, wit2SpanLookupMissMeter, and now-unused context import
- core/stateless/witness_commit_bench_test.go: drop redundant c := c loop-var copies (Go 1.22+ copyloopvar)
- goimports formatting on accounts/accounts.go, witness_commit_bench_test.go, witness_commit_helpers_test.go, eth/fetcher/witness_manager.go, eth/fetcher/witness_manager_wit2_test.go, eth/handler_wit2.go, eth/protocols/wit/protocol.go
… drop
- eth/fetcher/witness_manager.go: verifyAgainstSignedHash now returns the canonically-encoded body and signed hash on success, so the pre-import serving cache no longer re-encodes the same witness (~14 ms saved per verified fetch on 50 MiB witnesses). cacheVerifiedWitnessForServing takes the precomputed body directly.
- eth/fetcher/witness_manager.go: local EncodeRLP failure inside verifyAgainstSignedHash no longer drops the peer: re-encoding bytes the peer already delivered as valid RLP is a local invariant violation, not peer misbehavior. Mirrors the pattern already used by the cache path.
- eth/handler_wit.go: hoist signedWitnesses.get(hash) above the EncodeRLP + WitnessCommitHash work in handleBroadcastWitness. WIT1 broadcasts (no signed announcement on file) used to pay the full encode+hash cost only to discard the result; now they short-circuit.
- eth/fetcher/witness_manager_wit2_test.go: rename + retarget the no-signed-hash regression test onto verifyAgainstSignedHash, where the invariant now lives.
Fourth and final adversarial-review item. Block + signed-announce gossip streams travel independently and can reach a node in either order. When the announce arrives first, isScheduledProducer returns (ok=false, headerAvailable=false) and the previous code dropped the announcement on the floor, relying on mesh re-gossip to reconstruct the signed-hash for that block. In sparse meshes (single-cosend window, small fanout) re-gossip never fires and subsequent witness fetches silently fall back to the unsigned WIT1 path, leaking the WIT2 byte-verification guarantee for that block.

This commit holds the announcement instead of dropping it:
- New deferredAnnounceCache mirrors pendingWitnessBodyCache: capacity 256, TTL = wit2AnnounceTTL (30s), oldest-evict, in-place expiry on take().
- acceptSignedAnnouncement's deferral branch now puts the announcement into deferredAnnounces.
- New drainDeferredAnnouncesFor(blockHash) re-runs verification for the matching announcement, caches it on success, credits the original sender as announce-known, and relays. On still-header-unknown (rare: the chain-head fired but the indexed header isn't reachable yet by hash) the entry is re-stashed to ride the next chain-head event.
- handler.Start subscribes to ChainHeadEvent and runs deferredAnnouncesLoop, which calls drainDeferredAnnouncesFor on each imported block. handler.Stop unsubscribes via quitSync.

isScheduledProducer was reordered to check header presence first regardless of consensus engine. The previous early-return for non-bor test chains skipped the header check entirely, which was incorrect on its own (an announce we can't tie to a local block is unverifiable here) and prevented unit tests from exercising the deferral path. Bor producer recovery still runs only when a bor engine is present.

Test: TestDeferredSignedAnnounceDrainedAfterHeaderArrives covers the full lifecycle: announce arrives header-unknown (deferred, not cached, sender not credited), header lands, drain runs, announcement is now cached and the deferred entry is consumed.
…ce entries (W-1)

handleGetWitness: bound len(WitnessPages) to MaxWitnessPagesServe. A request packed with unknown hashes or out-of-range pages accumulates zero data bytes and trips neither byte guard, while still forcing one DB size lookup per distinct hash and one response entry per page, a CPU/IO/alloc amplification vector. Legitimate requests carry a single page, so the bound is never approached. Also fix an error-string typo.

deferredAnnounceCache: add a per-peer live-entry cap (capacity/8) so a single peer cannot saturate the deferred-announce queue and evict honest header-racing announces. The cache is keyed by blockHash, so bounding the claimed BlockNumber is no defence (an attacker reuses a near-tip number with distinct fake hashes); the per-peer cap is the bound that holds. Per-peer accounting is maintained across put/take/gc/evict; add wit2DeferredPerPeerDropMeter.

Tests: TestHandleGetWitness_PageCountBound, TestDeferredAnnounceCachePerPeerCap.
Fixes the stateless-consumer sync regression where, in an all-WIT2 fleet,
a node always fetches the witness body from an announce-only relayer (no
peer is ever marked as a body-holder) and re-polls it with empty
GetWitness until that relayer obtains the body. WIT1 stays in lockstep
because its hash-announce both implies the sender holds the body and marks
it as a holder, so the first pull lands; WIT2 relays the signed announce
ahead of the body, leaving the consumer to poll.
- handler: record peers that ask for a body we don't yet hold but have a
BP-signed announcement for (witnessWaiterRegistry, bounded + 30s TTL),
and push the full witness to them the moment we obtain it. Three triggers
cover how a node comes to hold a body: our own verified fetch
(cacheVerifiedWitnessForServing), a gossip broadcast
(handleWitnessBroadcast), and — the dominant case for full/producing
nodes — generating it during native block import, flushed from the
chain-head loop (flushWitnessWaitersForImported). Restores the WIT1-style
hand-off without flooding: at most one body per peer that actually asked.
- fetcher: route empty ("body not ready yet") responses to a dedicated
backoff (first retries immediate, then exponential to a 1s cap) instead
of a tight ~gatherSlack re-poll; never drop the request (the witness
provably exists) or penalise the responder.
Covered by TestEmptyGetWitnessForSignedHashPushesBodyOnArrival,
TestFlushWitnessWaitersForImportedPushesFromChainStorage, and
TestEmptyResponseBacksOffToAvoidHammering (all fail before the fix).
b3cb00e to
fe20334
Follow-up: stateless-sync fix + devnet validation (
|
| metric | before fix | after fix |
|---|---|---|
| S1 milestone-lag p50 | 646 ms | 0.9 ms |
| stateless height vs BP1 (at teardown) | 4–6 blocks behind | 0–1 |
| S1 blocks served in ≤ 2 fetches | — | 35 / 38 (92 %) |
The median-lag collapse (646 ms → 0.9 ms) is only explicable if the body is pushed the instant BP1 imports the block, rather than discovered on a later poll. All stateless nodes end height-locked within 0–1 of BP1, versus 4–6 behind before.
It took two iterations: a first version with only the fetch + broadcast triggers showed no improvement, because S1's peer (a producing node) obtains witnesses by generating them on import — which fired neither hook. The chain-head trigger closed that gap.
Honest residual
A p95 tail of ~1.3–2 s remains, from a few straggler blocks per node (3 of 38 on S1) where the consumer's sole relayer is itself slow to obtain the block — no push can deliver a body the relayer doesn't yet hold. This is partly inherent to single-source stateless sync and is present in pre-regression (develop-era) runs too; it is not the witness-fetch regression. Numbers are from a single short run — the p50 signal is robust, the p95 noisier.
Tests
TestEmptyGetWitnessForSignedHashPushesBodyOnArrival, TestFlushWitnessWaitersForImportedPushesFromChainStorage, and TestEmptyResponseBacksOffToAvoidHammering — each fails before the fix and passes after.
…g test TestReconstructWitness/DuplicatePageMisreconstructs documents that the multi-page path has no (hash,page) dedup: a duplicated page index satisfies the TotalPages count with a real page missing and reassembly fires over the wrong byte stream. Pinned for a future dedup fix; not fixed here (out of scope). See PR discussion.
Two hardening fixes on the witness broadcast/push path (convergence round 1):
- pushWitnessToWaiters now refuses to full-push a witness whose canonical encoding exceeds witnessPushMaxSize (wit message cap minus envelope margin). The receiver enforces a 16MB cap on inbound wit messages, so an oversized NewWitness push would get this node dropped as a protocol violator by the very peers it is trying to help. Oversized witnesses stay on the paged pull path; the bytes are servable by the time a push could fire, so the waiter's backed-off poll succeeds. New meter: eth/wit2/serve/waiter_push_oversize.
- handleWitnessBroadcast is now verify-or-drop. Bytes contradicting a BP-signed witnessHash on file are fully rejected (previously they skipped the serving cache but were still injected into the block fetcher and the sender was marked as a body-holder, a bypass of the byte verification the paged-fetch path enforces). With no signed announcement on file the broadcast is only accepted for a locally known block header, restoring the known-block binding the inline comment promised. AddKnownWitness moved to accept paths only. New meter: eth/wit2/serve/broadcast_unknown_header_drop.

Tests: TestHandleWitnessBroadcastByteMismatchNotInjected, TestHandleWitnessBroadcastDropsUnknownHeader, TestWaiterPushSkipsOversizedWitness.
The verify-or-drop hardening of handleWitnessBroadcast dropped the one NewWitness push that matters: a stateless consumer at the tip has, by definition, not imported the block it needs the witness for, so its header is unknown and the signed announce sits in deferredAnnounces (producer binding needs the header). Both acceptance paths missed and the pushed body was discarded, re-opening the stateless lag the push was added to cure (convergence round 2 catch).

Add an import-only acceptance tier between the two: a broadcast whose bytes match a fresh deferred commitment is injected into the block fetcher so the pending block can import (import re-verifies everything via stateless execution and the state-root check), and the sender is marked a body-holder. It is NOT cached for serving, NOT promoted into signedWitnesses, and NOT relayed: those carry the verified-announce trust property, and a deferred entry's producer is unverified until the post-import drain checks it against the chain-validated header. Verifying against the header embedded in the pushed witness instead would let a peer self-seal a fabricated header and pass its own announce as the producer's. The deferred entry is read with a non-consuming peek so the drain still runs. Bytes contradicting the deferred commitment still drop.

New meter: eth/wit2/serve/broadcast_deferred_import_only. Test: TestHandleWitnessBroadcastAcceptedWhileAnnounceDeferred.
At the stateless tip the deferred state is structural, not transient: a signed announce cannot be producer-verified before its block imports, the block cannot import without the witness, and a deferred (unverified) announce deliberately marks no peer announce-known. getOnePeerWithWitness therefore has no candidate, and for witnesses above the full-push size cap there is no NewWitness push either - leaving the consumer with no body source at all (convergence round 3 catch). resolveWitnessFetchPeer now falls back to the peer recorded on the fresh deferred entry (peek also returns the relayer ID): the relayer that announced the witness is on the propagation path and is exactly who a pull should target. Its bytes are NOT byte-checked against the deferred commitment - an unverified announcement must not be able to veto or bless data, else a Byzantine announce could reject the real witness pages. Import (stateless execution + state-root check) remains the verifier, exactly as on every WIT1 fetch. Test: TestResolveWitnessFetchPeerFallsBackToDeferredAnnouncer.
Every node with an authorized signer (i.e. every validator) signed WIT2 announcements for every block it announced or cosent, including blocks other validators produced. Receivers bind announce-signer to the header sealer and strike on mismatch, so on a devnet with all-validator WIT2 peers, honest validators strike-disconnected each other roughly every 20s (observed: 21-32 strike_disconnects per validator per 8min run). The self-signed foreign announce was also cached in signedWitnesses, where it can shadow the producer's real announcement (putIfNewer dedups by block hash) and suppress its transitive relay.

Gate the sign path on the same producer binding the receive side enforces: signLocalWitnessAnnouncement refuses any block the local signer did not seal (maySignAnnouncementForBlock, a thin wrapper over verifyScheduledProducer). For foreign blocks the announce and cosend paths fall back to the unsigned WIT1 hash announce, which is truthful: both paths are gated on HasWitness.
Patch coverage 64.75% -> ~91% (codecov target 90%):
- new unit tests for the wit2 caches/registries (rate-limit tracker, waiter registry caps/expiry, deferred-announce cache lifecycle, signed-witness cache), the announce receive path (accept/dedup/relay, rate-limit drop, number-mismatch strike), the cosend and BroadcastBlock per-version announce split, the chain-head deferred drain, and the wit wire path (signed-announce round-trip, malformed-packet rejection, queue-full drop)
- codecov: ignore eth/peer_mock.go (generated gomock, like the other ignored mocks)

Sonar new-code duplication 4.9% -> ~2% (threshold 3%):
- shared fixtures in witness_manager_wit2_test.go (primePendingWitness, witnessResponse, encodedCommitHash), handler_wit2_test.go (persistedSignedWitness, requestFirstWitnessPage), and witness_bench_test.go (benchWitnessSizes)

diffguard:
- split handler_wit2.go (988 lines) into handler_wit2.go + handler_wit2_bodies.go + handler_wit2_announces.go, all under the 500-line threshold
- handleWitnessBroadcast (96 lines, complexity 17) extracted into three accept-path helpers; deferredAnnounceCache.put eviction scan extracted (complexity 13 -> under 10)

No behavioral changes outside test code: the broadcast/eviction refactors are mechanical extractions covered by the existing and new tests.
…and file size
- Extract hashWitnessChunks/witnessCommitWorkerCount from WitnessCommitHash (complexity 18 -> under threshold); commitment recipe unchanged, pinned by the existing shape tests.
- Move the PR-added WIT2 fetcher code (verifyAgainstSignedHash, cacheVerifiedWitnessForServing, handleWitnessBodyNotReady, empty-response backoff and its constants) into eth/fetcher/witness_manager_wit2.go so witness_manager.go stays within the oversized-file growth tolerance.
- Add coverage for the patch lines CI flagged: SignBytes error paths and CurrentSigner, WitnessCommitHashFromWitness, the TTL gc sweep of all four wit2 caches, drainDeferredAnnouncesFor guards, AddKnownAnnounce, and the announce-packet decode failure.
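For readers unfamiliar with the commitment recipe this refactor leaves unchanged (hash each 1 MiB chunk in parallel, then hash the concatenation of the per-chunk hashes), here is a rough sketch of its shape. It is not the PR's WitnessCommitHash: the name `chunkedCommit` is hypothetical, the handling of empty input is an assumption, and SHA-256 from the Go standard library is swapped in for keccak256 so the example runs without external modules. The resulting digests therefore differ from real WIT2 commitments.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

const chunkSize = 1 << 20 // 1 MiB chunks, as stated in the WIT2 commit message

// chunkedCommit hashes each chunk on a worker pool and then hashes the
// concatenation of the per-chunk digests. SHA-256 stands in for keccak256.
func chunkedCommit(data []byte, workers int) [32]byte {
	if workers < 1 {
		workers = 1
	}
	n := (len(data) + chunkSize - 1) / chunkSize
	digests := make([]byte, n*sha256.Size)

	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				start := i * chunkSize
				end := start + chunkSize
				if end > len(data) {
					end = len(data)
				}
				sum := sha256.Sum256(data[start:end])
				// Each chunk owns a disjoint 32-byte slot, so no locking is needed.
				copy(digests[i*sha256.Size:], sum[:])
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	return sha256.Sum256(digests)
}

func main() {
	data := make([]byte, 3*chunkSize+123)
	for i := range data {
		data[i] = byte(i % 251)
	}
	fmt.Printf("1 worker:  %x\n", chunkedCommit(data, 1))
	fmt.Printf("8 workers: %x\n", chunkedCommit(data, 8))
}
```

Because every chunk digest is written to a slot determined by its index, the result does not depend on the worker count or on scheduling order, which is the property the PR's worker-invariance tests are described as pinning.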
…ed-announce # Conflicts: # eth/peerset_test.go
@claude review |
Pull request overview
This PR introduces WIT2 (wit protocol v3) to speed up witness propagation by allowing block-producer–signed witness announcements to be verified and relayed without executing/importing blocks, plus an in-memory pre-import witness-body serving cache to reduce multi-hop latency for stateless validators.
Changes:
- Add WIT2 wire protocol support (new signed-announce message, signing digest helpers, peer queues, handler routing).
- Implement WIT2 relay/verification path, deferred-announce handling, rate limiting/strikes, and a pre-import witness-body cache with waiter push semantics.
- Add a chunked-parallel witness commitment hash (WitnessCommitHash) and make witness RLP encoding deterministic (sorted state nodes), with extensive tests/benches.
Reviewed changes
Copilot reviewed 39 out of 40 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| eth/protocols/wit/protocol.go | Adds WIT2 version/message, signed announcement types, and signing preimage/hash helpers. |
| eth/protocols/wit/protocol_wit2_test.go | Pins WIT2 signing-hash format and handshake advertising. |
| eth/protocols/wit/peer.go | Adds announce-known cache + signed-announce queueing/sending for WIT2 peers. |
| eth/protocols/wit/peer_wit2_test.go | Exercises signed-announce wire round-trip and queue/known-set semantics. |
| eth/protocols/wit/handlers.go | Adds decode-time cap and handler for SignedNewWitnessHashes (WIT2). |
| eth/protocols/wit/handler.go | Adds WIT2 handler map and dispatch by negotiated version. |
| eth/protocols/wit/broadcast.go | Broadcast loop now sends queued signed announcements. |
| eth/peerset.go | Adds announce-only fallback peer selection + peersWithoutSignedAnnounce for relay dedup. |
| eth/peerset_test.go | Tests preference/fallback behavior for witness fetch peer selection. |
| eth/peer.go | Extends WitnessPeer interface with WIT2 announce-known operations + signed announce send. |
| eth/peer_test.go | Adds regression test documenting duplicate-page misreconstruction gap. |
| eth/peer_mock.go | Updates gomock witness-peer mock to include new WIT2 methods. |
| eth/handler.go | Wires WIT2 caches/trackers/waiters and hooks into BlockFetcher construction + head loop. |
| eth/handler_wit2.go | Implements WIT2 signing/verification helpers, strike policy, cosend, and deferred drain loop. |
| eth/handler_wit2_caches_test.go | Comprehensive tests for WIT2 caches, deferral, rate limits, strikes, and serving behavior. |
| eth/handler_wit2_bodies.go | Implements pending witness-body cache + witness-waiter registry. |
| eth/handler_wit2_announces.go | Implements per-peer token bucket + strike tracker, deferred announce cache, signed announce cache. |
| eth/handler_wit.go | Adds signed-announce handling, pre-import serving integration, request entry cap, and serving changes. |
| eth/handler_wit_test.go | Adds tests for new request bounds and deferred-relayer fetch fallback. |
| eth/handler_eth.go | Adds resolveWitnessFetchPeer with deferred-relayer fallback. |
| eth/fetcher/witness_manager.go | Adds WIT2 empty-response behavior, byte-check hook, and serving-cache callback integration. |
| eth/fetcher/witness_manager_wit2.go | Implements WIT2 byte-correctness verification, empty-response backoff, and serving-cache callback. |
| eth/fetcher/witness_manager_wit2_test.go | Tests WIT2 byte-check, empty-response backoff, and cache callback wiring. |
| eth/fetcher/witness_manager_test.go | Updates witness-manager tests for new constructor parameters. |
| eth/fetcher/metrics.go | Adds metric for witness byte mismatch (WIT2). |
| eth/fetcher/block_fetcher.go | Extends BlockFetcher ctor to pass WIT2 signed-hash lookup + cache callback into witness manager. |
| eth/fetcher/block_fetcher_test.go | Updates BlockFetcher tests for new constructor signature. |
| eth/fetcher/block_fetcher_race_test.go | Updates race tests for new constructor signature. |
| core/stateless/witness_commit.go | Adds chunked-parallel witness commitment hash (WIT2 WitnessHash). |
| core/stateless/witness_commit_test.go | Tests commitment determinism, worker invariance, and shape properties. |
| core/stateless/witness_commit_helpers_test.go | Adds benchmark/candidate helpers for witness-commit evaluation. |
| core/stateless/witness_commit_bench_test.go | Adds benches for candidate schemes + correctness guards. |
| core/stateless/witness_bench_test.go | Adds benches for witness encode/hash/sign cost by size. |
| core/stateless/encoding.go | Makes witness EncodeRLP deterministic by sorting state nodes. |
| core/stateless/encoding_test.go | Tests EncodeRLP determinism across insertion order and repeat calls. |
| consensus/bor/signbytes_test.go | Adds tests for SignBytes mimetype forwarding/rejection and signer handling. |
| consensus/bor/bor.go | Adds SignBytes + CurrentSigner APIs for WIT2 signing. |
| codecov.yml | Excludes generated mock from Codecov. |
| accounts/accounts.go | Adds mimetype constant for WIT2 witness announcement signing. |
Files not reviewed (1)
- eth/peer_mock.go: Generated file
func verifySignedAnnouncement(ann wit.SignedWitnessAnnouncement) (common.Address, error) {
	if len(ann.Signature) != wit.SignatureLength {
		return common.Address{}, errInvalidSignatureLength
	}
	digest := wit.WitnessAnnouncementSigningHash(ann.BlockHash, ann.BlockNumber, ann.WitnessHash)
	pubkey, err := crypto.Ecrecover(digest.Bytes(), ann.Signature)
	if err != nil {
		return common.Address{}, err
	}
	var addr common.Address
	copy(addr[:], crypto.Keccak256(pubkey[1:])[12:])
	return addr, nil
}

// Cost: ~150ms keccak over a 50MB witness, plus ~100μs ECDSA. Off the
// block-production critical path; runs once per produced block on the
// announce path.

// On verification failure (bad signature, unknown signer) the sender is
// **not** dropped at this layer — they may simply be relaying a bad upstream
// announcement. Drops are reserved for byte-correctness failures at fetch
// time. We do, however, count invalid announcements via metrics to surface
// misbehaving relayers.

if totalPages == 0 {
	if _, hasSigned := (*handler)(h).signedWitnesses.get(witnessPage.Hash); hasSigned {
		(*handler)(h).witnessWaiters.record(witnessPage.Hash, peer)
	}
}

if existing, exists := c.entries[ann.BlockHash]; exists {
	// Overwrite for the same hash: net-zero slot change. Move per-peer
	// credit if a different peer now carries this announce forward.
	if existing.peerID != peerID {
		c.decPeerLocked(existing.peerID)
		c.perPeer[peerID]++
	}
	c.entries[ann.BlockHash] = &deferredAnnounceEntry{
		announcement: ann,
		peerID:       peerID,
		receivedAt:   time.Now(),
	}
	return
}

// New hash for this peer: enforce its share of the queue so no single peer
// can monopolise the cache and evict honest header-racing announces.
if c.perPeer[peerID] >= c.perPeerCap {
	wit2DeferredPerPeerDropMeter.Mark(1)
	return
}
🔴 deferredAnnounceCache.put has two defects on the existing-key (overwrite) branch (eth/handler_wit2_announces.go:173-186): (1) it unconditionally overwrites the cached announcement without verifying that the new WitnessHash matches the existing one — asymmetric with signedWitnessCache.putIfNewer which DOES reject WitnessHash conflicts at lines 313-332; (2) it increments c.perPeer[peerID] at line 178 with no perPeerCap check, defeating the explicit DoS bound the function comment commits to (the cap check at line 190 only runs on the new-hash branch). Together, any peer within ordinary rate-limit budget can silently downgrade WIT2 byte-correctness for arbitrary deferred blocks: re-announce a victim hash with a forged signature, the overwrite replaces the honest entry without per-peer accounting protection; on block import the attacker's entry fails producer-binding and is dropped, signedWitnesses stays empty, witness fetches skip byte-verification (parentSignedWitnessHash returns ok=false → verifyAgainstSignedHash short-circuits). Fix mirrors the sibling cache: reject overwrites whose WitnessHash differs from the existing entry (surface via wit2ConflictingWitnessHashMeter), AND enforce the perPeerCap check before perPeer[peerID]++ when existing.peerID != peerID.
Extended reasoning...
What the bug is
deferredAnnounceCache.put at eth/handler_wit2_announces.go:173-205 combines two distinct flaws on the overwrite branch:
(1) No WitnessHash conflict check. When an entry for ann.BlockHash already exists, the function unconditionally overwrites the announcement with whatever the new caller passed, with no check that the new WitnessHash matches the existing one. The sibling signedWitnessCache.putIfNewer at lines 313-332 enforces exactly this guard and surfaces conflicts via wit2ConflictingWitnessHashMeter — but the deferred cache adopts the opposite policy.
(2) Per-peer cap bypass. The check if c.perPeer[peerID] >= c.perPeerCap (line 190) sits on the new-hash branch only. The overwrite branch at line 178 does c.perPeer[peerID]++ unconditionally when existing.peerID != peerID, with no cap check. The comment block above the function explicitly markets the per-peer cap as the structural defense ensuring "no single peer can monopolise the cache and evict honest header-racing announces" — but on the overwrite branch that invariant does not hold.
Why existing code does not prevent it
verifySignedAnnouncement (eth/handler_wit2.go:67-79) recovers from ANY 65-byte signature — it has no signer-membership gate, because at deferral time the header is unknown by definition. acceptSignedAnnouncement (eth/handler_wit.go) routes the deferral path silently: no strike, no AddKnownAnnounce, no relay. The attacker stays within their wit2AnnounceBurstCap=256 / 64-per-second rate budget and never accumulates strikes.
Step-by-step proof (the merged attack)
1. Honest P1 receives a legitimate signed announce for blockHash X with WitnessHash Y (BP-signed by the real producer), header for X not yet local on the victim. The victim defers via deferredAnnounces.put(ann_Y, P1): line 199, new-hash path, perPeer[P1] = 1.
2. Attacker P2 crafts a structurally valid 65-byte ECDSA signature over (X, X_blockNumber, Z) for any chosen WitnessHash Z using a throwaway key. verifySignedAnnouncement recovers some address, isScheduledProducer returns (false, false) (header unknown), and the victim falls into the deferral branch.
3. The put hits the existing-key path. existing.peerID (P1) != peerID (P2) → decPeerLocked(P1) then perPeer[P2]++. The honest entry is silently replaced with (ann_Z, P2), with no WitnessHash check (defect 1).
4. P2 repeats steps 2-3 against other honest peers' entries. Each overwrite increments perPeer[P2] unconditionally past perPeerCap (defect 2): with capacity 256 and perPeerCap = 256/8 = 32, P2 can grow to ~256 entries by chaining cross-peer overwrites, bounded only by the global capacity.
5. Block X arrives. drainDeferredAnnouncesFor(X) take()s the attacker entry, re-runs verifySignedAnnouncement (still structurally valid → attacker addr), and isScheduledProducer now returns (false, true) since the header is local and attacker != producer. The entry is dropped without promotion (signedWitnesses is never written). The honest P1 entry is gone, overwritten in step 3.
6. lookupSignedWitnessHash(X) returns (_, false). witnessManager.verifyAgainstSignedHash (eth/fetcher/witness_manager_wit2.go) sees no signed hash on file, returns body=nil for the WIT1 path, and cacheVerifiedWitnessForServing no-ops. Subsequent serving peers get bytes accepted without verification against any BP commitment for block X.
Impact
- Defeats the byte-correctness trust property the PR explicitly markets as the WIT2 improvement (bytes verified against the BP-signed
WitnessHashpre-execute). - Defeats the explicit DoS bound the function comment block at lines 117-126 commits to ("no single peer can monopolise the cache and evict honest header-racing announces").
- Within budget for any peer on the gossip mesh — no producer key needed, no strikes accumulate.
- Recoverable in dense honest meshes via re-gossip after import, but fails in sparse / single-route topologies (the exact scenario the PR description warns about for stateless validators behind a single relay).
- Not catastrophic on its own — state-root validation at import still catches bad content, and the failure mode is graceful degradation to WIT1 byte handling. But it silently negates one of the headline security improvements of the PR.
Fix
Mirror the sibling cache policy and enforce the per-peer cap on the overwrite branch:
```go
if existing, exists := c.entries[ann.BlockHash]; exists {
	if existing.announcement.WitnessHash != ann.WitnessHash {
		wit2ConflictingWitnessHashMeter.Mark(1)
		return
	}
	if existing.peerID != peerID {
		if c.perPeer[peerID] >= c.perPeerCap {
			wit2DeferredPerPeerDropMeter.Mark(1)
			return
		}
		c.decPeerLocked(existing.peerID)
		c.perPeer[peerID]++
	}
	c.entries[ann.BlockHash] = &deferredAnnounceEntry{ ... }
	return
}
```

The two guards compose cleanly: the first kills the byte-correctness downgrade vector (defect 1), the second kills the per-peer monopolization vector (defect 2). Both fixes are symmetric with how the sibling signedWitnessCache.putIfNewer and the new-hash branch already handle the same conflicts.
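To make the two guards concrete, here is a minimal, self-contained sketch of an overwrite-aware put. It is not the PR's code: `Hash`, `announcement`, `deferredCache`, and the `conflicts` / `capDrops` counters are simplified stand-ins for `common.Hash`, `wit.SignedWitnessAnnouncement`, `deferredAnnounceCache`, and the two meters, and the global capacity, TTL, and locking are left out.

```go
package main

import "fmt"

// Hash and announcement are simplified stand-ins (assumptions for this
// sketch) for common.Hash and wit.SignedWitnessAnnouncement.
type Hash [32]byte

type announcement struct {
	BlockHash   Hash
	WitnessHash Hash
}

type entry struct {
	ann    announcement
	peerID string
}

// deferredCache is a hypothetical, lock-free reduction of the deferred cache.
type deferredCache struct {
	entries    map[Hash]*entry
	perPeer    map[string]int
	perPeerCap int
	conflicts  int // stand-in for wit2ConflictingWitnessHashMeter
	capDrops   int // stand-in for wit2DeferredPerPeerDropMeter
}

func newDeferredCache(perPeerCap int) *deferredCache {
	return &deferredCache{
		entries:    make(map[Hash]*entry),
		perPeer:    make(map[string]int),
		perPeerCap: perPeerCap,
	}
}

// put reports whether the announcement was stored.
func (c *deferredCache) put(ann announcement, peerID string) bool {
	if existing, ok := c.entries[ann.BlockHash]; ok {
		// Guard 1: never let a different WitnessHash replace a cached one.
		if existing.ann.WitnessHash != ann.WitnessHash {
			c.conflicts++
			return false
		}
		// Guard 2: moving credit to a new peer still respects that peer's cap.
		if existing.peerID != peerID {
			if c.perPeer[peerID] >= c.perPeerCap {
				c.capDrops++
				return false
			}
			c.decPeer(existing.peerID)
			c.perPeer[peerID]++
			existing.peerID = peerID
		}
		existing.ann = ann
		return true
	}
	// New block hash: enforce the per-peer share before taking a slot.
	if c.perPeer[peerID] >= c.perPeerCap {
		c.capDrops++
		return false
	}
	c.entries[ann.BlockHash] = &entry{ann: ann, peerID: peerID}
	c.perPeer[peerID]++
	return true
}

func (c *deferredCache) decPeer(id string) {
	if c.perPeer[id] <= 1 {
		delete(c.perPeer, id)
		return
	}
	c.perPeer[id]--
}

func main() {
	c := newDeferredCache(1)
	x := Hash{1}
	fmt.Println(c.put(announcement{x, Hash{0xAA}}, "P1")) // honest announce
	fmt.Println(c.put(announcement{x, Hash{0xBB}}, "P2")) // forged WitnessHash
	fmt.Println(c.entries[x].peerID)                      // honest entry survives
}
```

With both guards in place, the attacker in the proof above can neither replace the honest entry nor accumulate credit past its cap through overwrites.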
| func (c *pendingWitnessBodyCache) put(blockHash common.Hash, bytes []byte, witnessHash common.Hash) { | ||
| c.mu.Lock() | ||
| defer c.mu.Unlock() | ||
| c.gcLocked() | ||
| if len(c.entries) >= c.capacity { | ||
| // Evict the oldest entry. Linear scan is fine at the configured cap. | ||
| var oldestHash common.Hash | ||
| var oldest time.Time | ||
| for h, e := range c.entries { | ||
| if oldest.IsZero() || e.receivedAt.Before(oldest) { | ||
| oldest = e.receivedAt | ||
| oldestHash = h | ||
| } | ||
| } | ||
| delete(c.entries, oldestHash) | ||
| } | ||
| c.entries[blockHash] = &pendingWitnessBody{ | ||
| bytes: bytes, | ||
| witnessHash: witnessHash, | ||
| receivedAt: time.Now(), | ||
| } | ||
| } |
🟡 Minor: pendingWitnessBodyCache.put at eth/handler_wit2_bodies.go:62-83 runs the capacity-driven eviction unconditionally, even when the put is an overwrite for a key already in the cache. When the cache is full (capacity = 10) and a second put arrives for an already-present blockHash — which can happen when cacheVerifiedWitnessForServing (manager fetch success) and acceptSignedBroadcast (handler_wit.go:153, on a near-simultaneous broadcast push) both fire for the same block — the eviction loop drops an unrelated oldest entry and then overwrites the existing slot in place, silently shrinking the live cache to 9 entries. The sibling deferredAnnounceCache.put already handles this correctly (early return on existing-key without consulting capacity). Fix: gate the eviction block on if _, exists := c.entries[blockHash]; !exists so it only fires for genuinely new keys. Functionally minor — the cache is a soft serving hint and self-corrects on the next put — but the asymmetry with the deferred cache reads as oversight and the fix is one line.
Extended reasoning...
What the bug is
pendingWitnessBodyCache.put (eth/handler_wit2_bodies.go:62-83) gates eviction on len(c.entries) >= c.capacity alone, without checking whether blockHash is already in the map. When the cache is full and the incoming put is an overwrite for an existing key, the eviction loop fires anyway: it picks the oldest entry across all hashes, deletes it, and then c.entries[blockHash] = ... simply replaces the existing slot. Net effect: one unrelated live entry was dropped to make room for an entry that didn't actually need a new slot.
The sibling deferredAnnounceCache.put (eth/handler_wit2_announces.go:138-156) handles the same shape correctly — it early-returns on the existing-key branch before consulting the capacity check, since an overwrite is net-zero for slot count. The asymmetry between the two caches is what reads as an oversight.
Why it can manifest in practice
Two distinct call sites can drive pendingWitnessBodies.put for the same blockHash within the same TTL window:
- witnessManager.cacheVerifiedWitnessForServing → handler.cacheVerifiedWitnessForServing → pendingWitnessBodies.put (eth/handler_wit2.go), fired when the fetcher's paged-pull completes and the bytes verify against the BP-signed WitnessHash.
- witHandler.acceptSignedBroadcast (eth/handler_wit.go:153), fired when a NewWitness broadcast arrives for the same block and the bytes verify against the same signed commitment.
The race window is real on the WIT2 fast path: a stateless consumer can complete its paged fetch and also receive an unsolicited broadcast for the same block from a different peer in the cosend storm, in either order. The second put for that block becomes an overwrite at capacity.
Step-by-step proof
Capacity is 10 (witnessBodyCacheCapacity). Steady state: blocks A1..A10 in the cache, A2 the oldest.
1. put(A1, newBytes, newHash) is called (second arrival for A1: fetch already cached it, now a broadcast push arrives, or vice versa).
2. gcLocked() runs; assume nothing expired.
3. len(c.entries) == 10 >= c.capacity evaluates true.
4. The eviction loop scans c.entries, finds A2 as the oldest, and runs delete(c.entries, A2).
5. c.entries[A1] = &pendingWitnessBody{...} replaces A1's existing slot in place.
6. Net: 9 entries; A2 was dropped despite there being no need to make room.
A peer that asks us for A2's body during the pre-import window now gets an empty GetWitness response and has to fall back to chain-storage lookup post-import, instead of being served from the in-flight cache.
Why existing code doesn't prevent it
The check is purely on len(c.entries). There is no _, exists := c.entries[blockHash] branch like the one in deferredAnnounceCache.put at handler_wit2_announces.go:138-156:
```go
if existing, exists := c.entries[ann.BlockHash]; exists {
	// Overwrite for the same hash: net-zero slot change. ...
	return
}
// only then enforce caps + evict
```

pendingWitnessBodyCache.put skipped that idiom.
Impact
Functionally minor:
- The cache is a soft serving optimization, not authoritative.
- The evicted entry was the oldest, i.e. lowest serving value remaining.
- On the next cacheVerifiedWitnessForServing (manager re-cache after another fetch) the entry returns.
- A peer that wanted the evicted block falls back to GetWitness on chain storage (post-import) or backs off and retries (pre-import).
No correctness implications, no DoS amplification, no peer-discipline interaction. The cost is a small reduction in the pre-import serving window for one unrelated block, per occurrence.
Fix
One line at the top of put:
```go
func (c *pendingWitnessBodyCache) put(blockHash common.Hash, bytes []byte, witnessHash common.Hash) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.gcLocked()
	if _, exists := c.entries[blockHash]; !exists && len(c.entries) >= c.capacity {
		// ... existing eviction loop ...
	}
	c.entries[blockHash] = &pendingWitnessBody{ ... }
}
```

Mirrors the existing deferredAnnounceCache.put overwrite branch and brings the two caches into structural alignment.
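The effect of the one-line guard can be checked in isolation. The sketch below is a reduced model, not the PR's type: `bodyCache`, `newBodyCache`, and `fakeClock` are hypothetical names, the mutex and TTL garbage collection are omitted, and the clock is injected only so the eviction order is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

type Hash [32]byte

type body struct {
	bytes      []byte
	receivedAt time.Time
}

// bodyCache is a hypothetical reduction of pendingWitnessBodyCache.
type bodyCache struct {
	entries  map[Hash]*body
	capacity int
	now      func() time.Time // injected clock (assumption, for determinism)
}

func newBodyCache(capacity int, now func() time.Time) *bodyCache {
	return &bodyCache{entries: make(map[Hash]*body), capacity: capacity, now: now}
}

// fakeClock returns a clock that advances one second per call.
func fakeClock() func() time.Time {
	t := time.Unix(0, 0)
	return func() time.Time {
		t = t.Add(time.Second)
		return t
	}
}

func (c *bodyCache) put(h Hash, b []byte) {
	// Evict only when the key is new; an overwrite reuses its own slot.
	if _, exists := c.entries[h]; !exists && len(c.entries) >= c.capacity {
		var oldestHash Hash
		var oldest time.Time
		for k, e := range c.entries {
			if oldest.IsZero() || e.receivedAt.Before(oldest) {
				oldest, oldestHash = e.receivedAt, k
			}
		}
		delete(c.entries, oldestHash)
	}
	c.entries[h] = &body{bytes: b, receivedAt: c.now()}
}

func main() {
	c := newBodyCache(3, fakeClock())
	for i := byte(1); i <= 3; i++ {
		c.put(Hash{i}, []byte{i})
	}
	c.put(Hash{2}, []byte{9}) // overwrite at capacity: nothing is evicted
	fmt.Println(len(c.entries))
	c.put(Hash{4}, []byte{4}) // new key at capacity: oldest entry is evicted
	_, stillThere := c.entries[Hash{1}]
	fmt.Println(len(c.entries), stillThere)
}
```

Without the `!exists` condition, the overwrite of `Hash{2}` would evict `Hash{1}` and leave the cache one entry short, which is the behavior described in the step-by-step proof.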
| // deferredAnnouncesLoop re-evaluates any deferred WIT2 announcements whose | ||
| // matching block has just been imported. Exits cleanly when the chain-head | ||
| // subscription returns (chain stop) or quitSync is closed. | ||
| func (h *handler) deferredAnnouncesLoop() { | ||
| defer h.wg.Done() | ||
| defer h.wit2HeadSub.Unsubscribe() | ||
|
|
||
| for { | ||
| select { | ||
| case ev, ok := <-h.wit2HeadCh: | ||
| if !ok { | ||
| return | ||
| } | ||
| if ev.Header != nil { | ||
| h.drainDeferredAnnouncesFor(ev.Header.Hash()) | ||
| h.flushWitnessWaitersForImported(ev.Header.Hash()) | ||
| } | ||
| case <-h.wit2HeadSub.Err(): | ||
| return | ||
| case <-h.quitSync: |
🟡 Deferred WIT2 announces for intermediate blocks of a batched insertChain (e.g. downloader catch-up) are never drained: ChainHeadEvent fires only once per batch with lastCanon, and deferredAnnouncesLoop only drains the head's hash, so announces for blocks N+1..N+k-1 silently expire after the 30s TTL. For those blocks the byte-correctness trust property quietly degrades to unsigned WIT1 handling and the transitive relay is lost. Fix: walk the deferred queue against any header that now resolves locally, or subscribe to per-block ChainEvent.
Extended reasoning...
The defect
eth/handler.go:704-724 — deferredAnnouncesLoop listens to ChainHeadEvent and calls drainDeferredAnnouncesFor(ev.Header.Hash()). drainDeferredAnnouncesFor (eth/handler_wit2.go:340) is a strict hash-keyed take() on the deferred map — it never walks the queue.
core/blockchain.go:3065-3068 fires ChainHeadEvent only once per insertChain batch with lastCanon.Header(). The inline comment at line 2589-2594 is explicit about this: 'Avoid firing too many ChainHeadEvents, we will fire an accumulated ChainHeadEvent and disable fire event here.' The inner writeBlockAndSetHead calls at blockchain.go:3409 and :3613 are made with emitHeadEvent=false.
Net: when the downloader imports blocks N+1..N+k in one batch, deferred announces for intermediate hashes h_{N+1}..h_{N+k-1} sit unmatched in the deferred cache and expire silently after wit2AnnounceTTL (30s).
Consequence
For those intermediate blocks:
- signedWitnesses is never populated, so lookupSignedWitnessHash() returns false.
- verifyAgainstSignedHash short-circuits when no signed hash is on file (witness_manager_wit2.go), skipping the keccak check against the BP-signed commitment. The 'byte-correctness verified at the receiver against the BP-signed witnessHash' property the PR explicitly introduces silently regresses to unsigned-WIT1 byte handling.
- The transitive relay (relaySignedAnnouncement, only fired from a successful drain) doesn't run, so downstream WIT2 peers never receive these signed announces from this relayer.
Step-by-step proof
1. Peer P relays signed announces for blocks N+1, N+2, N+3 ahead of the matching headers (cosend race).
2. Each is rejected by isScheduledProducer with headerAvailable=false and stashed in deferredAnnounces (handler_wit.go acceptSignedAnnouncement).
3. The downloader catches us up by inserting all three as one batch via insertChain (blockchain.go:3065 deferredChainHeadEvent.Send(...lastCanon...)).
4. deferredAnnouncesLoop runs drainDeferredAnnouncesFor(h_{N+3}): only the last hash.
5. deferredAnnounces.take(h_{N+1}) and take(h_{N+2}) are never called.
6. 30s later, gcLocked evicts the two intermediate entries on the next put.
7. A stateless peer asks us for the witness of N+1 or N+2; signedWitnesses.get misses and verifyAgainstSignedHash returns (nil, _, true) (skip path). The byte verification the PR adds is silently bypassed.
Why existing code doesn't prevent it
The stateless paths (insertChainStatelessParallel at blockchain.go:2866,2896, insertChainStatelessSequential at :3002) DO pass emitHeadEvent=true, so stateless-validator imports fire per-block events and are unaffected. The block fetcher's per-block f.insertChain(types.Blocks{block}, ...) also produces single-block batches that work correctly. Only the regular full-node downloader path is affected — and the PR's deployment recommendation explicitly says to deploy 'sentries/RPC/archive first', i.e. exactly the nodes that exercise this path during reconnection catch-up.
Mitigating factors (why I'm filing this as nit)
- The block fetcher's steady-state path inserts one block at a time, so normal gossip is unaffected.
- A node doing downloader catch-up is itself behind and not serving downstream peers, so the lost pre-import serving cache for intermediate blocks has no real downstream impact in that window.
- Behavior gracefully degrades to WIT1 byte-verification, which is the pre-WIT2 baseline. No security regression — execution still validates content via state-root.
- Bounded by the 30s TTL and per-peer cap, so memory is bounded.
Suggested fix
Either:
- In deferredAnnouncesLoop, after drainDeferredAnnouncesFor(ev.Header.Hash()), walk deferredAnnounces.entries and drain any whose blockHash now resolves to a local header. The set is bounded (capacity 256), so a linear pass on each head event is cheap.
- Or subscribe to ChainEvent (per-block, including reorged blocks) via chainFeed.Send(ChainEvent{...}) at blockchain.go:2576, which fires per-block in writeBlockAndSetHead for both stateless and regular paths.
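The first option amounts to a linear pass over the deferred set on every head event. A rough sketch of that pass follows, under stated assumptions: `deferredQueue`, `drainResolvable`, and the `hasHeader` callback are hypothetical names, the payload is reduced to a string, and locking is omitted.

```go
package main

import "fmt"

type Hash [32]byte

// deferredQueue is a hypothetical stand-in for the deferred announce cache;
// the string value represents the stored signed announcement.
type deferredQueue struct {
	entries map[Hash]string
}

// drainResolvable removes and returns every deferred block hash whose header
// is now known locally. hasHeader is an assumed lookup into the local chain.
func (q *deferredQueue) drainResolvable(hasHeader func(Hash) bool) []Hash {
	var drained []Hash
	for h := range q.entries {
		if hasHeader(h) {
			drained = append(drained, h)
			delete(q.entries, h) // deleting during range is safe in Go
		}
	}
	return drained
}

func main() {
	n1, n2, n3, future := Hash{1}, Hash{2}, Hash{3}, Hash{9}
	q := &deferredQueue{entries: map[Hash]string{
		n1: "ann N+1", n2: "ann N+2", n3: "ann N+3", future: "ann N+9",
	}}

	// One batched import made N+1..N+3 local; only N+3 fired a head event.
	local := map[Hash]bool{n1: true, n2: true, n3: true}
	drained := q.drainResolvable(func(h Hash) bool { return local[h] })

	fmt.Println(len(drained), len(q.entries))
}
```

Each drained hash would then go through the same verify, promote, and relay steps that the single-hash drain runs today; the announce for the still-unknown block stays deferred.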
Summary
Adds WIT2 (witness protocol version 3): block producers sign a commitment over each witness, peers verify the signature and relay the announce at network-RTT speed without executing the block, and any peer that has fetched the body can serve it pre-import from an in-memory cache. The slow part of witness propagation — re-execution before relay — is removed from the critical path. Mixed mesh with WIT1 nodes is tolerated; no flag-day rollout required.
Devnet result (4 scenarios, post-fork-only window, hop-chain topology with +300 ms per-hop import knob):
What we're solving
Today on Polygon mainnet, witness propagation through a stateless validator that is multiple hops away from a block producer accumulates a per-hop ~500 ms execution gate: each intermediate node must finish executing the block before it will relay the witness downstream. This serialises along the path and shows up at the receiver as milestone-voting latency — slow milestone votes on a fraction of blocks at multi-hop stateless validators. Adding more peers does not help; the chain of dependencies is fan-in × execution time.
The deliverable is to detach announce from execute so witness availability propagates at gossip speed, while keeping the same byte-correctness guarantee (hash check at the requester, with on-chain blame) and the same content-correctness guarantee (state-root, with BP blame).
How the code achieves it
1. BP-signed witness commitment
The producer needs to commit to which witness bytes are correct without paying ~88 ms of single-thread keccak on the announce path (otherwise we re-introduce the same gate we're trying to remove, just on a different node). See Signing-scheme evaluation below — short version: chunked-parallel keccak at 1 MiB chunks beats the next-best viable candidate by a clear margin and keeps the WIT1 wire format intact.
- core/stateless/witness_commit.go::WitnessCommitHash(bytes) = keccak256(concat(per-1MiB-chunk-keccak)). Each 1 MiB chunk is hashed in parallel; the final aggregate is one extra keccak over the concatenated 32-byte chunk hashes (about 1.6 KiB at 50 MiB). ~13.5 ms wall-clock for 50 MiB witnesses on 8 cores vs ~88 ms single-shot keccak: a 6.5× speedup with no wire-format change. Producer and verifier agree on the chunk size as a protocol constant.
- consensus/bor.SignBytes reusing the engine's SignerFn, with a dedicated mimetype application/x-bor-wit2-announce and a domain-separated digest tag; replay-resistant at both the digest and signer-call levels.

2. Verify-and-relay without execution
- WIT2 = 3 (eth/protocols/wit/protocol.go), new message SignedNewWitnessHashesMsg = 0x06 carrying up to 64 announcements per packet.
- eth/handler_wit2.go::handleSignedWitnessAnnouncements does ecrecover against the scheduled producer for the announced block; on success the announce is cached and immediately relayed to peers that have not seen this hash. No state execution is touched.

3. Pre-import serving cache
pendingWitnessBodies (capacity 10) in the WIT2 handler is fed from the paged-fetch path the moment byte-correctness verification against the BP-signed WitnessHash passes, i.e. before chain write. handleGetWitness consults this cache before chain storage, so a peer that just received the body can serve it to a downstream stateless node before it has finished executing.

4. Blame model preserved
- WitnessHash; failure attaches to the server that returned the bytes.
- WitnessHash for the same BlockHash is rejected via signedWitnessCache.putIfNewer, so a peer cannot equivocate witnesses across announcements.

5. Rate-limits & DoS shape
6. Compatibility
- NewWitnessHashes. Mixed WIT1/WIT2 mesh is tolerated: WIT2 nodes downgrade to WIT1 wire when peering with WIT1 peers (relay handler skips peers with Version() < wit.WIT2).
- WitnessHash field on WitnessMetadataResponse is set by WIT2 servers and ignored by WIT1 readers: wire forward-compatible.

Signing-scheme evaluation
Picking the right commitment function for the announce signature is load-bearing for the whole PR: too slow on the producer and we just move the per-hop gate from "execute the block" to "hash the witness"; too weak and we lose the byte-blame property that lets a downstream node disconnect a peer that returned tampered bytes. Four candidates were evaluated end-to-end on synthetic 1–50 MiB witnesses (Apple M4 Pro, Go 1.26.2, go test -benchtime=3s -count=3, median of three).

Candidates
- keccak256(canonical_RLP(witness)), single-thread
- keccak(chunk0_hash ‖ … ‖ chunkN_hash), chunks hashed concurrently
- header.StateRoot to detect bad bytes

Result at 50 MiB: verifier wall-clock (best parallel config)

| D (intrinsic, 4 cores) | 44 ms | 2.0× | 0 ms |

Why D was rejected post-bench
D had the most attractive numbers (zero producer cost, 2× verifier speedup, no signature on the announce path), but a peer can serve a truncated witness whose included nodes all hash consistently up to the BP-signed header.StateRoot. Branch nodes embed child references as 32-byte hashes inside their own bytes, so dropping a subtree leaves the parent branch nodes' hashes unchanged. The intrinsic walker has no way to distinguish "this hash-reference belongs to a path that was never touched and is intentionally absent" from "this hash-reference belongs to a path that was touched and was adversarially omitted"; only attempting execution would. That destroys pre-execute byte-blame, which is the whole reason WIT2 introduced a content commitment in the first place. A/B/C all preserve byte-blame because they sign over content: truncation changes the commitment, the signature no longer matches, and the peer is dropped pre-execute.

Why B at 1 MiB chunks won
A chunk-size sweep at 50 MiB / 8 cores:
512 KiB shaves a tenth of a ms over 1 MiB at the cost of doubling the chunk count and the per-chunk overhead — 1 MiB is the knee of the curve. Below 512 KiB, per-chunk setup starts dominating. The 4 GB/s ceiling is the M4 Pro's aggregate keccak throughput across 8 P-cores; further parallelism doesn't help with the current keccak primitive.
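For readers who want to see the shape of candidate B, the sketch below hashes fixed-size chunks on a worker pool and then hashes the concatenated chunk digests. It is only an illustration under loud assumptions: SHA-256 from the standard library is swapped in for keccak256 so the example runs without external modules, `commitHash` is a hypothetical name, and the real WitnessCommitHash may treat edge cases such as empty input differently.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sync"
)

// chunkSize mirrors the 1 MiB protocol constant named in the PR text.
const chunkSize = 1 << 20

// commitHash returns H(H(chunk0) || ... || H(chunkN)).
// NOTE: SHA-256 stands in for keccak256 here; the recipe shape is the point.
func commitHash(data []byte, workers int) [32]byte {
	n := (len(data) + chunkSize - 1) / chunkSize
	digests := make([]byte, n*sha256.Size)
	if workers < 1 {
		workers = 1
	}

	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				start := i * chunkSize
				end := start + chunkSize
				if end > len(data) {
					end = len(data)
				}
				sum := sha256.Sum256(data[start:end])
				// Each worker writes a disjoint 32-byte slot, so no lock is needed.
				copy(digests[i*sha256.Size:], sum[:])
			}
		}()
	}
	for i := 0; i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	// One extra hash over the ordered chunk digests.
	return sha256.Sum256(digests)
}

func main() {
	data := bytes.Repeat([]byte{0xAB}, 3*chunkSize+123)

	serial := commitHash(data, 1)
	parallel := commitHash(data, 8)
	truncated := commitHash(data[:len(data)-1], 8)

	fmt.Println(serial == parallel)  // worker count does not change the result
	fmt.Println(serial == truncated) // dropping a byte changes the commitment
}
```

Because every chunk digest lands in a slot determined by its index, the result depends only on the bytes and the chunk size, not on scheduling, which is why producer and verifier must agree on the chunk size.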
Verifier-side scaling — B beats A non-trivially only ≥ 30 MiB
For the small witnesses Polygon emits today (typically 1–10 MiB) B is comparable to A; for the large witnesses we already see at the upper tail (30–50 MiB) B is the difference between the producer/verifier paying a ~90 ms gate vs ~14 ms. The fix is most impactful exactly where the problem is worst.
Why not C
C is dominated by every other viable candidate on these numbers: slower verifier than A (122 ms vs 88 ms), 91 MiB / 614 k allocations per verify at 50 MiB, no wire saving. C only becomes interesting if a future design needs sub-witness proofs (proving a specific node belongs to the committed set without sending the full body) — that's not on the roadmap, so C is a no-vote here.
Sensitivity caveats
Full bench artifact (raw numbers, reproduction commands, allocation breakdown): agent-zero/investigations/witness-propagation/witness-commit-bench.md.

Local devnet validation
A 9-node hop-chain devnet on kurtosis-pos: 4 BPs full-mesh, two relay full-nodes (F1/F2) carrying a +300 ms per-hop import-delay knob to amplify the gate without heavy tx loads, and three stateless validators at hop distances 1 / 2 / 3 from the closest BP (S1 ↔ BP1, S2 ↔ F1, S3 ↔ F2). Topology was enforced post-launch via admin_removePeer after every node imported past Giugliano (block 128 + 72-block settle), so the measurement window is post-fork and post-prune only; pre-fork blocks (different code path) are excluded.

Four scenarios, ~30 measured blocks each:

- bor:develop (control)
- bor:wit2
- bor:wit2, rest = bor:develop
- bor:wit2, rest = bor:develop

F2 import-lag (the relay just before S3) shows the mechanism: median drops 805 → 305 ms in scenario 2, one full per-hop inject overlapped with WIT2 announcement-driven pre-fetch, exactly what the design predicts.

S3's residual p95 of 260 ms in scenario 2 is the single +300 ms inject on F2 still in the critical path: WIT2 lets the F1 hop overlap, but F2 still has to receive and execute the block before serving S3. Without the artificial knob (i.e., on mainnet), the natural per-hop gate is ~50–100 ms and this residual shrinks proportionally.
Full report (per-scenario logs, lag tables, errors/warnings, peer-count snapshots, prune timestamps, image map): agent-zero/investigations/witness-propagation/devnet-validation-2026-04-30b.md.

Re-validation at the current head (2026-06-09, commit 8c16b39): the s2 all-WIT2 scenario was re-run at the convergence-review head with per-node Prometheus capture, which surfaced a defect all code review rounds missed: any node holding a signer key signed announces for foreign blocks, causing honest validators to strike each other. Fixed by binding the sign path to the scheduled producer (maySignAnnouncementForBlock, regression-tested) with a truthful WIT1 hash-announce fallback for non-producers. Post-fix: strikes 0 across all validators, RX_SIGNED +20%, latency unchanged. A Phase-2 large-witness scenario (~18.9 MB witnesses, above the 16 MiB push/gossip cap) verified the multi-page pull path end-to-end: all stateless nodes imported every padded block in lockstep.

Backward compatibility: explicit checks
- pendingWitnessBodies skipped when no signed WitnessHash on file: eth/handler_wit2.go::resolveWitnessBytes
- WitnessHash field on WitnessMetadataResponse ignored by WIT1 readers
- consensus/bor.SignBytes: consensus/bor/signbytes_test.go

Deployment & rollout safety: why a mixed mesh has no downside
Rollout is fully incremental: no flag day, no coordinated upgrade, no config change. The wit protocol negotiates per-connection at handshake (ProtocolVersions = [WIT2, WIT1, WIT0], highest mutual version wins), so one node simultaneously speaks WIT2 to upgraded peers and WIT1 to old peers. The new message (SignedNewWitnessHashesMsg = 0x06) is outside WIT1's message range and every send site is gated on peer.Version() >= WIT2, so an old node can never receive a frame it doesn't understand. The witness body transfer (paged GetWitness pull) is byte-identical in both versions.

Per-peer announce split (producer side). When a producer (or any node holding the witness) announces, each recipient gets the variant its negotiated version supports: WIT2 peers the signed announcement, WIT1/WIT0 peers the legacy unsigned NewWitnessHashes. Both are truthful (every announce path is gated on HasWitness) and both lead to the same paged pull. Old peers observe exactly the protocol they ran before this PR. The signature is computed once per produced block (cached; ~14 ms keccak + ~100 µs ECDSA) and the same announcement struct is reused across all peers, both broadcast phases, and every relay hop; relayers forward the producer's signature verbatim and cannot re-sign.

WIT2 → WIT1 boundary. A node that learns of a witness via a signed announce does not translate it for WIT1 peers at receive time (the transitive relay deliberately skips peers below WIT2; at that point the node only knows a hash, and the WIT1 announce contract is "I have this witness"). The WIT1 peer is served on the node's post-import announce instead, and can even pull the body from the WIT2 node's in-flight cache pre-import. Net effect: WIT1 peers keep today's latency profile exactly; WIT2's pre-import fast path simply doesn't extend through them.
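The negotiation and per-peer split described above reduce to two small decisions, sketched here in isolation. Only WIT2 = 3 and the two message names come from the PR text; the numeric values for WIT1 and WIT0, the `peer` struct, and the functions `negotiate`, `announceKind`, and `relayTargets` are assumptions made for this illustration.

```go
package main

import "fmt"

// WIT2 = 3 is stated in the PR; the WIT1 and WIT0 values are assumed here.
const (
	WIT0 uint = 1
	WIT1 uint = 2
	WIT2 uint = 3
)

// negotiate returns the highest version both sides list, or 0 if none match.
func negotiate(local, remote []uint) uint {
	var best uint
	for _, l := range local {
		for _, r := range remote {
			if l == r && l > best {
				best = l
			}
		}
	}
	return best
}

// peer is a hypothetical, minimal view of a connected wit peer.
type peer struct {
	id            string
	version       uint
	knowsAnnounce bool
}

// announceKind picks the announce variant the recipient can decode.
func announceKind(p peer) string {
	if p.version >= WIT2 {
		return "SignedNewWitnessHashes"
	}
	return "NewWitnessHashes"
}

// relayTargets lists peers eligible for the transitive relay of a signed
// announce: WIT2-capable peers that have not already seen it.
func relayTargets(peers []peer) []string {
	var out []string
	for _, p := range peers {
		if p.version >= WIT2 && !p.knowsAnnounce {
			out = append(out, p.id)
		}
	}
	return out
}

func main() {
	local := []uint{WIT2, WIT1, WIT0}
	peers := []peer{
		{id: "upgraded", version: negotiate(local, []uint{WIT2, WIT1, WIT0})},
		{id: "legacy", version: negotiate(local, []uint{WIT1, WIT0})},
	}
	for _, p := range peers {
		fmt.Println(p.id, p.version, announceKind(p))
	}
	fmt.Println(relayTargets(peers))
}
```

The legacy peer never appears in the relay target list, which matches the boundary rule: it is announced to only on the producer's or the post-import path, with the unsigned variant.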
No cross-version discipline risk. Strike-disconnect only triggers on signed announcements (bad signature / signer ≠ scheduled producer). WIT1 peers physically cannot send those, so an old peer can never be struck and never strikes anyone. On the WIT2 side, only the block's sealer signs announcements (producer binding on the sign path, maySignAnnouncementForBlock), an announce racing ahead of its header is deferred rather than punished, and rate limiting drops packets without striking; this is devnet-verified at zero strikes across all honest topologies.

Benefits track validator adoption; the upgrade is inert until then. No signed announces exist until block producers run the new version, so a WIT2 full/RPC node in an old mesh behaves byte-for-byte like a WIT1 node. Deploying sentries/RPC/archive first is safe and changes nothing; each validator upgrade then switches on the latency win for its blocks. The worst case anywhere in a mixed mesh is today's propagation behavior, never worse: a WIT2 island reachable only through WIT1 hops degrades to the status quo (measured: scenarios 3/4 above, zero errors, no peer drops).

Suggested rollout order. Upgrade a validator together with its sentries as a unit: the signed announce dies at the first WIT1 hop, so a validator behind old sentries gets no benefit past hop 1. On canaries watch the eth_wit2_* meters: strike_disconnect, invalid_sig, not_validator should be ~0 in an honest mesh (non-zero = a buggy/malicious peer being handled); header_unknown and deferred counts are non-zero by design (announce/block races absorbed by the deferral queue).

No hardfork gate. WIT2 negotiates immediately at handshake; what gates real traffic is witness production/consumption, which is already live. The PR changes announcement transport and serving, not witness semantics, block validity, or consensus.
Test plan
- core/stateless/witness_commit_test.go, witness_commit_bench_test.go, consensus/bor/signbytes_test.go, eth/handler_wit2_test.go, eth/handler_wit_test.go, eth/peerset_test.go, eth/protocols/wit/protocol_wit2_test.go, eth/fetcher/witness_manager_wit2_test.go.
- pendingWitnessBodies. We don't expect more than a few in-flight unique witnesses at a time, but worth a second opinion under burst conditions.
- application/x-bor-wit2-announce.

Diffguard / quality-gate notes
- eth/handler_wit2.go split into three files (handler_wit2.go ~410 lines, handler_wit2_bodies.go, handler_wit2_announces.go), all under the 500-line threshold.
- handleWitnessBroadcast refactored into three accept-path helpers (was 96 lines / complexity 17, now within thresholds); deferredAnnounceCache.put eviction extracted (complexity 13 → under 10).
- WitnessCommitHash complexity (was 18) resolved by extracting the worker-pool fan-out into hashWitnessChunks / witnessCommitWorkerCount; commitment recipe unchanged (pinned by the shape tests).
- witness_manager.go growth (+161 vs base, over the oversized-file delta tolerance) resolved by moving the PR-added WIT2 code (byte-verification, pre-import serving handoff, empty-response backoff) into eth/fetcher/witness_manager_wit2.go.
- newHandler, BroadcastBlock, handleGetWitness, file sizes of handler.go / bor.go / block_fetcher.go / peer.go) plus one accepted item: (Peer).broadcastWitness complexity 12, an upstream-shaped select loop; this PR adds one case.
- govulncheck failures are unrelated to this PR: two Go 1.26.3 standard-library vulns (GO-2026-5039 net/textproto, GO-2026-5037 crypto/x509, both fixed in go1.26.4) reached through pre-existing code (node/endpoints.go, ethstats, internal/cli); develop's latest Govuln run fails identically. Needs a repo-wide toolchain bump, separate from this PR.