[BEAM] Fix Python VarIntCoder OverflowError on uint64 values #39047

Draft
AviKndr wants to merge 1 commit into apache:master from AviKndr:fix-varintcoder-uint64

AviKndr commented Jun 20, 2026

What

VarIntCoder raises OverflowError when handed a Python int in the unsigned 64-bit range [2**63, 2**64) (a uint64), even though such a value has a well-defined VarInt encoding.

Why

In the Cython build, the stream methods take a signed int64_t:

cpdef write_var_int64(self, libc.stdint.int64_t signed_v):
    cdef libc.stdint.uint64_t v = signed_v   # body already works on unsigned

The body re-casts to uint64_t and encodes the bit pattern correctly, but Cython converts the incoming Python int to the signed int64_t parameter at the call boundary — before the body runs — so any uint64 value is rejected with OverflowError. The same applies to get_varint_size used by estimate_size.
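The range check described above can be illustrated without Cython. The following is only an analogy, assuming Python's standard `struct` module as a stand-in for the typed call boundary: the `q` format behaves like a signed 64-bit slot and `Q` like an unsigned one. The helper names are hypothetical and do not appear in Beam.

```python
import struct


def fits_signed_int64(value: int) -> bool:
    """Return True if value fits a signed 64-bit slot (like an int64_t parameter)."""
    try:
        struct.pack("<q", value)  # 'q' = signed 64-bit integer
        return True
    except struct.error:
        return False


def fits_unsigned_int64(value: int) -> bool:
    """Return True if value fits an unsigned 64-bit slot (like uint64_t)."""
    try:
        struct.pack("<Q", value)  # 'Q' = unsigned 64-bit integer
        return True
    except struct.error:
        return False


# A uint64 such as 2**63 is rejected by the signed slot but accepted by the
# unsigned one, even though both slots are 8 bytes wide.
largest_uint64 = (1 << 64) - 1
same_bits = struct.pack("<Q", largest_uint64) == struct.pack("<q", -1)
```

Here `same_bits` compares the 8-byte patterns of `2**64 - 1` stored as unsigned and `-1` stored as signed, which is the equivalence the fix relies on.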

Fix

A uint64 value v and the signed int64 v - 2**64 (its two's-complement twin) produce identical VarInt bytes. VarIntCoderImpl now folds uint64 values to that signed twin before they cross into Cython, in both encode_to_stream and estimate_size:

def _as_signed_int64(value):
  # Fold a uint64 in [2**63, 2**64) to the signed int64 with the same bit pattern.
  if (1 << 63) <= value < (1 << 64):
    return int(value) - (1 << 64)
  # Signed-range values pass through; values >= 2**64 are left to overflow downstream.
  return value

  • Wire-compatible: byte-identical to Java's signed VarIntCoder.
  • No regression: normal signed ints pass through untouched; the pure-Python slow_stream path produces the same bytes.
  • Still bounded: values >= 2**64 are left unchanged and still overflow in the Cython path, preserving the coder's 64-bit contract.
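The claim that a uint64 and its signed twin share one encoding can be checked with a small pure-Python sketch. It assumes the usual base-128 VarInt layout (7 payload bits per byte, least-significant group first, high bit set on every byte except the last). `encode_varint64` and `as_signed_int64` are illustrative names, not Beam's stream API, and the mask in the encoder means this sketch does not reproduce the Cython range check for values >= 2**64.

```python
UINT64_MASK = (1 << 64) - 1


def as_signed_int64(value: int) -> int:
    """Mirror of the PR's folding helper, repeated here so the sketch is self-contained."""
    if (1 << 63) <= value < (1 << 64):
        return value - (1 << 64)
    return value


def encode_varint64(value: int) -> bytes:
    """Encode the 64-bit two's-complement pattern of value as a base-128 VarInt."""
    bits = value & UINT64_MASK  # reinterpret as an unsigned 64-bit pattern
    out = bytearray()
    while True:
        chunk = bits & 0x7F
        bits >>= 7
        if bits:
            out.append(chunk | 0x80)  # more groups follow
        else:
            out.append(chunk)  # final group, continuation bit clear
            return bytes(out)


value = (1 << 64) - 1
twin = as_signed_int64(value)  # the signed int64 with the same bits
identical = encode_varint64(value) == encode_varint64(twin)
```

Because both inputs reduce to the same 64-bit pattern before any bytes are emitted, the outputs cannot differ, which is why folding on the Python side leaves the wire format unchanged.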

Note: decoding remains signed (read_var_int64 returns int64_t), matching Java — so the encoding of 2**64 - 1 decodes back to -1. This change removes the crash and guarantees correct wire bytes; it does not make the coder round-trip uint64 back to unsigned (that would be a separate coder).
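The signed decode behaviour in the note can be sketched as follows. This is a minimal pure-Python decoder under the same base-128 layout assumption. `decode_varint64_signed` is a hypothetical name and is not the `read_var_int64` implementation.

```python
UINT64_MASK = (1 << 64) - 1


def decode_varint64_signed(data: bytes) -> int:
    """Decode one base-128 VarInt and interpret the 64-bit result as signed."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:  # continuation bit clear: last byte of the value
            break
    else:
        # Loop ended without seeing a terminating byte (or data was empty).
        raise ValueError("truncated varint")
    result &= UINT64_MASK
    if result >= (1 << 63):
        result -= 1 << 64  # reinterpret the top bit as the sign bit
    return result


# The ten bytes that encode 2**64 - 1 come back as -1 under signed decoding.
decoded = decode_varint64_signed(b"\xff" * 9 + b"\x01")
```

A caller that needs the unsigned value back could apply `decoded & ((1 << 64) - 1)` after decoding, but as the note says, that is outside the scope of this coder.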

Tests

Adds test_varint_coder_uint64 to the shared coders_test_common.py suite (runs both compiled and uncompiled): no-overflow encoding, wire equivalence to the signed twin, size estimation, signed decode semantics, and the still-raising out-of-range case (guarded on the compiled implementation).


🤖 Generated with Claude Code

The Cython write_var_int64/get_varint_size stream methods take a signed
int64_t parameter. A Python int in the unsigned 64-bit range [2**63,
2**64) -- a uint64 -- is converted to that signed parameter at the call
boundary and rejected with an OverflowError before the method body runs,
even though the body already operates on the unsigned bit pattern and the
VarInt wire encoding is well-defined.

Fold such values to the signed int64 with the identical bit pattern (and
thus identical VarInt encoding) before handing them to the stream. This
matches Java's signed VarIntCoder on the wire; decoding remains signed.
Values past 64 bits are left unchanged and still overflow downstream,
preserving the coder's documented 64-bit limit.

Adds test_varint_coder_uint64 covering no-overflow encoding, wire
equivalence to the signed twin, size estimation, signed decode, and the
still-raising out-of-range case (guarded on the compiled implementation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>