fix(community): preserve non-ASCII in BigQuery callback JSON serialization#1827
Open
Humphrey (HumphreySun98) wants to merge 1 commit into
…ation `BigQueryCallbackHandler` (and the langgraph/async variants) build the content for BigQuery JSON columns with bare `json.dumps(...)`. Python's default `ensure_ascii=True` escapes every non-ASCII character to `\uXXXX`, so CJK / emoji / accented text from chain inputs, outputs, documents, tool calls, agent actions, and langgraph attributes lands in storage as escape sequences and is unreadable when inspecting the BigQuery row directly. Pass `ensure_ascii=False` at every `json.dumps` site in `callbacks/bigquery_callback.py` and add unit-test coverage on `_prepare_arrow_batch` asserting that CJK and emoji round-trip into the resulting `pa.RecordBatch`. The convention matches what `langchain-openai`, `langchain-core` (`messages/utils.py:1810`), and our just-shipped genai/vertexai `_parse_response_candidate` fixes (langchain-ai#1804, langchain-ai#1823) already use.
Description
`BigQueryCallbackHandler` (sync, async, and langgraph variants) builds the `content` for BigQuery JSON columns — chain inputs, outputs, retrieved docs, tool calls, agent actions, langgraph attributes — with bare `json.dumps(...)`. Python's default `ensure_ascii=True` escapes every non-ASCII character to `\uXXXX`, so CJK / emoji / accented text in user input lands in BigQuery as escape sequences and is unreadable when the row is inspected directly (which is the whole point of the callback).
Pass `ensure_ascii=False` at every `json.dumps` site in `callbacks/bigquery_callback.py`: the 5 fall-throughs in `_prepare_arrow_batch` (JSON-extension column path + dict-in-string-column fallback), the `part_attributes` serialization for tool-call parts, and every `content=json.dumps(...)` invocation for inputs / outputs / docs / `AgentAction` / `AgentFinish` / langgraph `graph_name` across both the sync and async handlers.
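For reviewers who want to see the escaping in isolation, here is a minimal standard-library sketch. The `payload` dict is a made-up stand-in for callback content; it is not taken from `bigquery_callback.py`.

```python
import json

# Hypothetical payload standing in for chain inputs / tool-call content.
payload = {"input": "你好", "note": "café", "emoji": "🎉"}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(payload)

# With ensure_ascii=False the characters are written through unchanged.
readable = json.dumps(payload, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings are valid JSON and decode to the same object, so the change
# affects how the stored text reads, not what it means.
assert json.loads(escaped) == json.loads(readable) == payload
```

The first `print` shows `\uXXXX` sequences (with a surrogate pair for the emoji), while the second shows the original characters, which is what someone inspecting the row would expect to see.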
This is the same convention already established in `langchain-openai`'s chat model, `langchain-core`'s `messages/utils.py:1810`, and our just-shipped genai/vertexai parsing fixes (#1804, #1823). The BigQuery callback is the last hot serialization path in this package that was still escaping.
Relevant issues
No issue filed for this specific package — surfaced via a repo-wide grep for `json.dumps` without `ensure_ascii` after #1804 landed. Same impact for end users: non-ASCII inputs become `\uXXXX` in storage.
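The audit itself was a plain grep. As an alternative that also handles multi-line calls, a small `ast`-based check could look roughly like the sketch below. The function name `find_dumps_without_ensure_ascii` and the `sample` source are hypothetical, and the check only recognises calls spelled literally as `json.dumps(...)`.

```python
import ast


def find_dumps_without_ensure_ascii(source: str) -> list[int]:
    """Return line numbers of `json.dumps(...)` calls with no ensure_ascii keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Only match the literal spelling `json.dumps`; aliased imports
        # (e.g. `from json import dumps`) are deliberately not covered here.
        is_json_dumps = (
            isinstance(func, ast.Attribute)
            and func.attr == "dumps"
            and isinstance(func.value, ast.Name)
            and func.value.id == "json"
        )
        if not is_json_dumps:
            continue
        keyword_names = {kw.arg for kw in node.keywords}
        if "ensure_ascii" not in keyword_names:
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical source text used only to exercise the checker.
sample = '''import json
a = json.dumps({"k": "v"})
b = json.dumps({"k": "v"}, ensure_ascii=False)
c = json.dumps(
    {"k": "v"},
    default=str,
)
'''

print(find_dumps_without_ensure_ascii(sample))
```

A check like this reports the line where each call starts, so a call that spreads its arguments over several lines is still caught even when a line-based grep for `ensure_ascii` on the same line would be misleading.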
Type
🐛 Bug Fix
Changes
Testing
```
$ python -m pytest libs/community/tests/unit_tests/callbacks/test_bigquery_callbacks.py -k preserves_non_ascii -v
2 passed in 9.75s
```
Reverting any one of the `ensure_ascii=False` additions while keeping the new tests makes the corresponding assertion fail (`"你好" in serialized` → AssertionError). `ruff check` and `ruff format --check` pass.
Note
No behavior change for ASCII-only inputs (output bytes are identical). No deps added, no public API touched.
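The ASCII-only claim can be checked with the standard library alone. The sketch below uses made-up payloads (`ascii_payload`, `mixed_payload`), not data from the test suite.

```python
import json

# Hypothetical ASCII-only content, e.g. a tool call with plain English arguments.
ascii_payload = {"tool": "search", "args": {"q": "weather", "limit": 3}}

default_ascii = json.dumps(ascii_payload)
explicit_ascii = json.dumps(ascii_payload, ensure_ascii=False)

# For ASCII-only input the two serializations are the same string,
# and therefore the same bytes once encoded.
same_for_ascii = default_ascii == explicit_ascii
same_bytes = default_ascii.encode("utf-8") == explicit_ascii.encode("utf-8")

# For mixed content the strings differ, but they decode to the same value.
mixed_payload = {"q": "天気"}
default_mixed = json.dumps(mixed_payload)
explicit_mixed = json.dumps(mixed_payload, ensure_ascii=False)
differs_for_mixed = default_mixed != explicit_mixed
same_decoded = json.loads(default_mixed) == json.loads(explicit_mixed)

print(same_for_ascii, same_bytes, differs_for_mixed, same_decoded)
```

Because both forms decode to the same value, any consumer that parses the stored JSON should see no difference; only the stored text itself changes for non-ASCII content.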
Disclaimer: this PR was prepared with the assistance of an AI agent (Claude Code). All code and test changes were reviewed by the author before submission.