fix(community): preserve non-ASCII in BigQuery callback JSON serialization #1827

Open
Humphrey (HumphreySun98) wants to merge 1 commit into langchain-ai:main from HumphreySun98:fix/community-bigquery-callback-ensure-ascii

@HumphreySun98

Description

`BigQueryCallbackHandler` (sync, async, and langgraph variants) builds the `content` for BigQuery JSON columns (chain inputs, outputs, retrieved docs, tool calls, agent actions, langgraph attributes) with bare `json.dumps(...)`. Python's default `ensure_ascii=True` escapes every non-ASCII character to `\uXXXX`, so CJK, emoji, and accented text in user input lands in BigQuery as escape sequences and is unreadable when the row is inspected directly, which is the whole point of the callback.
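For reviewers who want to see the escaping in isolation, here is a minimal stdlib-only sketch. The `payload` value is an illustrative assumption, not data taken from the handler.

```python
import json

# Illustrative payload (an assumption, not taken from the handler).
payload = {"input": "你好, café 🎉"}

# Default behaviour: ensure_ascii=True escapes every non-ASCII character.
escaped = json.dumps(payload)

# With ensure_ascii=False the characters are written through unchanged.
readable = json.dumps(payload, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings are valid JSON and decode to the same object;
# only the stored text differs.
assert json.loads(escaped) == json.loads(readable) == payload
```

The two forms are equivalent for any JSON parser, so the difference only matters to someone reading the stored string directly.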

Pass `ensure_ascii=False` at every `json.dumps` site in `callbacks/bigquery_callback.py`: the 5 fall-throughs in `_prepare_arrow_batch` (JSON-extension column path + dict-in-string-column fallback), the `part_attributes` serialization for tool-call parts, and every `content=json.dumps(...)` invocation for inputs / outputs / docs / `AgentAction` / `AgentFinish` / langgraph `graph_name` across both the sync and async handlers.

This is the same convention already established in `langchain-openai`'s chat model, `langchain-core`'s `messages/utils.py:1810`, and our just-shipped genai/vertexai parsing fixes (#1804, #1823). The BigQuery callback is the last hot serialization path in this package that was still escaping.

Relevant issues

No issue was filed for this specific package; the problem surfaced via a repo-wide grep for `json.dumps` without `ensure_ascii` after #1804 landed. The end-user impact is the same: non-ASCII inputs become `\uXXXX` in storage.

Type

🐛 Bug Fix

Changes

  • `libs/community/langchain_google_community/callbacks/bigquery_callback.py`: add `ensure_ascii=False` to every `json.dumps` call. No control-flow changes.
  • `libs/community/tests/unit_tests/callbacks/test_bigquery_callbacks.py`: two focused tests that exercise `_prepare_arrow_batch` end-to-end and assert CJK + emoji round-trip into the resulting `pa.RecordBatch`:
    • `test_prepare_arrow_batch_preserves_non_ascii_in_json_column` (JSON extension path),
    • `test_prepare_arrow_batch_preserves_non_ascii_in_plain_dict_column` (dict-in-string fallback).
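The shape of these assertions can be sketched without pyarrow. `serialize_content` below is a hypothetical stand-in for a single `json.dumps` site; it is not the real `_prepare_arrow_batch`, and the actual tests in this PR inspect the resulting `pa.RecordBatch` instead.

```python
import json


def serialize_content(value: dict) -> str:
    # Hypothetical stand-in for one json.dumps call site in the handler.
    # The real code path (_prepare_arrow_batch) is not reproduced here.
    return json.dumps(value, ensure_ascii=False)


def test_preserves_non_ascii() -> None:
    serialized = serialize_content({"text": "你好 🎉"})
    # CJK and emoji should appear literally in the serialized string...
    assert "你好" in serialized
    assert "🎉" in serialized
    # ...and no \uXXXX escape should have been produced.
    assert "\\u" not in serialized


test_preserves_non_ascii()
```

Dropping `ensure_ascii=False` from the stand-in makes the first assertion fail, which mirrors the revert check described under Testing.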

Testing

```
$ python -m pytest libs/community/tests/unit_tests/callbacks/test_bigquery_callbacks.py -k preserves_non_ascii -v
2 passed in 9.75s
```

Reverting any one of the `ensure_ascii=False` additions while keeping the new tests makes the corresponding assertion fail (`"你好" in serialized` → AssertionError). `ruff check` and `ruff format --check` pass.

Note

No behavior change for ASCII-only inputs (output bytes are identical). No deps added, no public API touched.
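The ASCII-only claim is easy to check in isolation. This is a small sketch using a made-up payload (`ascii_payload` is an assumption for illustration only):

```python
import json

# Made-up ASCII-only payload, for illustration only.
ascii_payload = {"tool": "search", "args": {"query": "weather in Paris", "limit": 3}}

default_bytes = json.dumps(ascii_payload).encode("utf-8")
preserved_bytes = json.dumps(ascii_payload, ensure_ascii=False).encode("utf-8")

# With no non-ASCII characters present, the flag has nothing to change.
assert default_bytes == preserved_bytes
```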

Disclaimer: this PR was prepared with the assistance of an AI agent (Claude Code). All code and test changes were reviewed by the author before submission.

…ation

`BigQueryCallbackHandler` (and the langgraph/async variants) builds the
content for BigQuery JSON columns with bare `json.dumps(...)`. Python's
default `ensure_ascii=True` escapes every non-ASCII character to
`\uXXXX`, so CJK / emoji / accented text from chain inputs, outputs,
documents, tool calls, agent actions, and langgraph attributes lands in
storage as escape sequences and is unreadable when inspecting the
BigQuery row directly.

Pass `ensure_ascii=False` at every `json.dumps` site in
`callbacks/bigquery_callback.py` and add unit-test coverage on
`_prepare_arrow_batch` asserting CJK and emoji round-trip into the
resulting `pa.RecordBatch`.

The convention matches what `langchain-openai`, `langchain-core`
(`messages/utils.py:1810`), and our just-shipped genai/vertexai
`_parse_response_candidate` fixes (langchain-ai#1804, langchain-ai#1823) already use.
@HumphreySun98 Humphrey (HumphreySun98) force-pushed the fix/community-bigquery-callback-ensure-ascii branch from 268abea to b9ed021 Compare June 8, 2026 20:13