[GSoC 2026] Kafka Streams runner — KStreamsPayload Serde (GroupByKey prerequisite) by junaiddshaukat · Pull Request #39051 · apache/beam

junaiddshaukat · 2026-06-21T08:29:27Z

Summary

A Kafka Serde for KStreamsPayload so the envelope can cross a topic boundary. Until now it only flowed in-JVM via ProcessorContext#forward; GroupByKey introduces the first real topic (the key-based repartition topic, per the plan agreed with @je-ik), which needs the payload serialized. Split out as its own small PR ahead of GBK.

Scope

KStreamsPayloadSerde<T>: parameterized by the Coder<WindowedValue<T>> for
the data variant (different topics carry different element types; the watermark
variant is coder-independent). Wire format: a one-byte discriminator + body —
data = [0x00][windowedValueCoder-encoded value]; watermark =
[0x01][long millis][int sourcePartition][int totalSourcePartitions].
Unit tests: round-trip of data, watermark, and terminal MAX watermark, plus an
unknown-tag failure.
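
To make the wire format above concrete, here is a minimal standalone sketch of the two layouts using only java.io. It is not the PR's implementation: the class name PayloadWireFormatSketch, the helper names, and the DATA_TAG constant name are hypothetical, and the data body is treated as an already-encoded byte array rather than going through a Beam Coder<WindowedValue<T>>.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration of the described layout; not the actual KStreamsPayloadSerde.
public class PayloadWireFormatSketch {
  static final byte DATA_TAG = 0x00; // assumed constant name
  static final byte WATERMARK_TAG = 0x01;

  // data = [0x00][already-encoded windowed value]
  static byte[] encodeData(byte[] encodedValue) {
    byte[] out = new byte[encodedValue.length + 1];
    out[0] = DATA_TAG;
    System.arraycopy(encodedValue, 0, out, 1, encodedValue.length);
    return out;
  }

  // watermark = [0x01][long millis][int sourcePartition][int totalSourcePartitions]
  static byte[] encodeWatermark(long millis, int sourcePartition, int totalSourcePartitions) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream dataOut = new DataOutputStream(out);
      dataOut.writeByte(WATERMARK_TAG);
      dataOut.writeLong(millis);
      dataOut.writeInt(sourcePartition);
      dataOut.writeInt(totalSourcePartitions);
      dataOut.flush();
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Reads the one-byte discriminator and summarizes the body; rejects unknown tags.
  static String describe(byte[] bytes) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
      int tag = in.readUnsignedByte();
      switch (tag) {
        case DATA_TAG:
          return "data[" + (bytes.length - 1) + " bytes]";
        case WATERMARK_TAG:
          return "watermark[" + in.readLong() + "," + in.readInt() + "," + in.readInt() + "]";
        default:
          throw new IllegalArgumentException("Unknown payload tag: " + tag);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(describe(encodeData(new byte[] {1, 2, 3})));
    System.out.println(describe(encodeWatermark(1000L, 2, 8)));
  }
}
```

With fixed-width fields, a watermark record in this sketch is always 17 bytes: 1 tag byte, 8 bytes for the long, and 4 bytes for each int.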

Out of scope

Wiring the serde into the GBK repartition / watermark fan-out — that's the
GroupByKey PR.

Notes

Assumes non-null payloads: the topics it's used on (repartition, watermark
fan-out) are not log-compacted, so no tombstones occur.

Testing

./gradlew :runners:kafka-streams:check green; 4 unit tests.

Closes #39042
Refs #18479
cc @je-ik

KStreamsPayload has so far only flowed in-JVM via ProcessorContext#forward, so it needed no serialization. GroupByKey introduces the first real topic — the key-based repartition topic — so the payload now has to be serialized. KStreamsPayloadSerde<T> is parameterized by the Coder<WindowedValue<T>> for the data variant, since different topics carry different element types; the watermark report variant is coder-independent. The wire format is a one-byte discriminator followed by the variant body: data is the windowed-value-coder encoding; watermark is the millis + sourcePartition + totalSourcePartitions report. The serde assumes non-null payloads, since the topics it is used on (repartition and watermark fan-out) are not log-compacted. Refs apache#18479

gemini-code-assist · 2026-06-21T08:29:35Z

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a serialization mechanism for KStreamsPayload to support Kafka topic boundaries, which is a prerequisite for implementing GroupByKey in the Kafka Streams runner. By providing a custom Serde, the runner can now safely transmit both data elements and watermark reports across topics, whereas previously these payloads were restricted to in-JVM memory.

Highlights

New Kafka Serde Implementation: Introduced KStreamsPayloadSerde to enable serialization of KStreamsPayload objects for cross-topic communication.
Wire Format Definition: Implemented a byte-based discriminator format to distinguish between data payloads (using a configurable Coder) and watermark payloads.
Unit Testing: Added comprehensive tests for data and watermark round-trips, including handling of terminal watermarks and error cases for unknown tags.

gemini-code-assist

Code Review

This pull request introduces KStreamsPayloadSerde and its corresponding unit tests to support the serialization and deserialization of KStreamsPayload (both data and watermark variants) across Kafka topic boundaries. The review feedback suggests handling null inputs gracefully in both the serializer and deserializer to prevent potential NullPointerExceptions, which aligns with standard Kafka serialization practices.

gemini-code-assist · 2026-06-21T08:30:07Z

+    public byte[] serialize(String topic, KStreamsPayload<T> payload) {
+      ByteArrayOutputStream out = new ByteArrayOutputStream();
+      try {


In Kafka, serializers should handle null payloads gracefully by returning null (representing a tombstone or empty payload). Currently, passing a null payload will cause a NullPointerException when calling payload.isData().

    @Override
    public byte[] serialize(String topic, KStreamsPayload<T> payload) {
      if (payload == null) {
        return null;
      }
      ByteArrayOutputStream out = new ByteArrayOutputStream();

gemini-code-assist · 2026-06-21T08:30:07Z

+    @Override
+    public KStreamsPayload<T> deserialize(String topic, byte[] bytes) {
+      ByteArrayInputStream in = new ByteArrayInputStream(bytes);


In Kafka, deserializers should handle null bytes gracefully by returning null. Currently, passing null bytes to ByteArrayInputStream will cause a NullPointerException.

    @Override
    public KStreamsPayload<T> deserialize(String topic, byte[] bytes) {
      if (bytes == null) {
        return null;
      }
      ByteArrayInputStream in = new ByteArrayInputStream(bytes);

github-actions · 2026-06-21T09:23:09Z

Assigning reviewers:

R: @Abacn added as fallback since no labels match configuration

je-ik

I would be in favor of moving to protobuf serialization. Something like

message KafkaStreamsPayload {
  enum Type {
    UNKNOWN = 0;
    WATERMARK = 1;
    DATA = 2;
  }
  message WatermarkPayload {
    uint64 millis = 1;
    uint32 sourcePartition = 2;
    uint32 totalPartitions = 3;
  }
  message DataPayload {
    bytes payload = 1;
  }
  Type type = 1;
  oneof payload {
    WatermarkPayload watermark = 2;
    DataPayload data = 3;
  }
}

je-ik · 2026-06-22T07:59:42Z

+          DataOutputStream dataOut = new DataOutputStream(out);
+          dataOut.writeByte(WATERMARK_TAG);
+          dataOut.writeLong(watermark.getWatermarkMillis());
+          dataOut.writeInt(watermark.getSourcePartition());


It would probably be better to use protocol buffers for the serialization. They naturally support compatible schema evolution and also reduce payload sizes through varint encoding.
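
As a rough illustration of the payload-size point, the sketch below computes how many bytes a base-128 varint (the encoding protobuf uses for uint64 and uint32 values) needs for a given value. The class and method names are hypothetical, the sample timestamp is an arbitrary mid-2026 epoch-millis value, and the numbers cover only the raw field values, not protobuf field tags or message nesting.

```java
// Hypothetical helper for comparing fixed-width and varint field sizes.
public class VarintSizeSketch {

  // Number of bytes a base-128 varint needs for the value, treated as unsigned 64-bit.
  static int varintSize(long value) {
    int size = 1;
    // Each varint byte carries 7 payload bits; continue while higher bits remain.
    while ((value & ~0x7FL) != 0) {
      value >>>= 7;
      size++;
    }
    return size;
  }

  public static void main(String[] args) {
    long sampleMillis = 1_782_030_567_000L; // arbitrary epoch-millis value in 2026
    System.out.println("millis as varint: " + varintSize(sampleMillis) + " bytes (fixed long: 8)");
    System.out.println("partition 2 as varint: " + varintSize(2) + " bytes (fixed int: 4)");
    System.out.println("Long.MAX_VALUE as varint: " + varintSize(Long.MAX_VALUE) + " bytes");
    System.out.println("-1 as unsigned varint: " + varintSize(-1L) + " bytes");
  }
}
```

One caveat worth checking before adopting uint64 for millis: a negative timestamp stored as a plain (non-zigzag) varint occupies 10 bytes, which is larger than the fixed 8-byte long, so the saving depends on the value range the watermark actually takes.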

github-actions Bot added the runners label Jun 21, 2026

gemini-code-assist Bot reviewed Jun 21, 2026

github-actions Bot added the Next Action: Reviewers label Jun 21, 2026

je-ik self-requested a review June 22, 2026 07:57

je-ik reviewed Jun 22, 2026

junaiddshaukat wants to merge 1 commit into apache:feat/18479-kafka-streams-runner-skeleton from junaiddshaukat:feat/ks-payload-serde