[GSoC 2026] Kafka Streams runner — KStreamsPayload Serde (GroupByKey prerequisite)#39051
KStreamsPayload has so far only flowed in-JVM via ProcessorContext#forward, so it needed no serialization. GroupByKey introduces the first real topic — the key-based repartition topic — so the payload now has to be serialized. KStreamsPayloadSerde<T> is parameterized by the Coder<WindowedValue<T>> for the data variant, since different topics carry different element types; the watermark report variant is coder-independent. The wire format is a one-byte discriminator followed by the variant body: data is the windowed-value-coder encoding; watermark is the millis + sourcePartition + totalSourcePartitions report. The serde assumes non-null payloads, since the topics it is used on (repartition and watermark fan-out) are not log-compacted. Refs apache#18479
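To make the layout above concrete, here is a minimal sketch of the framing in plain Java. It is an illustration under stated assumptions, not the PR's KStreamsPayloadSerde: the class name WireFormatSketch, its methods, and the DATA_TAG constant are hypothetical (only WATERMARK_TAG appears in the diff), the tag values 0x00 and 0x01 are taken from the PR description, and the data body is treated as opaque bytes standing in for whatever the Coder<WindowedValue<T>> would produce.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the serde's framing logic; not the PR's actual class.
final class WireFormatSketch {
  static final byte DATA_TAG = 0x00;      // assumed name; value from the PR description
  static final byte WATERMARK_TAG = 0x01; // value from the PR description

  /** Frames an already-encoded windowed value as [0x00][body]. */
  static byte[] encodeData(byte[] encodedWindowedValue) {
    byte[] framed = new byte[encodedWindowedValue.length + 1];
    framed[0] = DATA_TAG;
    System.arraycopy(encodedWindowedValue, 0, framed, 1, encodedWindowedValue.length);
    return framed;
  }

  /** Frames a watermark report as [0x01][long millis][int sourcePartition][int total]. */
  static byte[] encodeWatermark(long millis, int sourcePartition, int totalSourcePartitions) {
    ByteArrayOutputStream out = new ByteArrayOutputStream(17);
    try (DataOutputStream dataOut = new DataOutputStream(out)) {
      dataOut.writeByte(WATERMARK_TAG);
      dataOut.writeLong(millis);
      dataOut.writeInt(sourcePartition);
      dataOut.writeInt(totalSourcePartitions);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  /** Reads the discriminator and summarizes the body; rejects unknown tags. */
  static String describe(byte[] bytes) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      int tag = in.readUnsignedByte();
      if (tag == DATA_TAG) {
        // The real serde would hand the remaining bytes to the windowed-value coder.
        return "data(" + in.available() + " bytes)";
      } else if (tag == WATERMARK_TAG) {
        long millis = in.readLong();
        int sourcePartition = in.readInt();
        int total = in.readInt();
        return "watermark(millis=" + millis + ", partition=" + sourcePartition + "/" + total + ")";
      }
      throw new IllegalArgumentException("Unknown payload tag: " + tag);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this framing a watermark report always occupies 17 bytes (1 tag + 8 + 4 + 4), while a data record is one byte longer than its coder output.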
Summary of Changes
This pull request introduces a serialization mechanism for KStreamsPayload to support Kafka topic boundaries, which is a prerequisite for implementing GroupByKey in the Kafka Streams runner. By providing a custom Serde, the runner can now safely transmit both data elements and watermark reports across topics, whereas previously these payloads were restricted to in-JVM memory.
Code Review
This pull request introduces KStreamsPayloadSerde and its corresponding unit tests to support the serialization and deserialization of KStreamsPayload (both data and watermark variants) across Kafka topic boundaries. The review feedback suggests handling null inputs gracefully in both the serializer and deserializer to prevent potential NullPointerExceptions, which aligns with standard Kafka serialization practices.
    public byte[] serialize(String topic, KStreamsPayload<T> payload) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      try {
In Kafka, serializers should handle null payloads gracefully by returning null (representing a tombstone or empty payload). Currently, passing a null payload will cause a NullPointerException when calling payload.isData().
    @Override
    public byte[] serialize(String topic, KStreamsPayload<T> payload) {
      if (payload == null) {
        return null;
      }
      ByteArrayOutputStream out = new ByteArrayOutputStream();

    @Override
    public KStreamsPayload<T> deserialize(String topic, byte[] bytes) {
      ByteArrayInputStream in = new ByteArrayInputStream(bytes);
In Kafka, deserializers should handle null bytes gracefully by returning null. Currently, passing null bytes to ByteArrayInputStream will cause a NullPointerException.
    @Override
    public KStreamsPayload<T> deserialize(String topic, byte[] bytes) {
      if (bytes == null) {
        return null;
      }
      ByteArrayInputStream in = new ByteArrayInputStream(bytes);
je-ik left a comment
I would be in favor of moving to protobuf serialization. Something like
    message KafkaStreamsPayload {
      enum Type {
        UNKNOWN = 0;
        WATERMARK = 1;
        DATA = 2;
      }
      message WatermarkPayload {
        uint64 millis = 1;
        uint32 sourcePartition = 2;
        uint32 totalPartitions = 3;
      }
      message DataPayload {
        bytes payload = 1;
      }
      Type type = 1;
      oneof payload {
        WatermarkPayload watermark = 2;
        DataPayload data = 3;
      }
    }
    DataOutputStream dataOut = new DataOutputStream(out);
    dataOut.writeByte(WATERMARK_TAG);
    dataOut.writeLong(watermark.getWatermarkMillis());
    dataOut.writeInt(watermark.getSourcePartition());
It would probably be better to use protocol buffers for the serialization. They naturally support schema evolution in a compatible way and also reduce payload sizes through varint encoding.
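As a rough illustration of the payload-size point, the sketch below computes how many bytes a base-128 varint (the integer encoding protocol buffers use for uint32/uint64 fields) needs for a given value. The class and method names are hypothetical, and the numbers cover only the integer bodies; a real protobuf message also spends bytes on field tags and on the enclosing message.

```java
// Hypothetical helper; mirrors the base-128 varint size rule, not any PR code.
final class VarintSizeSketch {
  /** Bytes needed to encode the value as an unsigned base-128 varint (7 payload bits per byte). */
  static int varintSize(long value) {
    int size = 1;
    // Keep shifting while bits remain above the low 7; the unsigned shift treats
    // the long as a 64-bit unsigned quantity, as the uint64 wire encoding does.
    while ((value & ~0x7FL) != 0) {
      value >>>= 7;
      size++;
    }
    return size;
  }
}
```

Under this rule a small partition index such as 3 takes 1 byte instead of the fixed 4 written by writeInt, and a present-day epoch timestamp such as 1_700_000_000_000 ms takes 6 bytes instead of the fixed 8 written by writeLong. A value with the top bit set (for example a negative long reinterpreted as unsigned) would take 10 bytes, so the saving depends on the value range.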
Summary
A Kafka Serde for KStreamsPayload so the envelope can cross a topic boundary. Until now it only flowed in-JVM via ProcessorContext#forward; GroupByKey introduces the first real topic (the key-based repartition topic, per the plan agreed with @je-ik), which needs the payload serialized. Split out as its own small PR ahead of GBK.
Scope
KStreamsPayloadSerde<T>: parameterized by the Coder<WindowedValue<T>> for the data variant (different topics carry different element types; the watermark variant is coder-independent). Wire format: a one-byte discriminator + body: data = [0x00][windowedValueCoder-encoded value]; watermark = [0x01][long millis][int sourcePartition][int totalSourcePartitions].
unknown-tag failure.
Out of scope
GroupByKey PR.
Notes
fan-out) are not log-compacted, so no tombstones occur.
Testing
./gradlew :runners:kafka-streams:check green; 4 unit tests.
Closes #39042
Refs #18479
cc @je-ik