[FLINK-38453] Add full splits to KafkaSourceEnumState

The KafkaEnumerator's state contains only the TopicPartitions, not the offsets, so it does not hold the full split state, contrary to the design intent.
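
To illustrate the gap, here is a minimal sketch of the state shape before and after the change; the class and field names are illustrative, not the actual KafkaSourceEnumState API:

import java.util.Map;
import java.util.Set;
import org.apache.kafka.common.TopicPartition;

// Before: only the partitions are checkpointed; the resolved starting
// offsets are lost, so the offset initializer must be consulted again
// on recovery.
class EnumStateBefore {
    Set<TopicPartition> assignedPartitions;
}

// After: the full split state (partition plus starting offset) is
// checkpointed, so recovery never re-runs the offset initializer.
class EnumStateAfter {
    Map<TopicPartition, Long> partitionsWithStartingOffsets;
}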

There are a couple of issues with that approach. It implicitly assumes that splits are fully assigned to readers before the first checkpoint. Otherwise, the enumerator will invoke the offset initializer again on recovery from such a checkpoint, leading to inconsistencies: with LATEST, offsets may be resolved during the first attempt for some partitions and during the second attempt for others.
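
The following self-contained sketch (hypothetical partition names and offsets, no connector API) shows how resolving LATEST at two different times produces a mixed starting position:

import java.util.HashMap;
import java.util.Map;

public class LatestInconsistencySketch {
    public static void main(String[] args) {
        // Broker-side "latest" offsets at the time of the first discovery.
        Map<String, Long> latestAtFirstAttempt = Map.of("p0", 100L, "p1", 200L);

        // Only p0 was assigned to a reader before the checkpoint, so only
        // its resolved offset survives recovery.
        Map<String, Long> effectiveStartOffsets = new HashMap<>();
        effectiveStartOffsets.put("p0", latestAtFirstAttempt.get("p0"));

        // Producers keep writing, so "latest" moves on before recovery.
        Map<String, Long> latestAtRecovery = Map.of("p0", 150L, "p1", 260L);

        // On recovery the enumerator re-invokes the initializer for the
        // still-unassigned p1 and picks up the new latest offset.
        effectiveStartOffsets.put("p1", latestAtRecovery.get("p1"));

        // Mixed result: p0 starts from the first attempt, p1 from the second.
        System.out.println(effectiveStartOffsets); // {p0=100, p1=260}
    }
}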

Through the addSplitBack callback, these scenarios can also occur later in BATCH mode, which leads to duplicate rows (in the case of EARLIEST or SPECIFIC-OFFSETS) or data loss (in the case of LATEST). Finally, it is not possible to safely use KafkaSource as part of a HybridSource, because the offset initializer cannot even be recreated on recovery.

All of these cases are solved by also retaining the offsets in the enumerator state. To that end, this commit merges the async discovery phases so that splits are initialized from the partitions immediately. Any subsequent checkpoint then contains the proper start offsets.
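
A sketch of the merged flow, using the connector's KafkaPartitionSplit and OffsetsInitializer types but with an illustrative method and retriever parameter rather than the actual enumerator code:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.connector.kafka.source.split.KafkaPartitionSplit;
import org.apache.kafka.common.TopicPartition;

class DiscoverySketch {
    // Resolve starting offsets in the same step as partition discovery,
    // instead of deferring resolution until splits are assigned.
    List<KafkaPartitionSplit> initializeSplits(
            Set<TopicPartition> discoveredPartitions,
            OffsetsInitializer startingOffsetsInitializer,
            OffsetsInitializer.PartitionOffsetsRetriever retriever) {
        Map<TopicPartition, Long> startingOffsets =
                startingOffsetsInitializer.getPartitionOffsets(discoveredPartitions, retriever);
        List<KafkaPartitionSplit> splits = new ArrayList<>();
        for (TopicPartition tp : discoveredPartitions) {
            // The full split (partition + starting offset) now lives in the
            // enumerator state, so every subsequent checkpoint carries it.
            splits.add(new KafkaPartitionSplit(tp, startingOffsets.get(tp)));
        }
        return splits;
    }
}
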
README.md

Apache Flink Kafka Connector

This repository contains the official Apache Flink Kafka connector.

Apache Flink

Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities.

Learn more about Flink at https://flink.apache.org/

Building the Apache Flink Kafka Connector from Source

Prerequisites:

  • Unix-like environment (we use Linux, Mac OS X)
  • Git
  • Maven (we recommend version 3.8.6)
  • Java 11

git clone https://github.com/apache/flink-connector-kafka.git
cd flink-connector-kafka
mvn clean package -DskipTests

The resulting jars can be found in the target directory of the respective module.

Developing Flink

The Flink committers use IntelliJ IDEA to develop the Flink codebase. We recommend IntelliJ IDEA for developing projects that involve Scala code.

Minimal requirements for an IDE are:

  • Support for Java and Scala (also mixed projects)
  • Support for Maven with Java and Scala

IntelliJ IDEA

The IntelliJ IDE supports Maven out of the box and offers a plugin for Scala development.

Check out our Setting up IntelliJ guide for details.

Support

Don’t hesitate to ask!

Contact the developers and community on the mailing lists if you need any help.

Open an issue if you find a bug in Flink.

Documentation

The documentation of Apache Flink is located on the website: https://flink.apache.org or in the docs/ directory of the source code.

Fork and Contribute

This is an active open-source project. We are always open to people who want to use the system or contribute to it. Contact us if you are looking for implementation tasks that fit your skills. This article describes how to contribute to Apache Flink.

About

Apache Flink is an open source project of The Apache Software Foundation (ASF). The Apache Flink project originated from the Stratosphere research project.