Skip to content

Feat/mm2 fault tolerance#21848

Open
rjain2778 wants to merge 407 commits intoapache:trunkfrom
rjain2778:feat/mm2-fault-tolerance
Open

Feat/mm2 fault tolerance#21848
rjain2778 wants to merge 407 commits intoapache:trunkfrom
rjain2778:feat/mm2-fault-tolerance

Conversation

@rjain2778
Copy link

Kafka Data Replication Project

Problem Statement

Overview

This project implements fault-tolerant data replication between two Kafka clusters using an enhanced MirrorMaker 2 (MM2) connector. The solution addresses two critical failure modes that vanilla MM2 does not handle:

  1. Log Truncation: When Kafka's retention policy deletes messages before MM2 replicates them, vanilla MM2 silently skips those messages, leaving the disaster recovery (DR) cluster with permanent, invisible gaps in its Write-Ahead Log (WAL).
    Solution: Added checkForLogTruncation() method called at the top of every poll() cycle:
    Why fail-fast? Log truncation represents irrecoverable data loss. The DR cluster is now permanently incomplete. Continuing replication would create a DR cluster that appears healthy but cannot fully reconstruct system state on failover.

  2. Topic Reset: When an operator deletes and recreates a source topic, vanilla MM2 crashes with OffsetOutOfRangeException or silently skips new messages because its stored checkpoint offsets no longer exist.
    Solution: Two complementary detection mechanisms:

  3. Leader Epoch Regression: Every ConsumerRecord carries the partition's leader epoch. A backward jump triggers

  4. OffsetOutOfRangeException Fallback: Catches resets while MM2 was paused:
    Why auto-recover? A topic reset is deliberate. The operator chose to wipe the source. The correct response is to replicate the new topic from the beginning.

FrankYang0529 and others added 30 commits January 15, 2025 16:47
apache#18549)

Reviewers: Ismael Juma <ismael@juma.me.uk>, Gaurav Narula <gaurav_narula2@apple.com>, TengYao Chi <kitingiao@gmail.com>
…pache#18543)

Reviewers: Bruno Cadonna <bruno@confluent.io>, Alieh Saeedi <asaeedi@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>
…ly (apache#18490)

RocksDBTimeOrderedKeyValueBuffer is not initialize with serdes provides
via Joined, but always uses serdes from StreamsConfig.

Reviewers: Bill Bejeck <bill@confluent.io>
…n without records (apache#18448)

Fix the issue where producer.commitTransaction under transaction version 2 throws error if no partition or offset is added to transaction. The solution is to avoid sending the endTxnRequest unless producer.send or producer.sendOffsetsToTransaction is triggered.

Reviewers: Justine Olshan <jolshan@confluent.io>
Reviewers: Bill Bejeck <bill@confluent.io>
A prior commit introduced checking for the version of a node related to move to log4j2 but it was causing an error
AttributeError("'ClusterNode' object has no attribute 'version'") This PR uses the get_version method from version.py which checks if the Node has a version attribute preventing an error.

Reviewers: Matthias Sax <mjsax@apache.org>
Removed Optional for SharePartitionManager and ClientMetricsManager as zookeeper code is being removed. Also removed asScala and asJava conversion in KafkaApis.handleListClientMetricsResources, moved to java stream.

Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
…fter removing zookeeper (apache#18365)

This patch introduces a new page to document the configs and metrics that have been removed in the transition to 4.0. While these removed items lack a formal deprecation cycle as they are part of KIP-500, KIP-500 itself does not provide an exhaustive list of all impacted configs and metrics. Therefore, this new page aims to assist Kafka users in understanding the specific configs and metrics that have been removed in the 4.0 release.

Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
The PR removes dependency of server module on share-coordinator, rather it should be other way. Moved the ShareCoordinatorConfig class from server to share-coordinator.

Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Chia-Ping Tsai <chia7712@gmail.com>
…18414)

In 4.0, there is no ZK mode and both of these configs are required in kraft mode.

Reviewers: Ismael Juma <ismael@juma.me.uk>
Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Ismael Juma <ismael@juma.me.uk>, Divij Vaidya <diviv@amazon.com>
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Chia-Ping Tsai <chia7712@gmail.com>
Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
…EFAULT and NUM_RECOVERY_THREADS_PER_DATA_DIR_CONFIG (apache#18106)

Reviewers: Divij Vaidya <diviv@amazon.com>
…ns (apache#18568)


Reviewers: Mickael Maison <mickael.maison@gmail.com>
…rtitionsMetadata, ZkConfigRepository, DelayedDeleteTopics (apache#18574)


Reviewers: Mickael Maison <mickael.maison@gmail.com>
…rtitionsAssignedCallback (apache#18515)

Reviewers: Lianet Magrans <lmagrans@confluent.io>
Reviewers: Andrew Schofield <aschofield@confluent.io>
Reviewers: Mickael Maison <mickael.maison@gmail.com>, Ismael Juma <ismael@juma.me.uk>
…pache#18565)

Since the example.com DNS lookup changed the second time within one
year, we rewrote the unit tests for ClientUtils so that they do
not make a real DNS lookup to the outside but use mocks.

Reviewers: PoAn Yang <payang@apache.org>, Chia-Ping Tsai <chia7712@gmail.com>, Lianet Magrans <lmagrans@confluent.io>
Remove KafkaController and related unused references:

* ControllerChannelContext
* ControllerChannelManager
* ControllerEventManager
* ControllerState
* PartitionStateMachine
* ReplicaStateMachine
* TopicDeletionManager
* ZkBrokerEpochManager

Reviewers: Ismael Juma <ismael@juma.me.uk>
…#18406)

Add some logs when offline/online happens.

Reviewers: David Jacot <djacot@confluent.io>
cmccabe and others added 10 commits March 7, 2025 13:58
In "Upgrading to 4.0.0 from any version 0.8.x through 3.9.x" section, we
directly give instructions about [Upgrading to KRaft-based
clusters](https://kafka.apache.org/documentation/#upgrade_390_kraft),
but there might still be some users under ZK cluster before upgrading to
v4.0.0. We need to make it clear that they need to upgrade to KRaft mode
first before upgrading to v4.0.0 in "Upgrading to 4.0.0 from any version
0.8.x through 3.9.x" section.

Reviewers: TengYao Chi <kitingiao@gmail.com>, Ken Huang <s7133700@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
)

Use "incompatible" instead of an empty cell in Kafka Streams broker
compatibility docs. Also update compatibility matrix for 4.0.0 release.

Reviewers: Matthias J. Sax <matthias@confluent.io>
Add new section for Kafka 4.0 compatibility metrics for user, so user
can check server and client in this section.

Reviewers: Divij Vaidya <diviv@amazon.com>, Matthias J. Sax <matthias@confluent.io>
This patch adds a section about upgrading clients to the upgrade notes.

Reviewers: Ismael Juma <ismael@juma.me.uk>, David Jacot <djacot@confluent.io>
Skip kraft.version when applying FeatureLevelRecord records. The kraft.version is stored as control records and not as metadata records. This solution has the benefits of removing from snapshots any FeatureLevelRecord for kraft.version that was incorrectly written to the log and allows ApiVersions to report the correct finalized kraft.version.

Reviewers: Colin P. McCabe <cmccabe@apache.org>
…ion (apache#19164)

Fixes two issues:
 - only commit TX if no revoked tasks need to be committed
 - commit revoked tasks after punctuation triggered

Reviewers: Lucas Brutschy <lbrutschy@confluent.io>, Anna Sophie Blee-Goldman <sophie@responsive.dev>, Bruno Cadonna <bruno@confluent.io>, Bill Bejeck <bill@confluent.io>
@github-actions github-actions bot added triage PRs from the community streams core Kafka Broker producer consumer tools connect performance kraft mirror-maker-2 dependencies Pull requests that update a dependency file storage Pull requests that target the storage module tiered-storage Related to the Tiered Storage feature KIP-932 Queues for Kafka build Gradle build or GitHub Actions docker Official Docker image generator RPC and Record code generator transactions Transactions and EOS clients group-coordinator labels Mar 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build Gradle build or GitHub Actions clients connect consumer core Kafka Broker dependencies Pull requests that update a dependency file docker Official Docker image generator RPC and Record code generator group-coordinator KIP-932 Queues for Kafka kraft mirror-maker-2 performance producer storage Pull requests that target the storage module streams tiered-storage Related to the Tiered Storage feature tools transactions Transactions and EOS triage PRs from the community

Projects

None yet

Development

Successfully merging this pull request may close these issues.