
Kafka MirrorMaker 2.0 Duplicates Each Message

Introduction

MirrorMaker 2.0 is designed for reliable cross-cluster replication, not for guaranteeing a globally duplicate-free stream under every failure mode. If you see duplicated records, the first step is to determine whether you are seeing normal at-least-once behavior, a replication loop, or an application-level replay.

Why Duplicates Happen

MirrorMaker 2.0 is built on Kafka Connect. That means it inherits Kafka Connect retry behavior and Kafka's general bias toward at-least-once delivery unless you design the whole pipeline very carefully.

The usual duplicate scenarios are:

  • a record is produced to the target cluster, but the connector task crashes before the source offset is committed
  • the task restarts and republishes the same source record
  • active-active replication creates a loop where already mirrored records are mirrored back again
  • downstream consumers treat retry delivery as new business events because they do not deduplicate by key or id

That first case is the most common. It is not MirrorMaker being broken; it is the normal result of recovering safely after uncertainty.
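The crash-before-commit case can be illustrated with a toy simulation. This is a minimal sketch, not MirrorMaker or Kafka Connect code: the names `run_task`, `committed_offset`, and `crash_after_produce_at` are hypothetical, and the "clusters" are plain Python lists.

```python
# Toy model of an at-least-once mirror task. Nothing here is MirrorMaker API.
source = ["r0", "r1", "r2", "r3"]   # records in one source partition
target = []                          # records written to the target cluster
committed_offset = 0                 # next source offset to read after a restart


def run_task(crash_after_produce_at=None):
    """Copy records starting at the committed offset.

    If crash_after_produce_at is set, the task "crashes" right after
    producing that offset to the target but before committing it.
    """
    global committed_offset
    for offset in range(committed_offset, len(source)):
        target.append(source[offset])      # step 1: produce to the target
        if offset == crash_after_produce_at:
            return                         # crash: this offset is never committed
        committed_offset = offset + 1      # step 2: commit the source offset


run_task(crash_after_produce_at=2)  # produces r0, r1, r2 but commits only through r1
run_task()                          # restart resumes at offset 2 and re-sends r2

print(target)  # ['r0', 'r1', 'r2', 'r2', 'r3']
```

The restarted task cannot know whether `r2` reached the target, so it sends it again. Skipping it instead would risk losing the record, which is the worse failure for a replication tool.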

Distinguish Retries From Replication Loops

A few duplicate records during connector restarts point to retry behavior. A steadily growing echo of the same data usually points to a loop.

MirrorMaker 2.0 uses replication policies to rename remote topics. With the default policy, a mirrored topic appears on the target as sourceCluster.topicName, so orders mirrored from cluster A shows up on cluster B as A.orders. The prefix marks the topic as remote, which helps prevent it from being mistaken for a local topic on B and mirrored back to A.
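The naming idea can be sketched in a few lines of Python. This is a simplified model of prefix-based loop detection, not the actual Java implementation; the helper names `format_remote_topic`, `upstream_aliases`, and `would_loop` are made up for illustration, and the "." separator is assumed.

```python
SEPARATOR = "."  # assumed separator between cluster alias and topic name


def format_remote_topic(source_alias, topic):
    """Name a mirrored topic gets on the target: '<source>.<topic>'."""
    return f"{source_alias}{SEPARATOR}{topic}"


def upstream_aliases(topic, known_aliases):
    """Peel known cluster aliases off the front of a topic name."""
    aliases = []
    while True:
        prefix, sep, rest = topic.partition(SEPARATOR)
        if not sep or prefix not in known_aliases:
            return aliases
        aliases.append(prefix)
        topic = rest


def would_loop(topic, target_alias, known_aliases):
    """True if the topic name shows it already passed through the target."""
    return target_alias in upstream_aliases(topic, known_aliases)


clusters = {"A", "B"}
mirrored = format_remote_topic("A", "orders")   # name of the copy on B

print(mirrored)                               # A.orders
print(would_loop("orders", "B", clusters))    # False: local topic on A, safe to mirror
print(would_loop(mirrored, "A", clusters))    # True: A.orders going back to A is a loop
```

The point of the sketch is that the prefix carries the topic's history. Without it, the copy on B would be indistinguishable from a local `orders` topic, and nothing in the name would reveal the cycle.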

A minimal configuration looks like this:

properties
clusters = A, B
A.bootstrap.servers = kafka-a:9092
B.bootstrap.servers = kafka-b:9092

A->B.enabled = true
A->B.topics = orders
A->B.emit.checkpoints.enabled = true
replication.policy.class = org.apache.kafka.connect.mirror.DefaultReplicationPolicy

If you disable or replace the replication policy without understanding the consequences, loop prevention becomes much harder.

You should also inspect the target cluster directly:

bash
kafka-topics.sh --bootstrap-server kafka-b:9092 --list

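With the one-way A->B flow above, cluster B should contain A.orders. If B also shows an orders topic that nobody created there, or chained names such as A.B.orders that keep growing, you likely have a topology or policy problem rather than a simple retry duplicate.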
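The topic listing can also be checked mechanically. The sketch below assumes a two-cluster setup with aliases A and B and the "." separator; `suspicious_topics` is a hypothetical helper, and the sample listing is invented stand-in text for the command's output. In a deliberate multi-hop topology (for example A to B to C), names with two prefixes are legitimate, so treat the result as a hint rather than a verdict.

```python
def suspicious_topics(topic_names, cluster_aliases):
    """Return topics whose name starts with two or more cluster-alias prefixes."""
    flagged = []
    for name in topic_names:
        parts = name.split(".")
        hops = 0
        for part in parts[:-1]:     # the last part is always treated as the topic name
            if part not in cluster_aliases:
                break
            hops += 1
        if hops >= 2:
            flagged.append(name)
    return flagged


# Sample text standing in for the output of `kafka-topics.sh --list`.
listing = """\
orders
A.orders
A.B.orders
A.B.A.orders
payments.v2
"""

print(suspicious_topics(listing.split(), {"A", "B"}))
# ['A.B.orders', 'A.B.A.orders']
```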

What You Can Improve

There is no single switch that turns MirrorMaker 2.0 into perfect exactly-once cross-cluster replication for every topology. Practical mitigation focuses on limiting duplicate windows and making consumers tolerant of them.

Useful steps include:

  • keep the default replication policy unless you have a strong reason to replace it
  • avoid bidirectional replication on the same topic unless you have designed for it explicitly
  • use record keys or event ids so consumers can detect repeated business events
  • monitor task restarts, rebalances, and connector failures because duplicates often correlate with them

If the business event already has a stable identifier, deduplication at the consumer or sink is usually the cleanest fix. For example, a database sink can upsert by event id instead of blindly inserting every replayed record.

A Simple Deduplication Example

The Python snippet below shows the consumer-side idea using an event id. It is not MirrorMaker code; it is a demonstration of how downstream systems usually protect themselves from retry delivery.

python
messages = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 20},
    {"event_id": "e1", "amount": 10},
]

seen = set()
processed = []

for message in messages:
    if message["event_id"] in seen:
        continue
    seen.add(message["event_id"])
    processed.append(message)

print(processed)

In production, the seen set would usually be a durable store or a unique database index rather than process memory.
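A unique database index gives the same protection without an in-process set. The following is a minimal sketch using Python's built-in `sqlite3` module; the `payments` table and its columns are hypothetical, and a real sink would use whatever database and schema it already has.

```python
import sqlite3

# In-memory database for the demo; a real sink would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (event_id TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)

messages = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 20},
    {"event_id": "e1", "amount": 10},  # replayed delivery
]

for message in messages:
    # The primary key rejects a second row for the same event_id;
    # OR IGNORE turns that rejection into a silent no-op.
    conn.execute(
        "INSERT OR IGNORE INTO payments (event_id, amount) VALUES (?, ?)",
        (message["event_id"], message["amount"]),
    )
conn.commit()

rows = conn.execute(
    "SELECT event_id, amount FROM payments ORDER BY event_id"
).fetchall()
print(rows)  # [('e1', 10), ('e2', 20)]
```

Because the uniqueness check lives in the database, it survives consumer restarts and works across multiple consumer instances, which an in-memory set cannot do.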

Common Pitfalls

The biggest pitfall is assuming duplicates mean Kafka lost ordering guarantees or corrupted data. In many cases the system is behaving exactly as an at-least-once pipeline should.

Another mistake is enabling active-active replication without understanding topic renaming and loop prevention. That can multiply messages far beyond the occasional retry duplicate.

Teams also focus only on MirrorMaker configuration and ignore the sink. If the sink inserts every replay as a brand-new row, the duplicate problem remains even after connector tuning.

Summary

  • MirrorMaker 2.0 typically provides at-least-once delivery, so some duplicates during failure recovery are expected.
  • Persistent duplicate growth often indicates a replication loop or bad topic policy.
  • Keep the default replication policy unless you are deliberately designing a custom topology.
  • Use stable event ids or idempotent sink logic to make consumers tolerant of replay.
  • Diagnose task restarts and topic naming before assuming MirrorMaker itself is broken.
