
kafka embedded java.io.FileNotFoundException /tmp/kafka-7785736914220873149/replication-offset-checkpoint.tmp


Introduction

This embedded Kafka test error usually means the broker tried to write checkpoint files into a temporary log directory that no longer existed or was no longer writable. In most cases the real cause is test lifecycle ordering: cleanup happens too early, parallel tests collide, or the temporary directory behaves differently in CI than on a local machine.

What the Checkpoint File Is

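Kafka stores internal broker state in files under its log directories. One of those files is replication-offset-checkpoint, which records the high watermark of each partition. Kafka does not overwrite it in place: it writes the new contents to replication-offset-checkpoint.tmp and then renames that file over the old one, so the .tmp variant exists only briefly during each rewrite.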

In a real cluster those directories are stable. In embedded-test setups they are often under a temporary location such as /tmp/kafka-..., which is where the trouble starts. If the directory disappears before broker shutdown is complete, Kafka background threads can still attempt to write checkpoint data and hit FileNotFoundException.

So this error is usually about the test environment, not about Kafka replication logic being conceptually broken.
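To make the failure mode concrete, here is a minimal sketch of the write-to-.tmp-then-rename pattern described above. It is an imitation for illustration only, not Kafka's implementation, and the names CheckpointWriteSketch and writeCheckpoint are hypothetical.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Simplified imitation of a "write to .tmp, then rename" checkpoint update. Not Kafka code. */
final class CheckpointWriteSketch {

    /** Writes content to name.tmp inside dir, then renames it to name. */
    static Path writeCheckpoint(Path dir, String name, String content) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");
        Path target = dir.resolve(name);
        // Opening the stream throws FileNotFoundException if dir no longer exists.
        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.getFD().sync(); // force the bytes to disk before the rename
        }
        // Rename the finished temp file into place in one step.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        return target;
    }
}
```

If the directory is deleted between two calls, the second call fails with java.io.FileNotFoundException at the point where the .tmp file is opened, which is the same shape as the error in the title.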

Use an Explicit Log Directory

A good first step is to stop relying on anonymous temporary paths chosen implicitly by the framework. Give each embedded broker a known writable directory:

java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Create a fresh, uniquely named directory for this broker's logs.
Path logDir = Files.createTempDirectory("kafka-it-");

Properties props = new Properties();
// Point the broker at the directory created above.
props.put("log.dirs", logDir.toAbsolutePath().toString());
// Single-broker settings: internal topics cannot be replicated beyond one broker.
props.put("offsets.topic.replication.factor", "1");
props.put("transaction.state.log.replication.factor", "1");
props.put("transaction.state.log.min.isr", "1");

This makes test behavior easier to understand and makes failures easier to reproduce.

It also prevents hidden coupling to shared JVM temp-directory behavior.
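If several tests each need their own broker, the directory creation can be pulled into a small helper. The sketch below is one possible shape, assuming you pick a base directory you control; BrokerLogDirs and newLogDir are hypothetical names, and testName is assumed to be a simple name without path separators.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: one fresh log directory per embedded broker, under a known base. */
final class BrokerLogDirs {

    static Path newLogDir(Path baseDir, String testName) throws IOException {
        // No-op if the base directory already exists.
        Files.createDirectories(baseDir);
        // createTempDirectory appends a random suffix, so two callers never get the same path.
        Path dir = Files.createTempDirectory(baseDir, "kafka-" + testName + "-");
        if (!Files.isWritable(dir)) {
            throw new IOException("Log directory is not writable: " + dir);
        }
        return dir;
    }
}
```

The returned path can then be passed to log.dirs as in the snippet above, and kept in a field so teardown knows exactly which directory to remove.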

Shutdown Must Happen Before Cleanup

The most common direct cause of this exception is deleting the log directory before the broker has fully stopped.

Correct order:

java
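import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;
import org.junit.jupiter.api.AfterEach;

@AfterEach
void tearDown() throws Exception {
    // Stop the broker first and block until shutdown has completed.
    if (broker != null) {
        broker.shutdown();
        broker.awaitShutdown();
    }

    // Only now remove the log directory, deepest paths first.
    if (logDir != null) {
        // try-with-resources closes the directory stream that Files.walk opens
        try (Stream<Path> paths = Files.walk(logDir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(path -> {
                     try {
                         Files.deleteIfExists(path);
                     } catch (IOException ignored) {
                         // best-effort cleanup; a leftover file should not fail the test
                     }
                 });
        }
    }
}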

The key sequence is:

  1. stop the broker
  2. wait for shutdown to complete
  3. delete the files

If step 3 runs before steps 1 and 2 have finished, the exception becomes very likely.
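One way to make that order hard to get wrong is to put it behind a single method. The following is a sketch under stated assumptions: StoppableBroker is a hypothetical stand-in for whatever broker handle your tests hold (it is not a Kafka type), and unlike a best-effort cleanup this variant reports deletion failures instead of swallowing them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical teardown helper that enforces stop, wait, then delete. */
final class BrokerTeardown {

    /** Stand-in for the broker handle used by the test; not a Kafka type. */
    interface StoppableBroker {
        void shutdown();
        void awaitShutdown();
    }

    static void stopThenDelete(StoppableBroker broker, Path logDir) throws IOException {
        if (broker != null) {
            broker.shutdown();
            broker.awaitShutdown();
        }
        // Reached only after awaitShutdown has returned.
        if (logDir != null && Files.exists(logDir)) {
            deleteRecursively(logDir);
        }
    }

    static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.deleteIfExists(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (UncheckedIOException e) {
            throw e.getCause(); // surface the original IOException to the caller
        }
    }
}
```

Because deletion is skipped when shutdown throws, a failed shutdown leaves the directory behind for inspection rather than pulling it out from under a broker that may still be running.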

Parallel Tests Create Random-Looking Failures

If multiple embedded brokers run in parallel and share:

  • temp paths
  • fixed ports
  • the same broker ID

then the resulting failures often look intermittent and unrelated.

Each embedded broker instance should have isolated resources:

  • unique log directory
  • unique port allocation
  • independent teardown

A flaky embedded-Kafka suite is often just a resource-isolation bug disguised as “Kafka instability.”
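For port isolation, a common trick is to ask the operating system for an unused port by binding to port 0. The sketch below uses only the JDK; FreePorts and findFreePort are hypothetical names. It is not fully race-free, since another process could claim the port between the time it is released here and the time the broker binds it, so if your test framework can bind to port 0 itself, that is likely the safer option.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical helper for picking a TCP port that is unused right now. */
final class FreePorts {

    static int findFreePort() throws IOException {
        // Port 0 tells the OS to choose any free ephemeral port.
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } // the socket is closed here, releasing the port for the broker
    }
}
```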

CI Makes the Race More Visible

This problem often appears only in CI because CI environments are:

  • timed differently from a developer machine (sometimes faster, sometimes heavily loaded)
  • more parallel
  • more aggressive about temp cleanup

A test that “always passes locally” may only be succeeding because local timing accidentally gives the broker enough time to finish shutdown before cleanup runs.

One practical stabilization step is to control the JVM temp directory explicitly:

bash
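mkdir -p "$PWD/.tmp"
./gradlew test -Djava.io.tmpdir="$PWD/.tmp"

That moves temp storage into the workspace, where behavior is usually more predictable than a shared system temp directory. Two caveats apply. The directory must exist before the tests start, because the JVM does not create java.io.tmpdir for you. And Gradle runs tests in forked worker JVMs, so a -D flag on the command line sets the property for the build JVM only; to reach the tests, forward it in the build script's test task configuration (for example with systemProperty).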
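Since temp-directory settings are easy to get wrong, it can help to check at test startup what the test JVM actually sees. This is a small sketch using only the JDK; TmpDirCheck and its methods are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical startup check for the JVM temp directory. */
final class TmpDirCheck {

    /** Returns the temp directory this JVM is configured to use. */
    static Path configuredTmpDir() {
        return Paths.get(System.getProperty("java.io.tmpdir")).toAbsolutePath();
    }

    /** True if dir exists, is a directory, and can be written to. */
    static boolean isUsable(Path dir) {
        return Files.isDirectory(dir) && Files.isWritable(dir);
    }

    /** Fails fast with a readable message instead of a later FileNotFoundException. */
    static void requireUsableTmpDir() {
        Path dir = configuredTmpDir();
        if (!isUsable(dir)) {
            throw new IllegalStateException("java.io.tmpdir is missing or not writable: " + dir);
        }
    }
}
```

Calling requireUsableTmpDir() once before any broker starts, and logging configuredTmpDir(), shows quickly whether the CI runner is using the directory you intended.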

Framework Helpers Still Need Care

If you use Spring Kafka test utilities such as EmbeddedKafkaBroker, the same rules still apply. The abstraction helps with startup, but it does not remove the need for:

  • stable writable log directories
  • correct shutdown order
  • isolation between tests

So even when the exception appears inside a helper library, the likely root cause is still your test lifecycle and filesystem setup.

Common Pitfalls

The biggest mistake is deleting the broker log directory before shutdown() and awaitShutdown() have fully finished.

Another issue is letting multiple embedded brokers share temp directories or ports during parallel test execution.

People also assume /tmp behaves the same on local machines and CI runners. It often does not.

Finally, many teams call this random embedded-Kafka flakiness when the failure is actually a deterministic cleanup-order bug.

Summary

  • This exception usually means embedded Kafka lost access to its log directory during shutdown or checkpoint writing.
  • Use explicit writable log directories instead of relying on anonymous temp paths.
  • Always stop the broker completely before deleting any Kafka files.
  • Isolate directories and ports in parallel tests.
  • If the bug appears only in CI, inspect temp-directory behavior and test lifecycle ordering first.
