Hadoop FileNotFoundException when getting a file from DistributedCache

Introduction

A FileNotFoundException when reading a file from the Hadoop DistributedCache usually means that the job setup and the task code disagree about where the cached file appears. The mistake is rarely in the map logic itself. More often it is in the URI used to register the cache file, in the local filename the task expects, or in the assumption that the file can still be opened at its original HDFS path from inside the task container.

The fix is usually to verify the cache registration step, inspect the local symlink or localized filename, and stop assuming the file should be opened from the original remote path.

How DistributedCache Actually Works

When you add a file to the distributed cache, Hadoop localizes it onto the task nodes before the mapper or reducer runs. Inside the task, you normally open the file by its localized name in the task's working directory, not by the original HDFS URI string.

That distinction matters. If you add this file:

java

job.addCacheFile(new URI("hdfs:///user/app/lookup.csv#lookup.csv"));

then inside the task you usually open lookup.csv, not hdfs:///user/app/lookup.csv.
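To make the naming rule concrete, here is a minimal sketch that predicts the local name for a cache URI using only java.net.URI. The class CacheLinkName and the method expectedLocalName are hypothetical helpers, not part of Hadoop, and the fallback to the last path segment when no fragment is present is an assumption about typical Hadoop 2+ behavior that you should confirm on your own cluster.

```java
import java.net.URI;

// Hypothetical helper (not a Hadoop class): predicts the name under which
// a cache file is expected to appear in the task's working directory.
final class CacheLinkName {
    private CacheLinkName() {}

    static String expectedLocalName(URI cacheUri) {
        // If the URI carries a fragment (the part after '#'), that is the alias.
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // Assumption: without a fragment, the last path segment is used.
        String path = cacheUri.getPath();
        if (path == null) {
            throw new IllegalArgumentException("URI has no path: " + cacheUri);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(expectedLocalName(URI.create("hdfs:///user/app/lookup.csv#lookup.csv")));
        System.out.println(expectedLocalName(URI.create("hdfs:///user/app/lookup.csv#codes")));
        System.out.println(expectedLocalName(URI.create("hdfs:///user/app/lookup.csv")));
    }
}
```

If the name this returns differs from the string your mapper passes to FileReader, that mismatch is a likely source of the exception.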

A Correct Access Pattern

A common pattern is to add the file with a fragment name and then open the localized file during setup.

java

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {
    public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Open the localized copy by its link name in the task's working directory,
            // not by the original hdfs:// URI.
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup.csv"))) {
                System.out.println(reader.readLine());
            }
        }
    }

    public static void configure(Job job) throws Exception {
        // The "#lookup.csv" fragment sets the local name that setup() opens.
        job.addCacheFile(new URI("hdfs:///user/app/lookup.csv#lookup.csv"));
        job.setMapperClass(CacheMapper.class);
    }
}

The #lookup.csv fragment gives the localized file a predictable local name.

Common Causes of the Exception

Typical reasons for the failure include:

the HDFS source path is wrong or misspelled
the file was never added to the job configuration
the code tries to open the original HDFS URI as if it were a local file path
the code expects the wrong localized filename
permissions prevent the file from being read or localized

All of these surface as the same "file missing" symptom, so the exception alone does not tell you which cause you are dealing with.
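Because several causes look alike, it can help to classify the path string before opening it. The following is a rough sketch using only the JDK; CachePathCheck, its Result enum, and check are hypothetical names introduced for illustration, and the "://" test is a simple heuristic for spotting a remote URI rather than a complete URI parser.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper (not a Hadoop class).
final class CachePathCheck {
    enum Result { LOOKS_LIKE_REMOTE_URI, MISSING, NOT_READABLE, OK }

    private CachePathCheck() {}

    static Result check(String name) {
        // A scheme separator suggests the code is passing the original
        // remote URI (for example hdfs://...) instead of a local file name.
        if (name.contains("://")) {
            return Result.LOOKS_LIKE_REMOTE_URI;
        }
        Path path = Paths.get(name);
        if (!Files.exists(path)) {
            // Wrong localized name, or the file was never registered or localized.
            return Result.MISSING;
        }
        if (!Files.isReadable(path)) {
            // The file is present but the task user cannot read it.
            return Result.NOT_READABLE;
        }
        return Result.OK;
    }
}
```

Calling check("lookup.csv") at the start of setup and logging the result separates a naming mistake from a permission problem before the reader throws.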

Debugging Checklist

First, verify the HDFS file exists before submitting the job.

bash

hdfs dfs -ls /user/app/lookup.csv

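Then confirm that the cache registration code actually executes before the job is submitted, that it registers the path you just checked, and that it runs against the same Job instance you submit.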

Also log the cache files visible to the task:

java

URI[] cacheFiles = context.getCacheFiles();
if (cacheFiles != null) {
    for (URI uri : cacheFiles) {
        System.out.println("cache file: " + uri);
    }
}

If the task lists the URI but your code still cannot open the file, the remaining problem is usually the localized name you are trying to open.
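One way to find the localized name is to log what is actually present in the task's working directory. The sketch below uses only java.nio.file; WorkDirListing and listNames are hypothetical names, and it assumes the cached file is linked into the current working directory of the task, which is the usual arrangement but worth confirming in your environment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical diagnostic helper (not a Hadoop class).
final class WorkDirListing {
    private WorkDirListing() {}

    // Returns the sorted entry names found directly inside the given directory.
    static List<String> listNames(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .map(entry -> entry.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Inside a task, "." is the working directory of the container.
        for (String name : listNames(Paths.get("."))) {
            System.out.println("working dir entry: " + name);
        }
    }
}
```

Calling listNames(Paths.get(".")) from setup and printing the result to the task log shows whether lookup.csv is present under the name the code expects.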

Be Careful With Old Examples

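Older Hadoop examples often call the legacy DistributedCache class directly, for example DistributedCache.addCacheFile(...) and DistributedCache.getLocalCacheFiles(...). That class is deprecated in Hadoop 2 and later; current code normally registers files with job.addCacheFile(...) and inspects them with context.getCacheFiles(). If you are mixing old and new patterns, keep the registration and access logic consistent, and do not assume the old helper methods are required.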

Common Pitfalls

A common mistake is adding the cache file from HDFS and then trying to open that same HDFS URI as a local file inside setup. Once localized, the task should normally read the local copy.

Another issue is forgetting the fragment alias and then guessing the wrong filename in task code.

Developers also sometimes test with a file that exists locally on one node and assume that proves the cache is configured correctly. It does not; localization has to work on task nodes, not just on the submit host.

Finally, do not ignore permissions. A file that exists but cannot be localized or opened is operationally indistinguishable from a missing file unless you inspect the logs carefully.

Summary

DistributedCache localizes files onto task nodes; task code usually reads the localized file name, not the original HDFS path.
Add cache files with a fragment alias when you want a predictable local filename.
Verify HDFS path, registration code, local filename expectations, and permissions.
Prefer job.addCacheFile(...) style code over stale legacy examples when possible.
Most FileNotFoundException cases come from path/name mismatches, not from the mapper logic itself.