Hadoop FileNotFoundException when getting file from DistributedCache
Introduction
A FileNotFoundException when reading a file from Hadoop DistributedCache usually means the job and the runtime code disagree about where the cached file should appear. The mistake is often not in the map logic itself, but in the URI used when adding the cache file, the local filename the task expects, or the assumption that the file is still at its original HDFS path inside the task container.
The fix is usually to verify the cache registration step, inspect the local symlink or localized filename, and stop assuming the file should be opened from the original remote path.
How DistributedCache Actually Works
When you add a file to the distributed cache, Hadoop localizes it onto task nodes before the mapper or reducer runs. Inside the task, you normally access the localized file name, not the original HDFS URI string.
That distinction matters. If you add hdfs:///user/app/lookup.csv to the cache, then inside the task you usually open lookup.csv, not hdfs:///user/app/lookup.csv.
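The naming rule can be sketched in plain Java. This is a minimal illustration, assuming a YARN-era Hadoop release in which each cache file is linked into the task's working directory under its URI fragment when one is present, and under the last path component otherwise. The class and method names (CacheNames, localNameFor) and the table.csv alias are hypothetical, not part of Hadoop.

```java
import java.net.URI;

// Hypothetical helper, not a Hadoop API: predicts the local name a task
// would use to open a cache file, under the assumption described above.
final class CacheNames {
    private CacheNames() {}

    static String localNameFor(URI cacheUri) {
        // A fragment such as "#table.csv" is treated as an explicit alias.
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        // Otherwise fall back to the last component of the path.
        String path = cacheUri.getPath();
        if (path == null) {
            throw new IllegalArgumentException("URI has no path: " + cacheUri);
        }
        int slash = path.lastIndexOf('/');
        return slash >= 0 ? path.substring(slash + 1) : path;
    }

    public static void main(String[] args) {
        // prints "lookup.csv"
        System.out.println(localNameFor(URI.create("hdfs:///user/app/lookup.csv")));
        // prints "table.csv" (illustrative alias)
        System.out.println(localNameFor(URI.create("hdfs:///user/app/lookup.csv#table.csv")));
    }
}
```

If the name your task code opens does not match what this kind of rule predicts, that mismatch is a likely source of the exception.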
A Correct Access Pattern
A common pattern is to add the file with a fragment name, for example hdfs:///user/app/lookup.csv#lookup.csv, and then open the localized file during setup.
The #lookup.csv fragment gives the localized file a predictable local name.
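A minimal sketch of that pattern follows. To keep it runnable without Hadoop on the classpath, the Hadoop calls (job.addCacheFile(...) in the driver, setup(...) in the mapper) appear only in comments. The LookupTable class, its method names, and the two-column key,value layout of lookup.csv are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Driver side (shown as a comment so this sketch compiles on its own):
//   job.addCacheFile(new URI("hdfs:///user/app/lookup.csv#lookup.csv"));
//
// Task side: call LookupTable.load("lookup.csv") from the mapper's
// setup(Context) method. The name is relative, so it resolves against the
// task's working directory, where the localized file is expected to be linked.
final class LookupTable {
    private LookupTable() {}

    // Opens the localized file by its local name, never by the hdfs:// URI.
    static Map<String, String> load(String localName) {
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localName), StandardCharsets.UTF_8)) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read cache file " + localName, e);
        }
    }

    // Assumed layout: one "key,value" pair per line; blank or malformed lines are skipped.
    static Map<String, String> parse(BufferedReader reader) {
        Map<String, String> table = new HashMap<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    table.put(parts[0].trim(), parts[1].trim());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }
}
```

Loading the table once in setup and keeping it in a field avoids reopening the file for every input record.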
Common Causes of the Exception
Typical reasons for the failure include:
- the HDFS source path is wrong or misspelled
- the file was never added to the job configuration
- the code tries to open the original HDFS URI as if it were a local file path
- the code expects the wrong localized filename
- permissions prevent the file from being read or localized
All of these produce a symptom that looks like “file missing,” even though the underlying cause can be different.
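One way to tell some of these causes apart is to classify the path before opening it. The sketch below is a hypothetical helper (CachePathCheck and its enum are not Hadoop APIs) that separates three cases: a remote URI used as a local path, a name that does not exist in the working directory, and a file that exists but cannot be read. It cannot see whether the HDFS source path or the job registration was wrong; both of those would surface here as a missing local file.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper, not part of Hadoop.
final class CachePathCheck {
    enum Result { REMOTE_URI_USED_AS_LOCAL_PATH, NOT_IN_WORKING_DIR, NOT_READABLE, OK }

    private CachePathCheck() {}

    static Result classify(String nameAsWritten, Path workingDir) {
        // A scheme prefix such as "hdfs:/" means the code points at the
        // original source, not at the localized copy.
        if (nameAsWritten.matches("[A-Za-z][A-Za-z0-9+.-]+:/.*")) {
            return Result.REMOTE_URI_USED_AS_LOCAL_PATH;
        }
        Path candidate = workingDir.resolve(nameAsWritten);
        if (!Files.exists(candidate)) {
            // Wrong local name, or the file was never registered or localized.
            return Result.NOT_IN_WORKING_DIR;
        }
        if (!Files.isReadable(candidate)) {
            // Present but blocked by permissions.
            return Result.NOT_READABLE;
        }
        return Result.OK;
    }
}
```

Logging the result of a call such as classify("lookup.csv", Paths.get(".")) just before opening the file turns a generic "file missing" symptom into a more specific hint.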
Debugging Checklist
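First, verify the HDFS file exists before submitting the job, for example by running hdfs dfs -ls on the source path.
Then verify the cache registration code really runs and uses the expected path.
Also log the cache files visible to the task, such as the URIs returned by context.getCacheFiles() during setup.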
If the job sees the URI but your code still cannot open the file, the remaining issue is usually the localized name you are trying to access.
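To see which local names actually exist, it can help to dump the task's working directory alongside the cache URIs. The following is a sketch using only java.nio.file. WorkingDirDump is a hypothetical helper, and it assumes localized cache files appear as entries (often symlinks) in the working directory, which is the usual behavior on YARN.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical debugging helper, not part of Hadoop.
final class WorkingDirDump {
    private WorkingDirDump() {}

    // Returns one line per directory entry, sorted; symlinks include their target.
    static List<String> describe(Path dir) {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isSymbolicLink(entry)) {
                    lines.add(name + " -> " + Files.readSymbolicLink(entry));
                } else {
                    lines.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot list " + dir, e);
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) {
        // Inside a task, the same call could be made from setup(...) and the
        // lines written to the task log instead of standard output.
        for (String line : describe(Paths.get("."))) {
            System.out.println(line);
        }
    }
}
```

If the expected name is absent from this listing, the problem lies in registration or localization; if it is present, the problem is more likely in how the task code builds the path it opens.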
Be Careful With Old Examples
Older Hadoop examples often use the legacy DistributedCache class directly, with calls such as DistributedCache.addCacheFile(uri, conf) and DistributedCache.getLocalCacheFiles(conf). Modern code more commonly uses job.addCacheFile(...) on the driver side and context.getCacheFiles() inside the task. If you are mixing old and new patterns, keep the access logic consistent and do not assume the old helper methods are required.
Common Pitfalls
A common mistake is adding the cache file from HDFS and then trying to open that same HDFS URI as a local file inside setup. Once localized, the task should normally read the local copy.
Another issue is forgetting the fragment alias and then guessing the wrong filename in task code.
Developers also sometimes test with a file that exists locally on one node and assume that proves the cache is configured correctly. It does not; localization has to work on task nodes, not just on the submit host.
Finally, do not ignore permissions. A file that exists but cannot be localized or opened is operationally indistinguishable from a missing file unless you inspect the logs carefully.
Summary
- DistributedCache localizes files onto task nodes; task code usually reads the localized file name, not the original HDFS path.
- Add cache files with a fragment alias when you want a predictable local filename.
- Verify HDFS path, registration code, local filename expectations, and permissions.
- Prefer job.addCacheFile(...) style code over stale legacy examples when possible.
- Most FileNotFoundException cases come from path/name mismatches, not from the mapper logic itself.

