DistributedCache Hadoop - FileNotFound
Apache Hadoop's DistributedCache is a facility provided by the Hadoop framework to cache files (text, archives, jars and so on) needed by applications. Once a file is placed in the cache, Hadoop copies it to the local disk of each worker node that runs the job's map/reduce tasks, so that tasks can read it locally. This can noticeably improve job performance, since the file is transferred to a node once rather than being fetched over the network by every task.
One common issue that developers encounter with DistributedCache is the "FileNotFoundException". This error may occur during the setup or execution phase of a Hadoop job, indicating that a file specified for inclusion in the cache cannot be found. Troubleshooting this error involves understanding how files are added to and accessed from the DistributedCache.
How DistributedCache Works
When a file is added to the DistributedCache, it is identified by a URI. If the URI scheme is hdfs:, the file or archive is assumed to be on the Hadoop Distributed File System (HDFS). If the URI scheme is file:, it resides on the local filesystem of the node where the job is submitted.
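Because the scheme decides where Hadoop looks for the file, it can help to inspect a cache URI before handing it to the job. The following is a minimal sketch using only the JDK's java.net.URI; the class name CacheUriInspector, its methods, and the sample URIs are hypothetical and not part of Hadoop. It also reads the URI fragment (the part after #), which Hadoop conventionally uses as the symlink name for the cached file.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not a Hadoop class): inspects a cache URI with the JDK only.
class CacheUriInspector {

    // Returns the URI scheme ("hdfs", "file", ...) or "(none)" when the URI has no scheme.
    static String schemeOf(String raw) {
        String scheme = parse(raw).getScheme();
        return scheme == null ? "(none)" : scheme;
    }

    // Returns the fragment after '#' if present, otherwise the last path segment.
    static String linkNameOf(String raw) {
        URI uri = parse(raw);
        if (uri.getFragment() != null) {
            return uri.getFragment();
        }
        String path = uri.getPath();
        if (path == null) {
            return ""; // opaque URI such as "file:foo" has no path component
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Converts the checked URISyntaxException into an unchecked one with a clearer message.
    private static URI parse(String raw) {
        try {
            return new URI(raw);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Malformed cache URI: " + raw, e);
        }
    }

    public static void main(String[] args) {
        // Sample URIs are placeholders for illustration only.
        String raw = "hdfs://namenode:8020/apps/stopwords.txt#stopwords";
        System.out.println(schemeOf(raw) + " -> " + linkNameOf(raw));
    }
}
```

A URI without a scheme is not necessarily wrong, but it is resolved against whatever default filesystem the job's configuration names, so spelling out the scheme removes one source of ambiguity.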
During job initialization on YARN, the NodeManager on each node that will run the job's map/reduce tasks copies (localizes) these resources to that node's local storage. The resources are then symlinked into the working directory of each task, so tasks can open them by a simple, consistent relative name.
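Since the cached resource appears as a symlink in the task's working directory, a task can open it with a plain relative path. The sketch below shows one way this might look, using only the JDK; the class CachedStopwords, its methods, and the stop-word file format (one word per line, # for comments) are assumptions made for illustration. In a real job, a method like loadFromWorkingDir would typically be called from a mapper's or reducer's setup method.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not a Hadoop class): reads a cached file through its symlink.
class CachedStopwords {

    // Parses one word per line, skipping blank lines and lines starting with '#'.
    static Set<String> parse(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        return words;
    }

    // Opens the file by its link name, which is resolved against the current working directory.
    static Set<String> loadFromWorkingDir(String linkName) {
        Path link = Paths.get(linkName);
        if (!Files.isReadable(link)) {
            // Report the absolute path so the task log shows where the lookup actually happened.
            throw new UncheckedIOException(new FileNotFoundException(
                    "Cached file not found or unreadable: " + link.toAbsolutePath()));
        }
        try {
            return parse(Files.readAllLines(link, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Including the absolute path in the error message is a deliberate choice: when a FileNotFoundException appears in a task log, knowing which directory the task searched is often the quickest way to tell a wrong link name from a resource that was never localized.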
Common Causes of FileNotFoundException
- Incorrect URI: If the URI provided to the DistributedCache is not correct, or if it includes typographical errors, the specified resource cannot be located.
- Permission Issues: If the user running the job lacks read permission on the file, Hadoop cannot copy it into the cache.
- File Availability: Sometimes, files might be available at the time of the job submission but are deleted or moved before the actual task execution starts.
- Network Issues: In scenarios where files are stored on a network-mounted file system, network issues could block access to the files at the required moment.
- Misconfiguration: Incorrect environment settings, such as a wrong default filesystem (fs.defaultFS), can cause Hadoop to look for files in the wrong location.
Example Scenario
In a map/reduce job where the task is to analyze textual data stored across several documents, you might place common configuration files or shared libraries into the DistributedCache. Below is an example of how files are added using Java:
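The driver below is a minimal sketch of such a job, assuming Hadoop 2.x or later, where the Job methods addCacheFile, addCacheArchive, and addFileToClassPath replace the deprecated static DistributedCache calls. The class name TextAnalysisDriver, the namenode address, and all HDFS paths are hypothetical placeholders, and the files are assumed to exist on HDFS already. Compiling it requires the Hadoop client libraries on the classpath.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: all names and paths below are placeholders for illustration.
public class TextAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "text-analysis");
        job.setJarByClass(TextAnalysisDriver.class);

        // A plain file; the fragment after '#' becomes the symlink name in the task directory.
        job.addCacheFile(new URI("hdfs://namenode:8020/apps/textanalysis/stopwords.txt#stopwords.txt"));

        // An archive; it is unpacked on each node and linked under the name "dictionaries".
        job.addCacheArchive(new URI("hdfs://namenode:8020/apps/textanalysis/dictionaries.zip#dictionaries"));

        // A shared library jar; it is cached and added to the task classpath.
        job.addFileToClassPath(new Path("/apps/textanalysis/lib/text-utils.jar"));

        // Input and output locations come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper and reducer classes are omitted here, so Hadoop's identity defaults apply.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If any of these URIs points at a file that does not exist or cannot be read, the FileNotFoundException usually surfaces at submission time or when the first task is localized, which is why the paths passed here are the first thing to check.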
Tips to Avoid FileNotFoundException
- Validate File Paths: Always check the correctness of file paths. This includes verifying schemes and overall URI syntax.
- Utilize Logging: Implement robust logging around the file addition phase. Log not just successes, but failures or permissions issues too.
- File Persistence: Ensure required files are not deleted, moved, or modified while the job is running.
- Access Permissions: Prior to job submission, verify that Hadoop has the necessary permissions to access and copy the files.
- Test Locally: Before deploying the code on a cluster, test file loading locally using a standalone Hadoop instance or simulators.
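Several of these tips can be combined into a small pre-submission check. The sketch below is one possible approach using only the JDK: it validates URI syntax for every entry and, for file: URIs, confirms that the file exists and is readable on the submitting machine. The class CachePreflight and its check method are hypothetical. Entries with other schemes, such as hdfs:, are only checked for syntax here; verifying them would need Hadoop's own FileSystem API, which is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop class): collects problems instead of failing on the first one.
class CachePreflight {

    // Returns one human-readable message per problem; an empty list means nothing was flagged.
    static List<String> check(List<String> rawUris) {
        List<String> problems = new ArrayList<>();
        for (String raw : rawUris) {
            URI uri;
            try {
                uri = new URI(raw);
            } catch (URISyntaxException e) {
                problems.add(raw + ": malformed URI (" + e.getReason() + ")");
                continue;
            }
            // Only local files can be verified without Hadoop; other schemes pass through.
            if ("file".equals(uri.getScheme())) {
                if (uri.getPath() == null) {
                    problems.add(raw + ": file URI has no path");
                    continue;
                }
                Path path = Paths.get(uri.getPath());
                if (!Files.exists(path)) {
                    problems.add(raw + ": file does not exist");
                } else if (!Files.isReadable(path)) {
                    problems.add(raw + ": file is not readable by the current user");
                }
            }
        }
        return problems;
    }
}
```

Logging every returned message before submission, rather than stopping at the first failure, gives a complete picture of what needs fixing in a single run.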
Summary of Key Points
| Issue Component | Consideration Point |
| --- | --- |
| URI Integrity | Verify correctness of file path and URI scheme |
| File Accessibility | Check file existence and stability over job's life cycle |
| Permissions | Ensure Hadoop processes have read permission on the cached files |
| Configuration | Confirm no misconfigurations in file-path settings |
| Testing & Logging | Implement detailed logging and local testing strategies |
By understanding the workings and common pitfalls related to DistributedCache, developers can more effectively use this tool to enhance the efficiency and reliability of their Hadoop jobs. Moreover, careful planning, testing, and validation are key to avoiding "FileNotFoundException" and other common issues in file management within the Hadoop ecosystem.

