How does Git store files?



Git is a distributed version control system widely used to track changes in source code during software development. Much of its speed and reliability comes from the way it stores files and records changes, so understanding its storage model explains how it manages versions so efficiently.

Understanding Git's Storage Mechanism

At its core, Git is a content-addressable object store plus a set of named pointers (references such as branches and tags) into that store. Each version of a file's content is saved as an object in the object database, keyed by a hash of that content, so identical content is stored only once and any change is easy to detect.

Key Concepts in Git Storage

Core Git Objects: Git's storage system is built around four core object types:
- Blob: Stores the contents of a file, without its name or permissions.
- Tree: Represents a directory, mapping each entry's name and file mode to the identifier of a blob (a file) or another tree (a subdirectory).
- Commit: Records a snapshot of the project by pointing to one top-level tree, together with its parent commits and metadata (author, committer, timestamps, message).
- Tag: An annotated tag object that attaches a name, tagger, and message to another object, usually a commit.
SHA-1 Hashing: Each object is identified by a SHA-1 hash computed over its type, size, and contents, written as 40 hexadecimal characters. (Newer Git versions can also create repositories that use SHA-256.) This makes Git's objects content-addressable: they are looked up by a hash of their content rather than by file name or location, and identical content always yields the same identifier.
Immutable Data: Once an object is created, it never changes. Any change to the content produces a new object with a new hash, so existing history cannot be altered without its identifiers changing, which protects data integrity.
Object Storage: Objects are stored under the `.git/objects` directory. A newly created ("loose") object is zlib-compressed and written to a subdirectory named after the first two characters of its hash, in a file named after the remaining 38 characters. This avoids placing too many files in a single directory, which helps filesystem performance.
Delta Compression: For efficiency, Git packs objects into packfiles (for example during `git gc` or when transferring data), where similar objects can be stored as deltas against one another. This reduces the space needed for large histories.
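The hashing and storage rules above can be reproduced in a few lines. The following is a minimal sketch, assuming a SHA-1 repository; the function names are made up for illustration and are not part of Git.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID Git assigns to an object with this type and content."""
    # Git hashes a small header ("<type> <size>" plus a NUL byte) followed by the content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Return where a loose object with this ID would live inside the repository."""
    # The first two hex characters name the subdirectory, the remaining 38 the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(obj_type: str, content: bytes) -> bytes:
    """Return the zlib-compressed bytes of a loose object (header plus content)."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return zlib.compress(header + content)


if __name__ == "__main__":
    oid = git_object_id("blob", b"hello world\n")
    print(oid)                     # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    print(loose_object_path(oid))  # .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The printed ID should match what `git hash-object` reports for a file containing `hello world` followed by a newline. Because the ID depends only on the content, two files with identical content share a single blob.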

How Git Represents Files and Changes

When you make changes in your project and commit those changes with Git, the following operations occur:

Blob Creation: When you stage a new or modified file with `git add`, Git creates a blob containing the file's content.
Tree Construction: When you commit, Git builds tree objects from the staged files, one per directory, each listing the blobs and subtrees that directory contains.
Commit Record: Git creates a commit object that references the top-level tree and any parent commits, and stores the commit metadata (author, committer, timestamps, message).
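To make the three steps concrete, here is a rough sketch of how tree and commit objects are serialized before hashing. It assumes a flat directory of regular files (mode `100644`) and a fixed `+0000` timezone; the helper names and the author string are hypothetical, and real Git handles subdirectories, other file modes, and more header fields.

```python
import hashlib
from typing import Dict, List


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over the header ("<type> <size>" plus NUL) followed by the object body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: Dict[str, str]) -> bytes:
    """Serialize {filename: blob_id} as a tree containing only regular files."""
    body = b""
    # Entries are sorted by name; each is "<mode> <name>" + NUL + the raw 20-byte ID.
    for name in sorted(entries, key=lambda n: n.encode()):
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(entries[name])
    return body


def commit_body(tree_id: str, parent_ids: List[str], author: str,
                timestamp: int, message: str) -> bytes:
    """Serialize a commit: one tree line, zero or more parent lines, then metadata."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


if __name__ == "__main__":
    blob_id = object_id("blob", b"hello world\n")
    tree_id = object_id("tree", tree_body({"file1.txt": blob_id}))
    commit = commit_body(tree_id, [], "A U Thor <author@example.com>", 0, "Initial commit")
    print(tree_id)
    print(object_id("commit", commit))
```

Note that a commit never stores a diff: it names one tree, and each tree names the blobs and subtrees that make up the snapshot.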

Example of Commit Process

Assume you have modified `file1.txt` and added `file2.txt` in your repository. After staging and committing, Git will:

Create blobs for the new contents of `file1.txt` and `file2.txt` (content that already exists in the object database is reused rather than stored again)
Create a tree object representing the updated directory
Create a commit object that points to the new tree and to the previous commit as its parent

Because each commit records its parent, you can walk backward from any commit through the project's entire history.
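The example can be simulated with a small in-memory model. This is only a sketch: `ObjectStore` is a hypothetical stand-in for the object database, the commit format is simplified (no author or timestamp), and `notes.txt` is an extra file added here to show what happens to unchanged content.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class ObjectStore:
    """Hypothetical in-memory stand-in for Git's object database."""

    def __init__(self) -> None:
        self.objects: Dict[str, Tuple[str, bytes]] = {}

    def put(self, obj_type: str, body: bytes) -> str:
        header = f"{obj_type} {len(body)}".encode() + b"\x00"
        oid = hashlib.sha1(header + body).hexdigest()
        self.objects[oid] = (obj_type, body)  # same content, same key: stored once
        return oid

    def commit(self, files: Dict[str, bytes], parent: Optional[str], message: str) -> str:
        """Store blobs, one flat tree, and a simplified commit; return the commit ID."""
        blob_ids = {name: self.put("blob", data) for name, data in files.items()}
        tree = b"".join(
            b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob_ids[name])
            for name in sorted(blob_ids)
        )
        lines = [f"tree {self.put('tree', tree)}"]
        if parent:
            lines.append(f"parent {parent}")
        lines += ["", message]
        return self.put("commit", "\n".join(lines).encode())

    def history(self, commit_id: Optional[str]) -> List[str]:
        """Follow first-parent links backward from the given commit."""
        chain = []
        while commit_id:
            chain.append(commit_id)
            _, body = self.objects[commit_id]
            header = body.decode().split("\n\n", 1)[0]
            parents = [ln.split()[1] for ln in header.splitlines() if ln.startswith("parent ")]
            commit_id = parents[0] if parents else None
        return chain


if __name__ == "__main__":
    store = ObjectStore()
    first = store.commit({"file1.txt": b"old\n", "notes.txt": b"unchanged\n"}, None, "first")
    second = store.commit(
        {"file1.txt": b"new\n", "file2.txt": b"added\n", "notes.txt": b"unchanged\n"},
        first, "second")
    blobs = [oid for oid, (kind, _) in store.objects.items() if kind == "blob"]
    print(len(blobs))                                # 4: notes.txt is stored only once
    print(store.history(second) == [second, first])  # True
```

Two commits cover five file versions in total, yet only four blobs exist, because the unchanged file maps to the same object in both snapshots.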

Git Repositories and Branches

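A Git branch is simply a named pointer to a commit object, which makes branches cheap to create and lets several lines of development coexist. Committing on a branch moves its pointer to the new commit. Merging two diverged branches creates a merge commit with more than one parent; the existing commits, and any trees and blobs that did not change, are shared rather than copied.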
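Because a branch is only a pointer, resolving one comes down to reading a commit ID from disk. The sketch below assumes the traditional file-based reference storage, where a branch is either a file under `refs/heads/` or a line in `packed-refs`. The function name is hypothetical, and symbolic refs and newer reference backends are not handled.

```python
from pathlib import Path
from typing import Optional


def resolve_branch(git_dir: str, branch: str) -> Optional[str]:
    """Return the commit ID a local branch points to, or None if it is not found."""
    ref_name = f"refs/heads/{branch}"

    # A loose ref is a small text file whose content is the commit ID.
    loose_ref = Path(git_dir) / ref_name
    if loose_ref.is_file():
        return loose_ref.read_text().strip()

    # Otherwise the ref may have been packed into a single "packed-refs" file,
    # with one "<commit-id> <ref-name>" pair per line.
    packed = Path(git_dir) / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            if line.startswith("#") or line.startswith("^"):
                continue  # skip the header comment and peeled-tag lines
            oid, _, name = line.partition(" ")
            if name == ref_name:
                return oid
    return None


if __name__ == "__main__":
    # Run from the top level of a repository; "main" is just an example branch name.
    print(resolve_branch(".git", "main"))
```

Creating a branch therefore costs one small file, regardless of how large the project is.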

Summary Table

Concept	Description
Blob	Stores file data as a content-addressable object
Tree	Represents a directory, maps names to blobs and subtrees
Commit	Stores commit metadata, points to one top-level tree and to parent commits
Tag	Annotated label pointing to another object, usually a commit
SHA-1 Hash	Unique identifier for each Git object, based on content
Immutable Objects	Once created, object contents do not change
Object Storage	Stored under `.git/objects`, organized by SHA-1 hash
Delta Compression	Stores differences between related objects to save space, applied during packing operations
Branches	Pointers to specific commits, allowing multiple parallel development lines

Conclusion

Git's storage model provides data integrity, efficient use of space, and consistent tracking of file versions. By combining content hashing with a small set of immutable objects (blobs, trees, commits, and tags) and lightweight references, Git manages code changes reliably in a distributed setting. Understanding these mechanisms makes everyday Git behavior easier to reason about and shows why its design is so efficient.