My Solution for Design a Code Deployment System with Score: 8/10

by john_chen

System requirements


Functional Requirements:

  • Code Building:
      • The system should allow users to initiate build requests using commit SHA identifiers.
      • It should fetch the source code snapshot based on the provided commit SHA.
      • The system must compile the source code into a binary file, with each build taking up to 15 minutes.
      • The resulting binary file can be up to 10GB in size.
      • The system should provide a clear end state for each build (SUCCESS or FAILURE).
  • Code Deployment:
      • The system should deploy the built binary to hundreds of thousands of target machines across 5-10 global regions.
      • The end-to-end process, including building and deploying, should take a maximum of 30 minutes.
      • The system should track and report the deployment status (SUCCESS or FAILURE) for each target machine.
  • Scalability:
      • The system must handle thousands of build and deployment requests per day.
      • It should efficiently manage deployments to hundreds of thousands of machines.
  • User Interaction:
      • Users should be able to initiate build and deployment requests easily.
      • Users should receive status updates for their requests, including any errors or issues encountered.


Non-Functional Requirements:

  • Performance:
      • The build process should complete within 15 minutes.
      • The entire build-and-deploy process should take a maximum of 30 minutes.
      • The system should handle high volumes of requests without significant delays.
  • Reliability:
      • The system should have an availability target of 2 to 3 nines (99% to 99.9% uptime).
      • The system should handle failures gracefully and recover without losing data or state.
  • Scalability:
      • The system should scale horizontally to handle increasing numbers of build and deployment requests.
      • It should scale storage and processing capacity as needed.
  • Security:
      • The system should ensure secure access to the source code, build environments, and deployment targets.
      • It should implement appropriate authentication and authorization mechanisms for users.
  • Maintainability:
      • The system should be designed with modular components to allow easy maintenance and updates.
      • It should provide clear logging and monitoring to track system performance and issues.
  • Data Integrity:
      • The system should ensure the integrity of the source code, build artifacts, and deployment packages.
      • It should implement checks to prevent data corruption or loss.



Capacity estimation


Build System - Scale Estimation

To ensure our build system can handle the expected load, we need to estimate the number of workers required based on build duration and daily build volume.

Worker Capacity:

  • Each build takes up to 15 minutes.
  • A single worker can complete approximately 4 builds per hour (60 minutes / 15 minutes per build).
  • This results in roughly 96 builds per day per worker (4 builds per hour * 24 hours).

Daily Build Volume:

  • We expect thousands of builds per day, say between 5,000 and 10,000 builds.

Worker Estimation:

  • To handle 5,000 builds per day: 5,000 builds / 96 builds per worker ≈ 52 workers
  • To handle 10,000 builds per day: 10,000 builds / 96 builds per worker ≈ 104 workers

Thus, we would need between 50 and 100 workers to handle the expected load, which supports our working assumption of roughly 100 workers.
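As a quick sanity check, the same arithmetic can be expressed as a short script. The inputs below are just the assumptions stated above (15-minute builds, 5,000-10,000 builds per day), not measured values.

BUILD_MINUTES = 15
builds_per_worker_per_day = (60 // BUILD_MINUTES) * 24   # 4 builds/hour * 24 hours = 96

for daily_builds in (5000, 10000):
    workers = daily_builds / builds_per_worker_per_day
    print(f"{daily_builds} builds/day -> ~{round(workers)} workers")   # prints ~52 and ~104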

Scalability:

  • Horizontal Scaling: Our system can easily scale horizontally by adding or removing workers based on load demands. This is particularly useful for handling peak times, such as during work hours.
  • Vertical Scaling: We can also scale vertically by increasing the computational power of our workers, reducing build times and thereby increasing the number of builds each worker can handle per day.


Storage


Build System - Storage

In our design, the storage of binaries plays a critical role in ensuring that build artifacts are available for deployment across our global regions. Here's how storage fits into our queueing system:

  1. Storing Binaries (a sketch of this step follows the list below):
      • When a worker completes a build, it stores the resulting binary in Google Cloud Storage (GCS).
      • The worker then updates the relevant row in the jobs table to mark the job as SUCCEEDED.
      • This ordering ensures that the binary has been persisted before the job is marked as complete.
  2. Regional Storage:
      • To optimize deployment times and reliability, we will use regional storage rather than a single global blob store.
      • We will design our system around regional clusters in our 5-10 global regions. Each region will have its own blob store (a regional GCS bucket).
  3. Asynchronous Replication:
      • Once a worker successfully stores a binary in the primary blob store, it is released to handle another job.
      • The primary blob store then asynchronously replicates the binary to all regional GCS buckets.
      • This replication step ensures that binaries are quickly available in all regions, minimizing deployment latency.
  4. Timing Considerations:
      • Given that we have 5-10 regions and each binary can be up to 10GB in size, replication should take no more than 5-10 minutes (for example, at a sustained cross-region transfer rate of roughly 300 Mbps, a 10GB copy completes in about 4-5 minutes).
      • This brings our total build-and-deploy duration to approximately 20-25 minutes (15 minutes for building and 5-10 minutes for global replication).
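As a rough sketch of the store-then-mark ordering in item 1, a build worker's final step might look like the following (Python). The bucket name, table name, and connection string are illustrative assumptions, not part of the design above.

from google.cloud import storage
import psycopg2

def finish_build(job_id, local_binary_path, binary_name):
    # Persist the binary to the primary GCS bucket first...
    bucket = storage.Client().bucket("build-artifacts-primary")   # assumed bucket name
    bucket.blob(binary_name).upload_from_filename(local_binary_path)

    # ...and only then mark the job as SUCCEEDED, so a marked job always has a stored binary.
    conn = psycopg2.connect("dbname=builds")   # assumed connection string
    with conn, conn.cursor() as cur:
        cur.execute("UPDATE jobs_table SET status = 'SUCCEEDED' WHERE id = %s", (job_id,))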



High-level design


To effectively design our system, we need to break it down into major components. Our system can be divided into two clear subsystems:


  • Build System: Responsible for building code into binaries.
  • Deployment System: Responsible for deploying binaries to machines across the world.


These subsystems will each contain multiple components, but this division provides a straightforward initial approach.



Detailed component design


Build System - General Overview

At a high level, the process of building code into a binary can be referred to as a job. The build system will be designed as a queue of jobs. Each job includes a commit identifier (commit SHA) for the code version to build and the name of the resulting binary artifact. The specifics of the code-building process are abstracted, allowing us to focus on system infrastructure.

A pool of servers (workers) will handle these jobs. Workers will dequeue jobs in a FIFO manner, build the binaries, and store the resulting binaries in blob storage (e.g., Google Cloud Storage or Amazon S3).


Build System - Job Queue

An in-memory implementation of the job queue is problematic because server failures would result in the loss of all queued and past jobs. Instead, we will implement the queue using a SQL database.


Build System - SQL Job Queue

The jobs table in our SQL database will have the following schema:

  • id (string): The job ID, auto-generated.
  • created_at (timestamp): The creation time of the job.
  • commit_sha (string): The commit SHA for the code version.
  • name (string): A pointer to the job's eventual binary in blob storage.
  • status (string): The job status (QUEUED, RUNNING, SUCCEEDED, FAILED).

We will implement the dequeuing mechanism by querying the oldest job with a QUEUED status. This requires a composite index on status and created_at, as in the sketch below.
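A minimal setup sketch of this table and index (PostgreSQL via psycopg2; the column types and connection string are assumptions):

import psycopg2

conn = psycopg2.connect("dbname=builds")   # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS jobs_table (
            id             TEXT PRIMARY KEY,                  -- auto-generated job ID
            created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
            commit_sha     TEXT NOT NULL,
            name           TEXT NOT NULL,                     -- binary name in blob storage
            status         TEXT NOT NULL DEFAULT 'QUEUED',    -- QUEUED / RUNNING / SUCCEEDED / FAILED
            last_heartbeat TIMESTAMPTZ                        -- used below for lost-job detection
        )
    """)
    # Composite index so "oldest QUEUED job" lookups stay fast as the table grows.
    cur.execute("CREATE INDEX IF NOT EXISTS jobs_status_created_idx ON jobs_table (status, created_at)")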


Build System - Concurrency

Wrapping the dequeue in a transaction and locking the selected row (for example, with SELECT ... FOR UPDATE SKIP LOCKED in PostgreSQL) ensures that multiple workers can safely dequeue jobs without running the same job twice. The transaction for dequeuing a job will be:

BEGIN TRANSACTION;

SELECT * FROM jobs_table
WHERE status = 'QUEUED'
ORDER BY created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- If no job is found, ROLLBACK and retry later.

UPDATE jobs_table
SET status = 'RUNNING'
WHERE id = <id returned by the SELECT above>;

COMMIT;


Workers will run this transaction periodically (e.g., every 5 seconds). With 100 workers, we would have 20 reads per second, easily manageable by a SQL database.
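A worker's polling loop might look roughly like this sketch. The dequeue query mirrors the transaction above, and build_and_upload is a hypothetical placeholder for the actual build-and-store step.

import time
import psycopg2

POLL_INTERVAL_SECONDS = 5

def dequeue_job(conn):
    # One atomic claim attempt; SKIP LOCKED prevents two workers from grabbing the same row.
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT id, commit_sha, name FROM jobs_table
            WHERE status = 'QUEUED'
            ORDER BY created_at ASC
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        """)
        row = cur.fetchone()
        if row is None:
            return None
        cur.execute("UPDATE jobs_table SET status = 'RUNNING' WHERE id = %s", (row[0],))
        return row

def worker_loop(conn):
    while True:
        job = dequeue_job(conn)
        if job is None:
            time.sleep(POLL_INTERVAL_SECONDS)   # nothing queued; try again in ~5 seconds
            continue
        job_id, commit_sha, binary_name = job
        build_and_upload(job_id, commit_sha, binary_name)   # hypothetical: compile, upload, mark SUCCEEDED/FAILED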


Build System - Handling Lost Jobs

To handle potential lost jobs due to worker failures or network issues, we add a last_heartbeat column to the jobs table. Workers update this column every 3-5 minutes to indicate the job is still running.

A separate service will poll the table periodically (e.g., every 5 minutes). If a RUNNING job's last_heartbeat is older than a threshold (e.g., 10 minutes), the service will reset the job's status to QUEUED:

UPDATE jobs_table
SET status = 'QUEUED'
WHERE status = 'RUNNING'
  AND last_heartbeat < NOW() - INTERVAL '10 minutes';
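On the worker side, the heartbeat described above can be kept fresh by a small background thread while the build runs. A rough sketch follows; the interval and helper structure are assumptions.

import threading
import psycopg2

HEARTBEAT_INTERVAL_SECONDS = 180   # refresh roughly every 3 minutes

def start_heartbeat(conn, job_id):
    # Returns an Event; call .set() on it once the build finishes to stop the thread.
    stop = threading.Event()

    def beat():
        while not stop.is_set():
            with conn, conn.cursor() as cur:
                cur.execute("UPDATE jobs_table SET last_heartbeat = NOW() WHERE id = %s", (job_id,))
            stop.wait(HEARTBEAT_INTERVAL_SECONDS)

    threading.Thread(target=beat, daemon=True).start()
    return stop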


Deployment System

The deployment system is responsible for distributing the built binaries to the target machines across various regions. It includes the following components:

  • Deployment Queue: A queue of deployment tasks, similar to the build job queue.
  • Deployment Workers: Servers that take deployment tasks from the queue and deploy binaries to target machines.
  • Monitoring and Logging: Components to track deployment status and handle retries or failures.


General Overview:

Our deployment system needs to enable the rapid distribution of 10GB binaries to hundreds of thousands of machines across various global regions. The system will include:

  1. A service to track the replication status of binaries.
  2. A source-of-truth service to determine which binary should be run on all machines.
  3. A peer-to-peer (P2P) network design to facilitate fast distribution among machines.


Deployment System Components

1. Replication-Status Service:

  • Function: Continuously checks regional GCS buckets to verify that binaries have been replicated across all regions.
  • Operation: Once replication is confirmed, it updates a SQL database with the binary's name and replication status.
  • Outcome: When a binary’s replication_status is "complete," it is officially deployable (a sketch of this checker follows below).
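A rough sketch of such a checker, assuming one GCS bucket per region and a binaries table with a replication_status column; all names are illustrative.

from google.cloud import storage
import psycopg2

REGIONAL_BUCKETS = ["builds-us-east1", "builds-europe-west1", "builds-asia-east1"]   # assumed, one per region

def check_replication(binary_name):
    client = storage.Client()
    # The binary is deployable only once every regional bucket holds a copy.
    fully_replicated = all(client.bucket(b).blob(binary_name).exists() for b in REGIONAL_BUCKETS)
    if fully_replicated:
        conn = psycopg2.connect("dbname=deployments")   # assumed connection string
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE binaries SET replication_status = 'complete' WHERE name = %s", (binary_name,))
    return fully_replicated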

2. Blob Distribution:

  • Challenge: Deploying 10GB binaries to numerous machines can be slow if each machine downloads the binary individually.
  • Solution: Utilize a peer-to-peer network within each regional cluster to expedite the distribution process (a simplified sketch follows below).
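The P2P protocol itself is out of scope here, but the core idea (machines prefer pulling the binary from nearby peers and only fall back to the regional bucket) can be sketched roughly as follows. The peer list, port, URL layout, and bucket name are illustrative assumptions.

import requests
from google.cloud import storage

PEERS = ["10.0.1.12", "10.0.1.13", "10.0.1.14"]   # assumed addresses of peers in the same cluster

def fetch_binary(binary_name, dest_path):
    # Prefer a peer that already has the binary; this keeps most traffic inside the cluster.
    for peer in PEERS:
        try:
            resp = requests.get(f"http://{peer}:8080/binaries/{binary_name}", stream=True, timeout=10)
            if resp.status_code == 200:
                with open(dest_path, "wb") as f:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
                return dest_path
        except requests.RequestException:
            continue
    # Fall back to the regional GCS bucket if no peer can serve the file.
    bucket = storage.Client().bucket("builds-us-east1")   # assumed regional bucket name
    bucket.blob(binary_name).download_to_filename(dest_path)
    return dest_path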

3. Deployment Trigger:

  • Triggering Event: An engineer initiates deployment by pressing a button on an internal UI (e.g., "Deploy build/binary B1 globally").
  • Goal-State Oriented Design: Maintain desired build versions in a key-value (K-V) store for global and regional goal states.
      • Global K-V Store: Stores the current global build version.
      • Regional K-V Stores: Poll the global K-V store for updates and update their respective build versions.
      • Machines in Clusters: Poll the regional K-V stores for the latest build version, fetch it from the P2P network, and run the binary (a sketch of this polling loop follows the list).
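A sketch of the machine-side agent implied by the last bullet: it polls its regional K-V store for the goal-state build version and swaps in the new binary when the goal changes. The endpoint URL, response shape, and the fetch_from_peers / run_binary helpers are hypothetical.

import time
import requests

REGIONAL_KV_URL = "http://kv.region.internal/current_build"   # hypothetical regional K-V endpoint
POLL_SECONDS = 30

def agent_loop():
    running_version = None
    while True:
        goal_version = requests.get(REGIONAL_KV_URL, timeout=5).json()["build"]   # assumed response shape
        if goal_version != running_version:
            # Pull the (up to 10GB) binary from peers in the same cluster, then swap it in.
            binary_path = fetch_from_peers(goal_version)   # hypothetical P2P fetch helper
            run_binary(binary_path)                        # hypothetical process-swap helper
            running_version = goal_version
        time.sleep(POLL_SECONDS)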


This design ensures a robust and efficient deployment system capable of handling large-scale binary distributions globally. By leveraging regional clusters, peer-to-peer networks, and a goal-state oriented approach, we achieve high performance, scalability, and reliability.