Design A Task Scheduler - System Design

System requirements

Functional:

Users can schedule both one-time and recurring tasks.
Users can define and execute tasks using a Directed Acyclic Graph (DAG).

Non-Functional:

The system must ensure reliable task execution with fault tolerance.
Task execution should have low response time to minimize delays.
The system must be scalable, supporting thousands of concurrent task executions per second.

Capacity estimation

Storage for Task Queue

Scheduled jobs must be queued before execution, requiring temporary storage.

Assume 10K jobs per second are scheduled

Each job is stored in the queue for ~1 second before execution

Each scheduled task entry takes ~1 KB (including timestamps, metadata, and execution target)

Storage for Queued Tasks

$10,000 \text{ jobs/sec} \times 1 \text{ sec} \times 1 \text{ KB/job} = 10 \text{ MB (queued tasks at any time)}$ With replication (3x for reliability), the total storage needed at any time is ~30 MB.

Job Execution Storage

Each job runs for 5 minutes, meaning at any given time, the system must store jobs that are still executing.

Jobs running concurrently: $10,000 \text{ jobs/sec} \times 300 \text{ sec (5 min)} = 3 \text{ million active jobs}$

Storage required for active jobs (excluding logs): $3M \times 1 \text{ KB/job} = 3 \text{ GB (metadata)}$

Storage required for active jobs (including logs): $3M \times 11 \text{ KB/job} = 33 \text{ GB}$

At any given time, active job storage needs 3 GB (metadata) and 33 GB (including logs).

Daily Storage for Completed Jobs

Once jobs complete, they are stored in long-term storage for historical tracking and auditing.

Jobs per day: $10,000 \text{ JPS} \times 86,400 \text{ sec/day} = 864 \text{ million jobs/day}$

Daily storage required (including logs): $864M \times 11 \text{ KB} = 9.5 \text{ TB/day}$

Daily storage required (metadata only): $864M \times 1 \text{ KB} = 864 \text{ GB/day}$

API design

POST /tasks - Schedules a single job with optional execution constraints

POST /tasks/recurring - Creates a task that runs on a cron-like schedule.

GET /tasks/{task_id} - Get task status

DELETE /tasks/{task_id} - Cancel a task

POST /workflows - Defines a DAG (Directed Acyclic Graph) where tasks execute based on dependencies.

GET /workflows/{workflow_id}/status - Retrieves the status of an ongoing workflow.

Database design

Redis

Given our storage estimation for active tasks, we can use Redis for fast task queuing.

PostgreSQL

PostgreSQL is used to store task definitions, scheduling metadata, and dependencies.

Tasks Table
- Stores all scheduled tasks, including execution timestamps, status, and recurrence settings.
- Provides historical consistency, ensuring that tasks persist even if Redis restarts.
- Tasks are inserted into Redis when they are due for worker polling.

Recurring Tasks Table
- Tracks periodic tasks separately, ensuring that new instances are generated on schedule. This denormalization allows us to query more efficiently.
- Uses a background process that scans upcoming executions and pushes them to Redis.

DAG Dependencies Table
- Defines task execution order for workflows that require dependency resolution.
- When a task completes, its dependent tasks are pushed into Redis only when all prerequisites are met.

Worker Table
- Stores worker information for historical tracking and resource usage analysis.
- Redis keeps live worker statuses, while PostgreSQL tracks long-term worker performance.

Elasticsearch

Since historical task execution data does not require strong consistency but benefits from fast retrieval and aggregations, it is stored in Elasticsearch. The system logs task execution time, worker ID, and output details in Elasticsearch, allowing efficient queries without impacting the primary relational database.

High-level design

Many orchestration systems follow a master-worker architecture, where a master node schedules tasks and worker nodes execute them. However, different systems implement variations of this model to optimize for scalability, workload distribution, and efficiency.

Several well-known distributed systems exemplify this approach:

SLURM (Simple Linux Utility for Resource Management) follows a push-based model, where a central controller assigns jobs to compute nodes based on CPU and memory availability.
Kubernetes schedules Pods using its API server, but workers (Kubelets) pull tasks, making it a hybrid push-pull system.
Apache Kafka employs a controller node to manage partition leadership and broker coordination but does not actively assign message processing.
RabbitMQ relies on a pure pull model, where consumers fetch messages from queues when ready.

Each system employs different scheduling policies based on its use case—whether least-loaded selection, round-robin scheduling, or leader election mechanisms.

When designing a task scheduler, two critical considerations are:

Latency Requirements – A real-time scheduler should minimize communication overhead, avoid unnecessary API calls. However this would possibly mean tighter coupling and potentially less scalability in terms of distribution. It would be important to consider what would be acceptable in order to find an acceptable middle ground.
Task Size Variability – If tasks are relatively small in size, a simple load distribution model may suffice (eg, worker pulling from queue). If task sizes vary significantly, dynamic load balancing might be necessary to prevent resource bottlenecks. This would mean pushing the task to the most suitable worker.

In our design we will start with a basic master-worker architecture and then explore different trade offs and approaches below.

Request flows

Client Request Submission

The client submits a job scheduling request through the API server, specifying the job type, execution time, and any dependencies.

API Server (BFF - Backend for Frontend)

The API server acts as the gateway between the client and the scheduling system. It validates the request, checks permissions, and writes the job details to the PostgreSQL scheduling database for persistence. The API server then forwards the request to the Master Server for processing.

Master Server (Scheduler Controller)

The Master Server is responsible for scheduling and managing job execution. It:

Reads the request from the database and determines when the job should run.
If the job is scheduled in the future, it stores it into PostgresSQL Recurring Tasks Table.
If the job should run immediately, it is pushed directly into Redis queues for worker consumption.
If the job is part of a DAG, the Master Server ensures all dependencies are satisfied before execution.
Writes task status updates back to PostgreSQL for tracking and consistency.

Worker Server (Task Executor)

Workers poll Redis for available tasks using BZPOP (blocking pop) to minimize CPU usage and execute tasks as soon as they are enqueued.
Upon picking up a task, the worker updates PostgreSQL to mark the task as running.
Once execution is complete, the worker updates the status to completed or failed and stores execution logs in Elasticsearch for historical tracking.
If the task is part of a DAG, the worker notifies Redis or the Master Server to trigger dependent tasks.

Failure Handling & Retries

If a worker crashes or fails to complete a task, the Master Server detects this via worker heartbeats stored in Redis.
Failed tasks are automatically re-enqueued into a retry queue with exponential backoff.
If a task fails repeatedly, the system triggers an alert and marks it as permanently failed in PostgreSQL.

Detailed component design

Master Server

Single Master via Push

One key design decision for the master server is determining whether tasks should be pushed to workers or pulled from a queue.

In a push-based model, the master assigns tasks directly to workers, optimizing for CPU and memory constraints. This allows load balancing across workers and offers lower latency than pull-based scheduling since workers do not need to poll the queue.

A simple approach is to queue the tasks in the master’s local memory (while also storing them in a persistent DB in case of failure), with a single active master handling scheduling while standby instances monitor and take over in case of failure. Many open-source systems use this approach as it is straightforward and meets most task scheduling requirements.

Alternatively, tasks can be stored in an external memory-based queue like Redis instead of being held in the master’s memory. In either case, the master runs a dedicated scheduling thread that pulls tasks from the queue, checks worker availability, and assigns tasks using an internal load-balancing algorithm.

Push-based scheduling ensures low-latency task assignment and allows the master to control execution order and worker selection. However, as the number of workers scales into the thousands, load balancing can become a challenge. Despite this, push-based scheduling is preferred for workloads requiring strict task sequencing and controlled resource allocation.

To track worker availability, Zookeeper maintains a list of active workers. When a worker joins the cluster, it registers an ephemeral node in Zookeeper (e.g., /workers/). If a worker fails or disconnects, its node is automatically removed, signalling the master to exclude it from scheduling.

Single Master via Push Failure Handling

Since the master is responsible for assigning tasks to workers, its failure could disrupt scheduling unless there is a mechanism to detect the failure and elect a new master. ZooKeeper provides leader election, failure detection, and worker tracking to ensure system reliability.

The master registers itself with ZooKeeper upon startup by creating an ephemeral znode under a predefined path, such as /master. An ephemeral znode is automatically deleted if the process that created it crashes or loses connection to ZooKeeper. Other standby master nodes watch this znode for changes. When the active master fails, its ephemeral znode disappears, and ZooKeeper triggers an event notifying the standby nodes that the leader is gone. The standby masters then compete to create a new ephemeral znode under /master, and the one that successfully creates it becomes the new active master.

The new master must ensure that it resumes scheduling without duplicate task execution. Since tasks were being actively assigned by the previous master, there may be some tasks that were assigned but not yet completed. To prevent reassigning tasks that are still in progress, the new master can check the worker status in ZooKeeper to see if any workers are still processing tasks. If workers fail along with the master, any incomplete tasks must be re-queued to avoid being lost.

Pull Based Approach

A push-based approach is effective for handling moderate throughput with low latency and optimal task assignment. The master actively assigns tasks to workers based on resource availability, ensuring efficient execution. However, as throughput increases to levels such as 10K+ jobs per second, a single master becomes a bottleneck since managing load balancing across thousands of workers increases scheduling overhead, making a push-only approach inefficient.

A pull-based approach addresses scalability issues by allowing workers to fetch tasks dynamically, reducing the master’s responsibility for direct load balancing. This improves scalability but results in less optimal task assignments since workers select tasks independently. When tasks are similar in size, this approach is sufficient because it does not matter which worker picks up a task. However, if task sizes vary significantly, the system can become inefficient. Workers lack visibility into task complexity before pulling, which may lead to large tasks being assigned to underpowered workers. Some workers may remain idle while others become overloaded, resulting in resource imbalance and task starvation.

To improve task assignments in a pull-based system, workers can query a metadata store such as Redis before pulling a task. By filtering tasks based on their available resources, workers avoid pulling jobs they cannot handle. This approach reduces inefficiencies but introduces additional metadata management overhead. Despite improvements, worker starvation can still occur if certain workers never meet the resource requirements for available tasks.

A push-based approach is more suitable for low-latency scheduling when the master has sufficient capacity to manage assignments. A pull-based approach scales better for high-throughput workloads but can may require additional overhead for optimal task distribution.

Scheduling and Recurring Tasks

Handling scheduled and recurring tasks in a stateless, pull-based task scheduler relies on maintaining a persistent schedule store, periodically scanning for tasks that need execution, and enqueuing them at the right time. The master does not assign tasks directly but instead writes tasks to a queue, allowing workers to pull them dynamically. This approach ensures scalability while eliminating the need for centralized tracking of execution status.

PostgreSQL stores task schedules, including information about execution frequency and the next scheduled run time. A background process of the master continuously queries this schedule store, identifying tasks that are due and enqueuing them in the task queue for workers to execute. Each task that is enqueued is either removed from the schedule store if it is a one-time job or updated with its next execution time if it is a recurring job.

The worker processes do not distinguish between scheduled tasks and normal tasks since they pull from the same queue. Once a worker completes execution, it does not need to notify the master because the scheduler is only responsible for enqueuing tasks, not tracking their completion. The queue system itself ensures that only one worker processes each task, preventing duplicate execution without requiring locks or coordination between masters.

DAG

With a pull based approach the master does not assign tasks to workers directly; instead, it stores DAG metadata in a dependency tracker such as Redis. When a new DAG is submitted, the master breaks it into tasks and determines dependencies. Tasks with no dependencies are immediately pushed to a queue, while dependent tasks are stored in Redis with a reference to their prerequisites.

Workers continuously listen to the queue and pull tasks that are ready for execution. Once a worker completes a task, it updates the dependency tracker. Each dependent task has a counter tracking the number of remaining prerequisites. When this count reaches zero, the dependent task is pushed to the queue, allowing the next stage of execution to proceed without requiring master intervention. This ensures that tasks execute as soon as they are ready while keeping scheduling overhead minimal.

This approach prevents the master from becoming a bottleneck since it only writes task data to Redis, which can handle over a million writes per second. Workers self-assign tasks, eliminating the need for a centralized scheduler to track worker availability. As a result, execution scales naturally by adding more workers. If the master fails, the system continues operating since all execution logic resides in the queue and dependency tracker. A backup master can restart and resume DAG tracking without affecting ongoing execution.

By limiting the master’s function to dependency tracking and queue management, this design ensures high-throughput DAG execution while maintaining simplicity. The system remains efficient because tasks are scheduled dynamically without polling, and execution is event-driven rather than master-controlled.

Worker Server

The worker server is responsible for pulling tasks from the queue, executing them, updating dependency tracking, and ensuring dependent tasks are triggered when ready. Unlike traditional task scheduling where a master assigns tasks, the worker self-assigns tasks by continuously listening to a distributed queue. When a new task is available, the worker pulls it from the queue and begins execution. The execution method depends on the workload; the worker may run a local script, execute a shell command, launch a container, or submit a distributed job to a framework like Spark or Flink.

Once execution completes, the worker updates the system by notifying the dependency tracker. The task’s completion decrements the dependency count of any downstream tasks, and if all dependencies for a task are resolved, it is pushed into the queue, allowing the next stage of execution to begin. Workers dynamically adjust their workload based on available resources, pulling only tasks they have the capacity to execute. Monitoring tools track CPU and memory usage, preventing workers from overloading themselves.

The decentralized nature of this approach allows the system to scale efficiently. The master is only responsible for maintaining DAG structure and tracking dependencies, while workers fully handle execution and task coordination. This design eliminates centralized scheduling bottlenecks, ensuring that tasks are processed as soon as their dependencies are resolved. Workers operate autonomously, continuously pulling tasks, executing them, and triggering the next stage of the workflow without requiring direct intervention from the master.

Trade offs/Tech choices

Hybrid Push and Pull Approach

A hybrid model requires dual scheduling logic, with separate pathways for pushed and pulled tasks, state synchronization across masters, and more nuanced failure recovery mechanisms. Debugging becomes more difficult as tracing a task’s lifecycle across two scheduling mechanisms introduces ambiguity.

Resource contention can emerge when push and pull logic compete for shared worker pools, and the coordination overhead from managing partitioned masters or distributed locks like ZooKeeper further complicates operations.

While systems like Netflix’s Titus adopt hybrid scheduling to accommodate highly heterogeneous workloads, such as latency-sensitive services mixed with batch jobs, this level of complexity is rarely necessary for most task scheduling systems.

Partitioning

For high-throughput scheduling, scaling the scheduler horizontally requires multiple instances to work in parallel, each handling a different subset of scheduled tasks. A single scheduler can become a bottleneck due to the overhead of querying tasks, checking worker availability, and handling retries. To distribute the load efficiently, schedulers should be partitioned so that each instance queries only a specific portion of the schedule store.

Partitioning the schedule store can be done in several ways. A time-based partitioning strategy assigns each scheduler a time range, allowing it to process tasks scheduled within that window.

For example, one scheduler handles tasks scheduled between midnight and 6 AM, while another processes those between 6 AM and noon. This approach ensures queries are distributed efficiently, reducing contention in the database. However this approach doesn't really improve processing tasks in parallel as only one scheduler is working within it's designated interval.

Another option is hash-based partitioning, where tasks are assigned to schedulers based on a hash function, such as task ID modulo the number of schedulers. Each scheduler processes tasks belonging to its assigned hash range, which balances workloads dynamically.

A workload-based partitioning strategy can also be used, where different schedulers handle tasks based on priority or resource requirements. This method ensures that CPU-intensive tasks, I/O-bound tasks, and background jobs are distributed appropriately to prevent overloading any single instance.

For our design we will choose hash base partitioning since we like to handle high throughput by using multiple schedulers concurrently.

Adding Scheduler

Scaling scheduler instances dynamically requires a service discovery mechanism to track active schedulers and distribute partitions accordingly. Using Zookeeper, schedulers register themselves, and a coordinator process assigns partitions dynamically. If a scheduler crashes, its tasks are reassigned to a healthy instance, ensuring continuous scheduling without manual intervention.

By partitioning the schedule store, preventing duplicate execution, and enabling dynamic scaling, multiple scheduler instances can efficiently handle millions of tasks per second without introducing contention or overloading any single component. This architecture ensures that schedulers operate independently while maintaining consistency and fault tolerance in a distributed system.

Using Redis Sorted Sets for Scheduling Store Instead of PostgreSQL

Using Redis Sorted Sets (ZSET) for scheduling can reduce database load and improve efficiency by allowing tasks to be indexed by execution time and retrieved with minimal overhead. In a traditional PostgreSQL-based scheduler, tasks are stored in a table with an execute_at timestamp, requiring frequent queries to fetch due tasks. As task volume increases, these queries introduce read contention, slow performance, and create scaling bottlenecks. In contrast, Redis keeps tasks in-memory and retrieves them in O(log N) time, making it significantly faster and reducing database dependency.

In this approach, each task is added to a Redis Sorted Set where the task ID is the element and the execution timestamp is the score. This allows efficient retrieval of tasks that are ready to run. A scheduler process runs periodically, querying Redis for tasks whose execution time has arrived, executing them, and removing them from the queue. This eliminates the need for expensive database queries, significantly improving scheduling latency.

Unlike a database-based approach, Redis avoids contention issues because queries operate on an in-memory structure rather than scanning disk-based indexes. However, in a distributed system with multiple schedulers, task deduplication is necessary to prevent multiple instances from picking up the same task. This can be handled using Redis locks (SETNX), ensuring only one scheduler processes a given task. If a scheduler crashes mid-execution, another instance can safely retry without risking duplicate execution.

Scaling this system involves partitioning tasks across multiple Redis instances. Sharding can be done by execution time, where different Redis nodes handle tasks scheduled within specific time windows, or by consistent hashing on task IDs. Redis Cluster can also be used to distribute tasks across multiple nodes while maintaining atomic operations. For extremely high throughput, Redis Streams (XADD, XREADGROUP) provide a more scalable alternative, enabling better worker coordination.

Failure handling in this model requires re-enqueuing failed tasks with a delayed execution timestamp. Tasks that fail during execution can be pushed back into Redis with a retry delay, ensuring they are retried without blocking the main queue. This allows tasks to be retried automatically without requiring database intervention.

By replacing PostgreSQL with Redis Sorted Sets for scheduling, the system achieves lower latency, reduced contention, and better scalability. This approach is ideal for real-time scheduling but may require Redis Cluster or additional partitioning for very large workloads.

Failure scenarios/bottlenecks

Worker Race Conditions

A method to prevent worker race conditions is FIFO-based atomic task retrieval. In Redis, using BLPOP (blocking list pop) ensures that only one worker gets a task at a time. This eliminates the possibility of multiple workers fetching the same task simultaneously, since Redis guarantees that BLPOP removes the task immediately upon retrieval. This approach is ideal for low-latency task execution where immediate task assignment is required.

Idempotency

Ensuring idempotency in task execution is also crucial. If a task is retried due to failure or a race condition, it must not produce unintended side effects. We can use a task ID registry in Redis to track completed tasks to prevent duplicate execution.

Worker Failure Handling

When a task failure occurs, meaning the worker attempts execution but encounters an error, the system retries the task a configurable number of times before marking it as permanently failed. If the task continues to fail after exhausting all retries, it is either logged for manual review, moved to a dead-letter queue for further analysis, or escalated via alerts. This ensures that transient failures, such as temporary database outages or API timeouts, do not result in lost tasks, while genuinely failing tasks do not consume excessive resources.

When a worker failure occurs, meaning a worker crashes or becomes unresponsive before completing its assigned tasks, the system must detect the failure and reassign those tasks to another worker. Workers send periodic heartbeats to a coordination system such as Zookeeper, Redis, or etcd. If a worker stops sending heartbeats beyond a predefined timeout, it is marked as failed, and any tasks it was executing are returned to the queue to be picked up by another available worker.

In Redis-based task management, failed tasks can be moved to a retry queue, where they are reprocessed after a delay, preventing immediate re-execution in case of temporary issues. If the system uses a database-backed scheduler, a background process periodically checks for tasks that have been marked as "in progress" for too long. If a task has exceeded a predefined execution time without completion, it is assumed to be orphaned due to a worker failure and returned to the pending state for reassignment.

By differentiating between task failure and worker failure, the system ensures fault tolerance while avoiding unnecessary duplication of tasks. Task retries help handle transient issues without reassigning tasks prematurely, while worker failure detection prevents tasks from being lost if a worker crashes.

Future improvements

Kafka to Buffer Scheduled Jobs

Using Kafka to buffer scheduled jobs allows workers to consume tasks in an event-driven manner rather than relying on database polling. Traditional scheduling systems query a database for tasks that are due, which introduces read contention, increases query latency, and struggles to scale under high throughput. Instead, Kafka acts as a distributed queue where tasks are pushed to a topic as soon as they are ready, allowing workers to consume them instantly without polling overhead.

When a task is scheduled, it is first stored in a persistent store such as Redis Sorted Sets or a relational database with an execution timestamp. A lightweight dispatcher process continuously scans for tasks that are due and publishes them to Kafka, ensuring they are delivered to the appropriate consumers. Once a task is pushed to Kafka, workers subscribe to the topic and process tasks as soon as they arrive. This eliminates the delay of scheduled polling and reduces load on the database.

Kafka ensures task reliability through its retention mechanism, meaning if a worker crashes before completing a task, it can be reassigned to another worker without loss. Since Kafka partitions topics, it naturally distributes tasks across multiple consumers, allowing the system to scale dynamically by adding more workers. Fault tolerance is further enhanced by configuring consumers to commit task offsets only after successful execution, preventing task duplication.

Compared to polling a database, Kafka provides lower latency, better scalability, and built-in fault tolerance. The event-driven nature of the system means that workers process tasks as soon as they are due rather than waiting for a scheduled query to retrieve them. This architecture is ideal for handling high-throughput workloads efficiently without overwhelming a relational database.