Design A Task Scheduler - System Design

Requirements

Functional Requirements:

Users should be able to schedule one-time tasks for execution at a specified future time
Users should be able to schedule recurring tasks with intervals (every N minutes, daily, weekly, cron expressions)
Users should be able to manage tasks (add, remove, pause, modify, check status)
System should track task states (pending, running, completed, failed)
System should retry failed tasks with exponential backoff (max 3-5 retries)
Users should be notified of task failures (email, push, dashboard alerts)
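
The retry requirement above can be illustrated with a small delay calculator. This is a minimal sketch, not the system's actual policy: the 2-second base, the 30-second cap, and the name backoffDelay are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (1-based).
// The delay doubles from base on each retry and never exceeds maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	d := base
	for i := 1; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		return maxDelay
	}
	return d
}

func main() {
	// Assumed policy: up to 5 retries, 2s base delay, 30s cap.
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Printf("retry %d after %v\n", attempt,
			backoffDelay(attempt, 2*time.Second, 30*time.Second))
	}
}
```

A production version would usually add random jitter so that many tasks failing at once do not all retry at the same instant.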

Non-Functional Requirements:

Concurrency	10,000 concurrent tasks at peak
Throughput	100,000 tasks/day (~70 tasks/min)
Latency	< 1 second from scheduled time to execution start
Precision	1-second granularity, millisecond-level accuracy
Availability	High availability, horizontal scaling
Reliability	No missed tasks, at-least-once execution (effectively exactly-once when receivers are idempotent)

REST is appropriate here - we have clear CRUD operations on resources (tasks).

Endpoints

POST /v1/tasks
Headers: Authorization: Bearer <token>, Idempotency-Key: <uuid>
Body: {
  "name": "Daily Report",
  "type": "RECURRING",
  "payload": {
    "webhook_url": "https://api.example.com/reports",
    "method": "POST",
    "body": { "report_type": "daily" }
  },
  "cron_expression": "0 8 * * *",
  "end_date": "2026-12-31T00:00:00Z"
}
Response: 201 Created
{
  "id": "task-uuid",
  "status": "SCHEDULED",
  "next_execution": "2026-02-10T08:00:00Z"
}
GET /v1/tasks
Query: ?status=SCHEDULED&page=1&limit=20
Response: { "tasks": [...], "pagination": {...} }
GET /v1/tasks/{task_id}
Response: { "id": "...", "status": "...", "executions": [...] }
PATCH /v1/tasks/{task_id}
Body: { "status": "PAUSED" } or { "cron_expression": "0 9 * * *" }
Response: 200 OK
DELETE /v1/tasks/{task_id}
Response: 204 No Content
GET /v1/tasks/{task_id}/executions
Query: ?status=FAILED&limit=10
Response: { "executions": [...] }
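
As a rough illustration of how the Task API might decode and validate the POST /v1/tasks body shown above, here is a minimal sketch. The struct and the function parseCreateTask are hypothetical, and the cron check only counts fields; a real service would use a full cron parser.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// CreateTaskRequest mirrors the fields of the example request body.
type CreateTaskRequest struct {
	Name           string          `json:"name"`
	Type           string          `json:"type"`
	Payload        json.RawMessage `json:"payload"`
	CronExpression string          `json:"cron_expression"`
	EndDate        *time.Time      `json:"end_date"`
}

// parseCreateTask decodes the body and applies a few basic checks.
func parseCreateTask(body []byte) (*CreateTaskRequest, error) {
	var req CreateTaskRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, fmt.Errorf("invalid JSON: %w", err)
	}
	if req.Name == "" {
		return nil, errors.New("name is required")
	}
	if len(req.Payload) == 0 {
		return nil, errors.New("payload is required")
	}
	// Shallow check only: a recurring task needs a 5-field cron expression.
	if req.Type == "RECURRING" && len(strings.Fields(req.CronExpression)) != 5 {
		return nil, errors.New("recurring tasks need a 5-field cron_expression")
	}
	return &req, nil
}

func main() {
	body := []byte(`{
		"name": "Daily Report",
		"type": "RECURRING",
		"payload": {"webhook_url": "https://api.example.com/reports", "method": "POST"},
		"cron_expression": "0 8 * * *",
		"end_date": "2026-12-31T00:00:00Z"
	}`)
	req, err := parseCreateTask(body)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(req.Name, req.CronExpression, req.EndDate.Format(time.RFC3339))
}
```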

Idempotency

All state-changing operations require an Idempotency-Key header, so a request that is retried after a network failure does not create a duplicate task.
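
One way to picture the server side of this is a lookup keyed by the Idempotency-Key value. The sketch below is illustrative only: it keeps keys in an in-memory map, whereas a real deployment would need a shared store with expiry, and the names idempotencyStore and createOnce are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// idempotencyStore remembers the task ID produced for each Idempotency-Key.
type idempotencyStore struct {
	mu   sync.Mutex
	seen map[string]string // key -> task ID returned the first time
}

func newIdempotencyStore() *idempotencyStore {
	return &idempotencyStore{seen: make(map[string]string)}
}

// createOnce runs create only for the first request carrying a given key
// and replays the stored task ID for every retry of that key.
func (s *idempotencyStore) createOnce(key string, create func() string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.seen[key]; ok {
		return existing, true
	}
	id := create()
	s.seen[key] = id
	return id, false
}

func main() {
	store := newIdempotencyStore()
	next := 0
	create := func() string {
		next++
		return fmt.Sprintf("task-%d", next)
	}
	id1, replayed1 := store.createOnce("key-abc", create)
	id2, replayed2 := store.createOnce("key-abc", create) // client retry
	fmt.Println(id1, replayed1, id2, replayed2)
}
```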

High-Level Design

Key Points:

Task API, Schedulers, and Status Updater are components of the same service (share PostgreSQL)
Workers are separate - only interact with Redis and Kafka, no database access
Status Updater consumes Kafka events and writes execution results to PostgreSQL
Redis coordinates partition assignment (scheduler:active set with heartbeat)
Schedulers dynamically discover each other and rebalance partitions
Fixed 256 partitions, distributed among active schedulers
Redis only stores near-term tasks (not all tasks = low memory)

Detailed Component Design

Component Responsibilities

Task Scheduler Service (single service, multiple components):

Component	Responsibilities
Task API	CRUD operations, validates cron expressions, writes to PostgreSQL, calculates partition_id
Schedulers	Register in Redis, heartbeat, dynamic partition assignment, poll DB for due tasks, push to Redis, requeue expired tasks
Status Updater	Kafka consumer, writes execution results to task_executions table, updates task status

Why Status Updater is inside Task Scheduler Service:

Workers don't have PostgreSQL access (only Redis + Kafka)
Task Scheduler Service owns the tasks and task_executions tables
Same bounded context - execution results belong to task management

Status Updater Flow:

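func (su *StatusUpdater) Consume(ctx context.Context) {
    for msg := range su.kafka.Messages() {
        var event ExecutionEvent
        if err := json.Unmarshal(msg.Value, &event); err != nil {
            log.Warn("skipping malformed execution event", "error", err)
            continue
        }

        // Record this attempt in the execution history.
        if _, err := su.db.ExecContext(ctx, `
            INSERT INTO task_executions
            (task_id, status, completed_at, error_message, attempt_number)
            VALUES ($1, $2, $3, $4, $5)
        `, event.TaskID, event.Status, event.CompletedAt,
            event.Error, event.Attempt); err != nil {
            log.Warn("failed to record execution", "task_id", event.TaskID, "error", err)
            continue
        }

        // Reflect the latest outcome on the task itself.
        if _, err := su.db.ExecContext(ctx, `
            UPDATE tasks SET status = $1, updated_at = NOW() WHERE id = $2
        `, event.Status, event.TaskID); err != nil {
            log.Warn("failed to update task status", "task_id", event.TaskID, "error", err)
        }
    }
}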

Workers (separate service, no database access):

Pull tasks from Redis when due (Lua script)
Get task payload from Redis hash
Execute webhooks with timeout
Handle retries with exponential backoff
Publish execution events to Kafka

Notification Service:

Consumes execution events from Kafka
Sends failure notifications via email/push
Manages user notification preferences

Redis Data Storage Strategy

Problem: Workers need task payload (webhook URL, parameters) but don't access PostgreSQL.

Solution: Scheduler pushes both queue entry and task data to Redis.

Redis Key	Type	Owner	Purpose
`scheduler:active`	Set	Scheduler	Scheduler coordination (heartbeat)
`queue:due_tasks`	Sorted Set	Shared	Task timing (score = timestamp, member = task_id)
`queue:task_data:{id}`	Hash	Shared	Task payload (webhook_url, payload, retry_count)
`queue:processing`	Hash	Shared	Visibility timeout tracking
`worker:executed:{id}:{attempt}`	String	Worker	Completion marker for at-least-once delivery (set after the webhook succeeds)

Key Naming Convention:

scheduler:* - Used only by Scheduler (internal coordination)
queue:* - Shared contract between Scheduler and Workers
worker:* - Used only by Workers (internal state)

Order Matters Pattern (Sync Mitigation):

Write data first, queue second. This ensures a task is never queued without its data.

func (s *Scheduler) pushToRedis(ctx context.Context, task Task) error {
    dataKey := "queue:task_data:" + task.ID

    // 1. Write the task data first so a queued ID always has a payload.
    if err := s.rdb.HSet(ctx, dataKey, map[string]interface{}{
        "webhook_url": task.WebhookURL,
        "payload":     task.Payload,
        "retry_count": task.RetryCount,
    }).Err(); err != nil {
        return fmt.Errorf("failed to write task data: %w", err)
    }
    // The TTL cleans up data that never gets queued.
    s.rdb.Expire(ctx, dataKey, 24*time.Hour)

    // 2. Only then add the task to the due queue.
    if err := s.rdb.ZAdd(ctx, "queue:due_tasks", &redis.Z{
        Score:  float64(task.ScheduledAt.Unix()),
        Member: task.ID,
    }).Err(); err != nil {
        s.rdb.Del(ctx, dataKey) // best-effort cleanup; the TTL covers a failed delete
        return fmt.Errorf("failed to queue task: %w", err)
    }

    return nil
}

Failure Modes:

Failure	State	Impact	Recovery
HSET fails	Nothing written	None	Retry
ZADD fails	Data written, not queued	Orphaned data only if the cleanup DEL also fails	DEL in the error path, TTL cleanup (24h) as fallback
Crash between HSET and ZADD	Data exists, not queued	Orphaned data	TTL cleanup

Worker Claims with Data:

func (w *Worker) claimTask(ctx context.Context) (*Task, error) {
    taskID := w.claimTaskID(ctx)
    if taskID == "" {
        return nil, nil // nothing is due
    }

    data, err := w.rdb.HGetAll(ctx, "queue:task_data:"+taskID).Result()
    if err != nil || len(data) == 0 {
        // Data is missing or unreadable: release the claim and skip the task.
        log.Warn("task data missing", "task_id", taskID)
        w.rdb.HDel(ctx, "queue:processing", taskID)
        return nil, nil
    }

    return parseTask(taskID, data), nil
}

Note: Worker only publishes events. Status Updater handles:

Recording execution in task_executions
Calculating next run from cron_expression
Updating scheduled_at for the next execution
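
To make the "calculate next run" step concrete, here is a deliberately small sketch. It is not a full cron implementation: the matches and nextRun functions are hypothetical, each field may only be "*" or a single integer (no ranges, lists, or steps), and day-of-month and day-of-week are both required to match, which differs from standard cron when both are restricted. A real Status Updater would more likely rely on an existing cron library.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// exampleAfter is an arbitrary "last run" time used for the demonstration.
var exampleAfter = time.Date(2026, time.February, 9, 9, 0, 0, 0, time.UTC)

// matches reports whether value satisfies one cron field.
// Only "*" and a single integer are supported in this sketch.
func matches(field string, value int) bool {
	if field == "*" {
		return true
	}
	n, err := strconv.Atoi(field)
	return err == nil && n == value
}

// nextRun returns the first whole minute strictly after `after` (in UTC)
// that matches the 5-field expression, searching at most about one year.
func nextRun(expr string, after time.Time) (time.Time, error) {
	f := strings.Fields(expr)
	if len(f) != 5 {
		return time.Time{}, fmt.Errorf("expected 5 cron fields, got %d", len(f))
	}
	t := after.UTC().Truncate(time.Minute).Add(time.Minute)
	limit := t.AddDate(1, 0, 1)
	for ; t.Before(limit); t = t.Add(time.Minute) {
		if matches(f[0], t.Minute()) && matches(f[1], t.Hour()) &&
			matches(f[2], t.Day()) && matches(f[3], int(t.Month())) &&
			matches(f[4], int(t.Weekday())) {
			return t, nil
		}
	}
	return time.Time{}, fmt.Errorf("no run within a year for %q", expr)
}

func main() {
	next, err := nextRun("0 8 * * *", exampleAfter)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("next execution:", next.Format(time.RFC3339))
}
```

The returned time is what would be written back to scheduled_at so the scheduler picks the task up again on a later poll.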

Phase 6: Deep Dives (~15 min)

6.1 Resilience & Failure Handling

What happens if a Scheduler crashes?

Scheduler's heartbeat expires (30s TTL)
Other schedulers detect change in scheduler:active set
Partitions are automatically rebalanced among remaining schedulers
Kubernetes restarts the failed scheduler
New instance registers and triggers another rebalance
Idempotent operations - re-polling same tasks is safe (status check prevents duplicates)

Dynamic Partition Assignment (Redis-based):

func (s *Scheduler) Run(ctx context.Context) {
    // 1. Register in Redis with heartbeat
    s.rdb.SAdd(ctx, "scheduler:active", s.podName)
    go s.heartbeat(ctx) // Renew every 10s, TTL 30s

    // 2. Watch for scheduler changes and rebalance
    go s.watchAndRebalance(ctx)

    // 3. Start polling assigned partitions
    s.poll(ctx)
}

func (s *Scheduler) rebalance(schedulers []string) {
    sort.Strings(schedulers)
    myIndex := indexOf(schedulers, s.podName)
    totalSchedulers := len(schedulers)

    // Fixed 256 partitions, distributed by index
    s.partitions = []int{}
    if totalSchedulers == 0 {
        return // nothing to assign; also avoids a modulo-by-zero panic
    }
    for p := 0; p < 256; p++ {
        if p%totalSchedulers == myIndex {
            s.partitions = append(s.partitions, p)
        }
    }
}
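
To see what this assignment rule does when a scheduler disappears, the following standalone sketch recomputes ownership for a changing set of pod names. The function assignPartitions and the pod names are illustrative, not part of the service; the rule itself (p % activeSchedulers == myIndex over a sorted list) is the one described above.

```go
package main

import (
	"fmt"
	"sort"
)

// assignPartitions returns the partitions owned by `me` when `total`
// partitions are dealt round-robin over the sorted scheduler names.
// It returns nil if `me` is not in the active list.
func assignPartitions(schedulers []string, me string, total int) []int {
	sorted := append([]string(nil), schedulers...) // do not reorder the caller's slice
	sort.Strings(sorted)
	idx := sort.SearchStrings(sorted, me)
	if idx == len(sorted) || sorted[idx] != me {
		return nil
	}
	var owned []int
	for p := idx; p < total; p += len(sorted) {
		owned = append(owned, p) // same set as p % len(sorted) == idx
	}
	return owned
}

func main() {
	active := []string{"sched-a", "sched-b", "sched-c"}
	for _, name := range active {
		fmt.Println(name, "owns", len(assignPartitions(active, name, 256)), "partitions")
	}

	// sched-b's heartbeat expires; the survivors recompute their share.
	active = []string{"sched-a", "sched-c"}
	for _, name := range active {
		fmt.Println(name, "owns", len(assignPartitions(active, name, 256)), "partitions")
	}
}
```

Because every scheduler computes its share from the same sorted list, no coordination beyond agreeing on the membership set is needed. While two schedulers briefly hold different views of that set, a partition can have two owners, which is why the status check on re-polled tasks matters.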

What happens if Worker crashes mid-execution?

Task remains in Redis queue:processing hash with timeout
Visibility timeout: if not ACK'd in 30s, scheduler requeues the task
At-least-once delivery, idempotency on webhook side
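
The visibility-timeout bookkeeping can be sketched without Redis by modelling queue:processing as a map from task ID to claim time. Everything below is an in-memory illustration; the type processing and its methods are hypothetical names, and the real system would do the same steps against the Redis hash and sorted set.

```go
package main

import (
	"fmt"
	"sort"
)

// processing models the queue:processing hash: task ID -> Unix time of the claim.
type processing map[string]int64

// claim records that a worker picked up the task at time now.
func (p processing) claim(taskID string, now int64) { p[taskID] = now }

// ack removes the task once the worker reports completion.
func (p processing) ack(taskID string) { delete(p, taskID) }

// expired removes and returns every task claimed more than timeout seconds
// ago, so the caller can put it back on queue:due_tasks.
func (p processing) expired(now, timeout int64) []string {
	var ids []string
	for id, claimedAt := range p {
		if now-claimedAt > timeout {
			ids = append(ids, id)
			delete(p, id)
		}
	}
	sort.Strings(ids) // deterministic order for logging and tests
	return ids
}

func main() {
	p := processing{}
	p.claim("task-1", 1000)
	p.claim("task-2", 1000)
	p.ack("task-1") // worker finished task-1

	// The worker holding task-2 crashed; 31 seconds later the scheduler sweeps.
	fmt.Println("requeue:", p.expired(1031, 30))
}
```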

Circuit Breaker for External Webhooks:

Track failure rate per webhook domain
If >50% failures in 1 minute, open circuit
Reject new executions for that domain for 30 seconds
Half-open: allow 1 test request, close if successful
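
The breaker rules above map onto a small state machine. The sketch below is one possible reading, with several assumptions called out: timestamps are plain Unix seconds, minSamples is an added guard that is not in the rules above, a failed probe re-opens the circuit, and one breaker value would be kept per webhook domain (for example in a map keyed by host).

```go
package main

import "fmt"

type circuitState int

const (
	closed circuitState = iota
	open
	halfOpen
)

const (
	windowSeconds   = 60 // failure rate is measured over 1 minute
	cooldownSeconds = 30 // how long an open circuit rejects requests
	minSamples      = 4  // assumption: do not trip on a handful of requests
)

type breaker struct {
	state         circuitState
	windowStart   int64
	total, failed int
	openedAt      int64
}

// allow reports whether a request may be sent at time now.
func (b *breaker) allow(now int64) bool {
	switch b.state {
	case open:
		if now-b.openedAt < cooldownSeconds {
			return false
		}
		b.state = halfOpen // cooldown over: let exactly one probe through
		return true
	case halfOpen:
		return false // the probe is still in flight
	default:
		return true
	}
}

// record feeds the outcome of a request back into the breaker.
func (b *breaker) record(now int64, ok bool) {
	switch b.state {
	case halfOpen:
		if ok {
			*b = breaker{windowStart: now} // probe succeeded: close and reset
		} else {
			b.state, b.openedAt = open, now // probe failed: new cooldown
		}
		return
	case open:
		return // outcomes of requests sent before the trip are ignored
	}
	if now-b.windowStart >= windowSeconds {
		b.windowStart, b.total, b.failed = now, 0, 0
	}
	b.total++
	if !ok {
		b.failed++
	}
	if b.total >= minSamples && b.failed*2 > b.total { // more than 50% failed
		b.state, b.openedAt = open, now
	}
}

func main() {
	b := &breaker{}
	for _, ok := range []bool{true, false, false, false} {
		b.record(0, ok)
	}
	fmt.Println(b.allow(10), b.allow(30), b.allow(31)) // open, probe, probe in flight
	b.record(31, true)
	fmt.Println(b.allow(32)) // closed again
}
```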

6.2 Consistency & At-Least-Once Execution

Problem: How do we ensure webhook delivery without losing tasks?

Solution: Redis-based At-Least-Once Delivery

Mark execution in Redis AFTER sending webhook successfully. If crash occurs before marking, we retry (prefer duplicate over loss).

How It Works:

Event	execution_id	Behavior
First attempt	task-123:1	Key does not exist, send webhook, mark complete
Crash after send, before mark	task-123:1	Key does not exist, retry sends a duplicate
Webhook fails	task-123:1	No key set, same attempt retries
Success after retry	task-123:1	Key set, future retries skip
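
The table above can be traced with a short sketch in which a map stands in for the Redis marker keys. The function deliver and the error value errWebhook are illustrative names; the point is only the ordering: send first, mark second.

```go
package main

import (
	"errors"
	"fmt"
)

// errWebhook simulates a failing receiver in the demonstration.
var errWebhook = errors.New("webhook returned 500")

// deliver sends the webhook for executionID unless a completion marker
// already exists. The marker is written only after a successful send, so a
// crash between the send and the write leads to a duplicate, never a loss.
func deliver(markers map[string]bool, executionID string, send func() error) (bool, error) {
	if markers[executionID] {
		return false, nil // already delivered: skip
	}
	if err := send(); err != nil {
		return false, err // no marker set, so this attempt will be retried
	}
	markers[executionID] = true
	return true, nil
}

func main() {
	markers := map[string]bool{}
	attempts := 0
	send := func() error {
		attempts++
		if attempts == 1 {
			return errWebhook // first call fails
		}
		return nil
	}
	for i := 0; i < 3; i++ {
		sent, err := deliver(markers, "task-123:1", send)
		fmt.Println("sent:", sent, "err:", err)
	}
	fmt.Println("webhook calls:", attempts)
}
```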

Trade-off:

Aspect	At-Least-Once (our choice)	At-Most-Once
Duplicates	Possible (crash after send)	Prevented
Data loss	No	Possible
Redis dependency	Required	Required
Receiver requirement	Must be idempotent	No requirement

Why At-Least-Once?

Data loss is unacceptable (missed notifications, lost orders)
Duplicates can be handled by idempotent receivers (using X-Idempotency-Key)
Industry standard for critical webhook delivery

6.3 Scalability

Horizontal Scaling:

Component	Scaling Strategy
Task Scheduler Service (API + Schedulers + Status Updater)	Scale together as one deployment, Redis coordinates partition rebalancing
Workers	Scale independently, add more workers, Redis handles distribution
PostgreSQL	Primary for writes, read replicas optional
Redis	Cluster mode with partitioned queues if needed

Trade-off: Coupled Scaling

Task API, Schedulers, and Status Updater scale together because they're in the same service:

Pros	Cons
Simpler deployment (one artifact)	Can't scale API independently from Schedulers
Shared PostgreSQL connection pool	If API needs 10 replicas but Scheduler needs 3, you get 10 of both
Same bounded context	Resource waste if workloads differ significantly

Mitigation: For most workloads, this is acceptable. If API load vastly exceeds scheduling load, consider splitting into separate services later.

Scheduler Partition Query:

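func (s *Scheduler) poll(ctx context.Context) {
    for _, partition := range s.partitions {
        // Look 60 seconds ahead so tasks reach Redis before they are due.
        rows, err := s.db.QueryContext(ctx, `
            SELECT id, scheduled_at, payload FROM tasks
            WHERE status = 'SCHEDULED' AND scheduled_at < $1 AND partition_id = $2
            ORDER BY scheduled_at LIMIT 100`, time.Now().Add(60*time.Second), partition)
        if err != nil {
            continue // try this partition again on the next poll
        }
        // Push to Redis...
        rows.Close()
    }
}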

Notes:

Fixed 256 partitions (never changes, allows scaling to 256 schedulers)
partition_id is pre-calculated on task creation: hash(user_id) % 256
Each scheduler handles partitions where p % activeSchedulers == myIndex
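
The partition_id calculation might look like the sketch below. The notes only say hash(user_id) % 256; the choice of FNV-1a here is an assumption, and any stable hash with a reasonable spread would serve.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 256 // fixed for the lifetime of the system

// partitionID maps a user ID to one of the fixed partitions.
// FNV-1a is an assumed hash function, chosen because it is in the
// standard library and gives the same result on every run and host.
func partitionID(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on an FNV hash never returns an error
	return int(h.Sum32() % numPartitions)
}

func main() {
	for _, user := range []string{"user-1", "user-2", "user-42"} {
		fmt.Printf("%s -> partition %d\n", user, partitionID(user))
	}
}
```

Hashing on user_id keeps all of one user's tasks in a single partition, so a user with an unusually large number of tasks can make that partition hotter than the rest.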

Benefits:

No duplicate scheduling (each partition owned by exactly one scheduler)
Fully dynamic scaling (no manual configuration)
Automatic rebalancing on failure or scaling
Linear throughput scaling (more schedulers = more capacity)

6.4 Observability

Key Metrics:

Metric	Description	Alert Threshold
tasks_scheduled_total	Tasks created	-
tasks_executed_total	Successful executions	-
tasks_failed_total	Failed executions	>1% failure rate
execution_delay_seconds	scheduled_at vs actual start	>5s p99
queue_depth	Tasks waiting in Redis	>1000
worker_utilization	% workers busy	>80%
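
For the execution_delay_seconds alert, the p99 would normally come from the metrics backend. As a rough illustration of what the threshold means, this sketch computes a nearest-rank percentile over raw delay samples; the function percentile and the sample values are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100)
// of samples, or 0 when there are no samples.
func percentile(samples []float64, p int) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // leave the input untouched
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n) in integer arithmetic
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical delays in seconds between scheduled_at and actual start.
	delays := []float64{0.2, 0.3, 0.1, 0.4, 6.5, 0.2, 0.3, 0.2, 0.1, 0.5}
	p99 := percentile(delays, 99)
	fmt.Printf("p99 = %.1fs, alert = %v\n", p99, p99 > 5)
}
```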

SLOs:

99.9% of tasks execute within 5 seconds of scheduled time
99.99% task execution success rate (excluding external failures)
99.9% API availability

Distributed Tracing:

Trace ID propagated from API → Scheduler → Worker → Webhook
Allows debugging end-to-end execution path

Dead Letter Queue (DLQ):

Tasks that fail all retries are moved to the DLQ for manual inspection and resolution.

DLQ Operations:

Operation	How
View failed	`SELECT * FROM task_dlq WHERE resolved_at IS NULL`
Retry	Re-insert into tasks with status=SCHEDULED, mark DLQ resolved
Archive	Set resolution='archived', resolved_at=NOW()

Retry from DLQ: