Requirements


Functional Requirements:

  1. Users should be able to schedule one-time tasks for execution at a specified future time
  2. Users should be able to schedule recurring tasks with intervals (every N minutes, daily, weekly, cron expressions)
  3. Users should be able to manage tasks (add, remove, pause, modify, check status)
  4. System should track task states (pending, running, completed, failed)
  5. System should retry failed tasks with exponential backoff (max 3-5 retries)
  6. Users should be notified of task failures (email, push, dashboard alerts)




Non-Functional Requirements:

Concurrency: 10,000 concurrent tasks at peak
Throughput: 100,000 tasks/day (~70 tasks/min on average)
Latency: < 1 second from scheduled time to execution start
Precision: 1-second granularity, millisecond-level accuracy
Availability: High availability, horizontal scaling
Reliability: No missed tasks; effectively-once execution (at-least-once delivery plus idempotent receivers, see 6.2)



API Design

REST is appropriate here: the API exposes clear CRUD operations on a single resource type (tasks).

Endpoints

    POST /v1/tasks
    Headers:
      Authorization: Bearer <token>
      Idempotency-Key: <uuid>
    Body:
    {
      "name": "Daily Report",
      "type": "RECURRING",
      "payload": {
        "webhook_url": "https://api.example.com/reports",
        "method": "POST",
        "body": { "report_type": "daily" }
      },
      "cron_expression": "0 8 * * *",
      "end_date": "2026-12-31T00:00:00Z"
    }
    Response: 201 Created
    {
      "id": "task-uuid",
      "status": "SCHEDULED",
      "next_execution": "2026-02-10T08:00:00Z"
    }

    GET /v1/tasks
    Query: ?status=SCHEDULED&page=1&limit=20
    Response: { "tasks": [...], "pagination": {...} }

    GET /v1/tasks/{task_id}
    Response: { "id": "...", "status": "...", "executions": [...] }

    PATCH /v1/tasks/{task_id}
    Body: { "status": "PAUSED" } or { "cron_expression": "0 9 * * *" }
    Response: 200 OK

    DELETE /v1/tasks/{task_id}
    Response: 204 No Content

    GET /v1/tasks/{task_id}/executions
    Query: ?status=FAILED&limit=10
    Response: { "executions": [...] }

Idempotency

All state-changing operations require an Idempotency-Key header, so that a request retried after a network failure does not create a duplicate task.
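
The idempotency check can be sketched as follows. This is a minimal in-memory illustration, not the service's actual implementation: IdempotencyStore and CreateOnce are hypothetical names, and a real deployment would presumably back the lookup with a durable store (for example a unique-keyed PostgreSQL table) rather than a map.

```go
package main

import "sync"

// IdempotencyStore remembers which task ID was created for each
// Idempotency-Key. Hypothetical in-memory stand-in for a durable store.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]string // Idempotency-Key -> task ID
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]string)}
}

// CreateOnce runs create only the first time key is seen and returns the
// stored task ID on every later call with the same key. The boolean reports
// whether this call actually created the task.
func (s *IdempotencyStore) CreateOnce(key string, create func() (string, error)) (string, bool, error) {
	// Holding the lock across create keeps the sketch simple; it serializes
	// all creations, which a production store would avoid.
	s.mu.Lock()
	defer s.mu.Unlock()

	if id, ok := s.seen[key]; ok {
		return id, false, nil // replayed request: return the original result
	}
	id, err := create()
	if err != nil {
		return "", false, err // nothing recorded, so the client may retry
	}
	s.seen[key] = id
	return id, true, nil
}
```

A handler for POST /v1/tasks would call CreateOnce with the header value and respond 201 Created with the same task ID whether or not the request was a replay.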



High-Level Design

Key Points:

  • Task API, Schedulers, and Status Updater are components of the same service (share PostgreSQL)
  • Workers are separate - only interact with Redis and Kafka, no database access
  • Status Updater consumes Kafka events and writes execution results to PostgreSQL
  • Redis coordinates partition assignment (scheduler:active set with heartbeat)
  • Schedulers dynamically discover each other and rebalance partitions
  • Fixed 256 partitions, distributed among active schedulers
  • Redis stores only near-term tasks (those due within the polling look-ahead window), not the full task set, which keeps memory usage low





Detailed Component Design

Component Responsibilities

Task Scheduler Service (single service, multiple components):

Component | Responsibilities
Task API | CRUD operations, validates cron expressions, writes to PostgreSQL, calculates partition_id
Schedulers | Register in Redis, heartbeat, dynamic partition assignment, poll DB for due tasks, push to Redis, requeue expired tasks
Status Updater | Kafka consumer, writes execution results to task_executions table, updates task status

Why Status Updater is inside Task Scheduler Service:

  • Workers don't have PostgreSQL access (only Redis + Kafka)
  • Task Scheduler Service owns the tasks and task_executions tables
  • Same bounded context - execution results belong to task management

Status Updater Flow:

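    // Consume reads execution events from Kafka and persists each result.
    func (su *StatusUpdater) Consume(ctx context.Context) {
        for msg := range su.kafka.Messages() {
            var event ExecutionEvent
            if err := json.Unmarshal(msg.Value, &event); err != nil {
                log.Warn("skipping malformed execution event", "error", err)
                continue
            }

            // Record this attempt in the execution history.
            if _, err := su.db.ExecContext(ctx, `
                INSERT INTO task_executions
                    (task_id, status, completed_at, error_message, attempt_number)
                VALUES ($1, $2, $3, $4, $5)
            `, event.TaskID, event.Status, event.CompletedAt, event.Error, event.Attempt); err != nil {
                log.Warn("failed to record execution", "task_id", event.TaskID, "error", err)
                continue
            }

            // Reflect the latest outcome on the task itself.
            if _, err := su.db.ExecContext(ctx, `
                UPDATE tasks SET status = $1, updated_at = NOW() WHERE id = $2
            `, event.Status, event.TaskID); err != nil {
                log.Warn("failed to update task status", "task_id", event.TaskID, "error", err)
            }
        }
    }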

Workers (separate service, no database access):

  • Pull tasks from Redis when due (Lua script)
  • Get task payload from Redis hash
  • Execute webhooks with timeout
  • Handle retries with exponential backoff
  • Publish execution events to Kafka
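
The retry delay a worker applies between attempts might look like the sketch below. The function name BackoffDelay, the doubling factor, and the base and cap values are assumptions for illustration; the design only states that retries use exponential backoff with a bounded retry count.

```go
package main

import "time"

// BackoffDelay returns the wait before retry number attempt (1-based).
// The delay starts at base, doubles on each further attempt, and is capped
// at max. All three values are illustrative assumptions.
func BackoffDelay(attempt int, base, max time.Duration) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	delay := base
	for i := 1; i < attempt; i++ {
		delay *= 2
		if delay >= max {
			return max // stop early so the doubling cannot overflow
		}
	}
	if delay > max {
		return max
	}
	return delay
}

// ShouldRetry reports whether another attempt is allowed after `attempts`
// tries have already failed, given a configured retry limit.
func ShouldRetry(attempts, maxRetries int) bool {
	return attempts < maxRetries
}
```

With a 1-second base and a 60-second cap, attempts 1, 2, and 3 wait 1s, 2s, and 4s. In practice a random jitter is often added so that many failing tasks do not retry in lockstep; it is omitted here to keep the sketch deterministic.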

Notification Service:

  • Consumes execution events from Kafka
  • Sends failure notifications via email/push
  • Manages user notification preferences

Redis Data Storage Strategy

Problem: Workers need task payload (webhook URL, parameters) but don't access PostgreSQL.

Solution: Scheduler pushes both queue entry and task data to Redis.

Redis Key | Type | Owner | Purpose
scheduler:active | Set | Scheduler | Scheduler coordination (heartbeat)
queue:due_tasks | Sorted Set | Shared | Task timing (score = timestamp, member = task_id)
queue:task_data:{id} | Hash | Shared | Task payload (webhook_url, payload, retry_count)
queue:processing | Hash | Shared | Visibility timeout tracking
worker:executed:{id}:{attempt} | String | Worker | Execution marker, set after a successful webhook send (at-least-once delivery, see 6.2)

Key Naming Convention:

  • scheduler:* - Used only by Schedulers (internal coordination)
  • queue:* - Shared contract between Schedulers and Workers
  • worker:* - Used only by Workers (internal state)

Order Matters Pattern (Sync Mitigation):

Write data first, queue second. This ensures a task is never queued without its data.

    func (s *Scheduler) pushToRedis(ctx context.Context, task Task) error {
        dataKey := "queue:task_data:" + task.ID

        // 1. Write the payload first so a queued ID always has data behind it.
        if err := s.rdb.HSet(ctx, dataKey, map[string]interface{}{
            "webhook_url": task.WebhookURL,
            "payload":     task.Payload,
            "retry_count": task.RetryCount,
        }).Err(); err != nil {
            return fmt.Errorf("failed to write task data: %w", err)
        }
        // The TTL bounds the lifetime of orphaned data if a later step fails.
        if err := s.rdb.Expire(ctx, dataKey, 24*time.Hour).Err(); err != nil {
            s.rdb.Del(ctx, dataKey)
            return fmt.Errorf("failed to set task data TTL: %w", err)
        }

        // 2. Queue the task ID, scored by its due time.
        if err := s.rdb.ZAdd(ctx, "queue:due_tasks", &redis.Z{
            Score:  float64(task.ScheduledAt.Unix()),
            Member: task.ID,
        }).Err(); err != nil {
            s.rdb.Del(ctx, dataKey) // best-effort cleanup; the TTL covers a failed delete
            return fmt.Errorf("failed to queue task: %w", err)
        }
        return nil
    }

Failure Modes:

Failure | State | Impact | Recovery
HSET fails | Nothing written | None | Retry
ZADD fails | Data exists, not queued | Orphaned data if the cleanup DEL also fails | Best-effort DEL, then TTL cleanup (24h)
Crash between HSET and ZADD | Data exists, not queued | Orphaned data | TTL cleanup (24h)

Worker Claims with Data:

    func (w *Worker) claimTask(ctx context.Context) (*Task, error) {
        taskID := w.claimTaskID(ctx)
        if taskID == "" {
            return nil, nil // nothing is due
        }

        data, err := w.rdb.HGetAll(ctx, "queue:task_data:"+taskID).Result()
        if err != nil || len(data) == 0 {
            // The payload is gone (for example, its TTL expired): drop the claim.
            log.Warn("task data missing", "task_id", taskID)
            w.rdb.HDel(ctx, "queue:processing", taskID)
            return nil, nil
        }
        return parseTask(taskID, data), nil
    }



Note: Worker only publishes events. Status Updater handles:

  • Recording execution in task_executions
  • Calculating next run from cron_expression
  • Updating scheduled_at for the next execution



Phase 6: Deep Dives (~15 min)

6.1 Resilience & Failure Handling

What happens if a Scheduler crashes?

  • Scheduler's heartbeat expires (30s TTL)
  • Other schedulers detect change in scheduler:active set
  • Partitions are automatically rebalanced among remaining schedulers
  • Kubernetes restarts the failed scheduler
  • New instance registers and triggers another rebalance
  • Idempotent operations - re-polling same tasks is safe (status check prevents duplicates)

Dynamic Partition Assignment (Redis-based):

    func (s *Scheduler) Run(ctx context.Context) {
        // 1. Register in Redis with heartbeat.
        s.rdb.SAdd(ctx, "scheduler:active", s.podName)
        go s.heartbeat(ctx) // renew every 10s, TTL 30s

        // 2. Watch for scheduler changes and rebalance.
        go s.watchAndRebalance(ctx)

        // 3. Start polling assigned partitions.
        s.poll(ctx)
    }

    func (s *Scheduler) rebalance(schedulers []string) {
        sort.Strings(schedulers)
        myIndex := indexOf(schedulers, s.podName)
        if myIndex < 0 {
            // Not in the active set (for example, heartbeat expired): own nothing.
            // This also guards the modulo below against an empty list.
            s.partitions = nil
            return
        }
        totalSchedulers := len(schedulers)

        // Fixed 256 partitions, distributed by index.
        s.partitions = []int{}
        for p := 0; p < 256; p++ {
            if p%totalSchedulers == myIndex {
                s.partitions = append(s.partitions, p)
            }
        }
    }



What happens if Worker crashes mid-execution?

  • Task remains in Redis queue:processing hash with timeout
  • Visibility timeout: if not ACK'd in 30s, scheduler requeues the task
  • The result is at-least-once delivery, so the webhook receiver must be idempotent
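
The requeue decision can be sketched as a pure function over the contents of queue:processing. The sketch assumes each hash value is the Unix time at which the task was claimed, which the design does not spell out; ExpiredClaims is a hypothetical name.

```go
package main

import (
	"sort"
	"time"
)

// ExpiredClaims returns the IDs of tasks whose claim is at least `timeout`
// old. claims mirrors the queue:processing hash, assumed here to map
// task ID -> Unix time (seconds) of the claim. nowUnix is the current Unix time.
func ExpiredClaims(claims map[string]int64, nowUnix int64, timeout time.Duration) []string {
	cutoff := nowUnix - int64(timeout/time.Second)
	var expired []string
	for id, claimedAt := range claims {
		if claimedAt <= cutoff {
			expired = append(expired, id)
		}
	}
	sort.Strings(expired) // map iteration order is random; sort for stable output
	return expired
}
```

A scheduler would run this periodically with a 30-second timeout, then for each returned ID remove it from queue:processing and add it back to queue:due_tasks.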



Circuit Breaker for External Webhooks:

  • Track failure rate per webhook domain
  • If >50% failures in 1 minute, open circuit
  • Reject new executions for that domain for 30 seconds
  • Half-open: allow 1 test request, close if successful
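
A per-domain breaker following these rules might look like the sketch below. Breaker, Allow, and Record are hypothetical names. Two simplifications are assumptions rather than part of the design: the failure rate is measured over a fixed (not sliding) one-minute window, and the breaker only trips after a minimum number of requests so that a single early failure does not open the circuit. Times are passed in as Unix seconds to keep the logic deterministic.

```go
package main

import "sync"

const (
	windowSeconds   = 60 // failure-rate window
	cooldownSeconds = 30 // how long an open circuit rejects requests
	minRequests     = 10 // assumption: ignore the rate until enough samples exist
)

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

// Breaker guards one webhook domain. The zero value is a closed breaker.
type Breaker struct {
	mu          sync.Mutex
	state       breakerState
	windowStart int64
	successes   int
	failures    int
	openedAt    int64
}

// Allow reports whether a request may be sent at time now (Unix seconds).
func (b *Breaker) Allow(now int64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case stateOpen:
		if now-b.openedAt < cooldownSeconds {
			return false
		}
		b.state = stateHalfOpen
		return true // the single test request
	case stateHalfOpen:
		return false // the test request is still in flight
	default:
		return true
	}
}

// Record stores the outcome of a request that Allow admitted.
func (b *Breaker) Record(now int64, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case stateHalfOpen:
		if ok {
			b.state, b.windowStart, b.successes, b.failures = stateClosed, now, 0, 0
		} else {
			b.state, b.openedAt = stateOpen, now
		}
		return
	case stateOpen:
		return // late result from a request sent before the circuit opened
	}
	if now-b.windowStart >= windowSeconds {
		b.windowStart, b.successes, b.failures = now, 0, 0
	}
	if ok {
		b.successes++
	} else {
		b.failures++
	}
	total := b.successes + b.failures
	if total >= minRequests && b.failures*2 > total { // more than 50% failed
		b.state, b.openedAt = stateOpen, now
	}
}
```

A worker would keep one Breaker per webhook domain (for example in a map keyed by host), call Allow before sending, and call Record with the result afterwards.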


6.2 Consistency & At-Least-Once Execution

Problem: How do we ensure webhook delivery without losing tasks?

Solution: Redis-based At-Least-Once Delivery

Mark execution in Redis AFTER sending webhook successfully. If crash occurs before marking, we retry (prefer duplicate over loss).


How It Works:

Event | execution_id | Behavior
First attempt | task-123:1 | Key does not exist: send webhook, then mark complete
Crash after send, before mark | task-123:1 | Key does not exist: the retry sends a duplicate
Webhook fails | task-123:1 | No key is set: the same attempt is retried
Success after retry | task-123:1 | Key is set: future retries are skipped
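
The send-then-mark ordering in the table can be sketched as follows. MarkerStore, memStore, and Deliver are hypothetical names; the in-memory map stands in for the Redis key worker:executed:{id}:{attempt}, and a real worker would issue Redis commands instead.

```go
package main

import "fmt"

// MarkerStore abstracts the execution marker kept in Redis.
type MarkerStore interface {
	Exists(key string) (bool, error)
	Set(key string) error
}

// memStore is an in-memory MarkerStore used only for illustration.
type memStore map[string]bool

func (m memStore) Exists(key string) (bool, error) { return m[key], nil }
func (m memStore) Set(key string) error            { m[key] = true; return nil }

// Deliver sends the webhook for one execution unless it is already marked
// as done. It reports whether send was invoked successfully.
func Deliver(store MarkerStore, taskID string, attempt int, send func() error) (bool, error) {
	key := fmt.Sprintf("worker:executed:%s:%d", taskID, attempt)

	done, err := store.Exists(key)
	if err != nil {
		return false, err
	}
	if done {
		return false, nil // already delivered: skip
	}

	if err := send(); err != nil {
		return false, err // no marker written, so the same attempt is retried
	}

	// A crash between send and Set leaves no marker, so a retry sends a
	// duplicate. This is the accepted cost of at-least-once delivery.
	if err := store.Set(key); err != nil {
		return true, err
	}
	return true, nil
}
```

Calling Deliver twice for task-123, attempt 1, sends the webhook once; the second call finds the marker and returns without sending.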

Trade-off:

Aspect | At-Least-Once (our choice) | At-Most-Once
Duplicates | Possible (crash after send) | Prevented
Data loss | No | Possible
Redis dependency | Required | Required
Receiver requirement | Must be idempotent | No requirement

Why At-Least-Once?

  • Data loss is unacceptable (missed notifications, lost orders)
  • Duplicates can be handled by idempotent receivers (using X-Idempotency-Key)
  • Industry standard for critical webhook delivery




6.3 Scalability

Horizontal Scaling:

Component | Scaling Strategy
Task Scheduler Service (API + Schedulers + Status Updater) | Scale together as one deployment; Redis coordinates partition rebalancing
Workers | Scale independently by adding workers; Redis handles distribution
PostgreSQL | Primary for writes, read replicas optional
Redis | Cluster mode with partitioned queues if needed

Trade-off: Coupled Scaling

Task API, Schedulers, and Status Updater scale together because they're in the same service:

Pros | Cons
Simpler deployment (one artifact) | Can't scale API independently from Schedulers
Shared PostgreSQL connection pool | If API needs 10 replicas but Scheduler needs 3, you get 10 of both
Same bounded context | Resource waste if workloads differ significantly

Mitigation: For most workloads, this is acceptable. If API load vastly exceeds scheduling load, consider splitting into separate services later.



Scheduler Partition Query:

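    func (s *Scheduler) poll(ctx context.Context) {
        for _, partition := range s.partitions {
            // Fetch tasks due within the next 60 seconds for this partition,
            // earliest first so the LIMIT never skips the most urgent tasks.
            rows, err := s.db.QueryContext(ctx, `
                SELECT id, scheduled_at, payload
                FROM tasks
                WHERE status = 'SCHEDULED'
                  AND scheduled_at < $1
                  AND partition_id = $2
                ORDER BY scheduled_at
                LIMIT 100
            `, time.Now().Add(60*time.Second), partition)
            if err != nil {
                log.Warn("partition poll failed", "partition", partition, "error", err)
                continue
            }
            // Push to Redis...
            rows.Close()
        }
    }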

Notes:

  • Fixed 256 partitions (never changes, allows scaling to 256 schedulers)
  • partition_id is pre-calculated on task creation: hash(user_id) % 256
  • Each scheduler handles partitions where p % activeSchedulers == myIndex
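
The two formulas in these notes can be sketched directly. The choice of FNV-1a as the hash function is an assumption (the design only says hash(user_id) % 256), and PartitionID and OwnedPartitions are hypothetical names.

```go
package main

import "hash/fnv"

// numPartitions is fixed for the lifetime of the system.
const numPartitions = 256

// PartitionID maps a user ID to one of the fixed partitions.
// FNV-1a is an illustrative choice; any stable hash would do.
func PartitionID(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % numPartitions)
}

// OwnedPartitions lists the partitions handled by the scheduler at position
// myIndex among `active` schedulers: every p with p % active == myIndex.
func OwnedPartitions(myIndex, active int) []int {
	var owned []int
	if active <= 0 || myIndex < 0 || myIndex >= active {
		return owned // not a member of the active set: own nothing
	}
	for p := myIndex; p < numPartitions; p += active {
		owned = append(owned, p)
	}
	return owned
}
```

With three active schedulers, index 0 owns partitions 0, 3, 6, and so on (86 partitions), while indexes 1 and 2 own 85 each, so every partition has exactly one owner.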

Benefits:

  • No duplicate scheduling (each partition owned by exactly one scheduler)
  • Fully dynamic scaling (no manual configuration)
  • Automatic rebalancing on failure or scaling
  • Linear throughput scaling (more schedulers = more capacity)




6.4 Observability

Key Metrics:

Metric | Description | Alert Threshold
tasks_scheduled_total | Tasks created | -
tasks_executed_total | Successful executions | -
tasks_failed_total | Failed executions | >1% failure rate
execution_delay_seconds | scheduled_at vs actual start | >5s p99
queue_depth | Tasks waiting in Redis | >1000
worker_utilization | % workers busy | >80%

SLOs:

  • 99.9% of tasks execute within 5 seconds of scheduled time
  • 99.99% task execution success rate (excluding external failures)
  • 99.9% API availability

Distributed Tracing:

  • Trace ID propagated from API → Scheduler → Worker → Webhook
  • Allows debugging end-to-end execution path

Dead Letter Queue (DLQ):

Tasks that fail all retries are moved to the DLQ for manual inspection and resolution.




DLQ Operations:

Operation | How
View failed | SELECT * FROM task_dlq WHERE resolved_at IS NULL
Retry | Re-insert into tasks with status=SCHEDULED, mark DLQ resolved
Archive | Set resolution='archived', resolved_at=NOW()

Retry from DLQ: