Requirements
Functional Requirements:
- Users should be able to schedule one-time tasks for execution at a specified future time
- Users should be able to schedule recurring tasks with intervals (every N minutes, daily, weekly, cron expressions)
- Users should be able to manage tasks (add, remove, pause, modify, check status)
- System should track task states (pending, running, completed, failed)
- System should retry failed tasks with exponential backoff (max 3-5 retries)
- Users should be notified of task failures (email, push, dashboard alerts)
Non-Functional Requirements:
| Concurrency | 10,000 concurrent tasks at peak |
| Throughput | 100,000 tasks/day (~70 tasks/min) |
| Latency | < 1 second from scheduled time to execution start |
| Precision | 1-second granularity, millisecond-level accuracy |
| Availability | High availability, horizontal scaling |
| Reliability | No missed tasks, exactly-once execution guarantee |
REST is appropriate here - we have clear CRUD operations on resources (tasks).
Endpoints
POST /v1/tasks
Headers: Authorization: Bearer <token>, Idempotency-Key: <uuid>
Body: {
"name": "Daily Report",
"type": "RECURRING",
"payload": {
"webhook_url": "https://api.example.com/reports",
"method": "POST",
"body": { "report_type": "daily" }
},
"cron_expression": "0 8 * * *",
"end_date": "2026-12-31T00:00:00Z"
}
Response: 201 Created
{
"id": "task-uuid",
"status": "SCHEDULED",
"next_execution": "2026-02-10T08:00:00Z"
}
GET /v1/tasks
Query: ?status=SCHEDULED&page=1&limit=20
Response: { "tasks": [...], "pagination": {...} }
GET /v1/tasks/{task_id}
Response: { "id": "...", "status": "...", "executions": [...] }
PATCH /v1/tasks/{task_id}
Body: { "status": "PAUSED" } or { "cron_expression": "0 9 * * *" }
Response: 200 OK
DELETE /v1/tasks/{task_id}
Response: 204 No Content
GET /v1/tasks/{task_id}/executions
Query: ?status=FAILED&limit=10
Response: { "executions": [...] }
Idempotency
All state-changing operations require Idempotency-Key header to prevent duplicate task creation on network retries.
High-Level Design
Key Points:
- Task API, Schedulers, and Status Updater are components of the same service (share PostgreSQL)
- Workers are separate - only interact with Redis and Kafka, no database access
- Status Updater consumes Kafka events and writes execution results to PostgreSQL
- Redis coordinates partition assignment (
scheduler:activeset with heartbeat) - Schedulers dynamically discover each other and rebalance partitions
- Fixed 256 partitions, distributed among active schedulers
- Redis only stores near-term tasks (not all tasks = low memory)
Detailed Component Design
Component Responsibilities
Task Scheduler Service (single service, multiple components):
| ComponentResponsibilities | |
| Task API | CRUD operations, validates cron expressions, writes to PostgreSQL, calculates partition_id |
| Schedulers | Register in Redis, heartbeat, dynamic partition assignment, poll DB for due tasks, push to Redis, requeue expired tasks |
| Status Updater | Kafka consumer, writes execution results to task_executions table, updates task status |
Why Status Updater is inside Task Scheduler Service:
- Workers don't have PostgreSQL access (only Redis + Kafka)
- Task Scheduler Service owns the
tasksandtask_executionstables - Same bounded context - execution results belong to task management
Status Updater Flow:
func (su *StatusUpdater) Consume(ctx context.Context) {
for msg := range su.kafka.Messages() {
var event ExecutionEvent
json.Unmarshal(msg.Value, &event)
su.db.ExecContext(ctx, `
INSERT INTO task_executions
(task_id, status, completed_at, error_message, attempt_number)
VALUES ($1, $2, $3, $4, $5)
`, event.TaskID, event.Status, event.CompletedAt,
event.Error, event.Attempt)
su.db.ExecContext(ctx, `
UPDATE tasks SET status = $1, updated_at = NOW() WHERE id = $2
`, event.Status, event.TaskID)
}
}
Workers (separate service, no database access):
- Pull tasks from Redis when due (Lua script)
- Get task payload from Redis hash
- Execute webhooks with timeout
- Handle retries with exponential backoff
- Publish execution events to Kafka
Notification Service:
- Consumes execution events from Kafka
- Sends failure notifications via email/push
- Manages user notification preferences
Redis Data Storage Strategy
Problem: Workers need task payload (webhook URL, parameters) but don't access PostgreSQL.
Solution: Scheduler pushes both queue entry and task data to Redis.
| Redis KeyTypeOwnerPurpose | |||
scheduler:active | Set | Scheduler | Scheduler coordination (heartbeat) |
queue:due_tasks | Sorted Set | Shared | Task timing (score = timestamp, member = task_id) |
queue:task_data:{id} | Hash | Shared | Task payload (webhook_url, payload, retry_count) |
queue:processing | Hash | Shared | Visibility timeout tracking |
worker:executed:{id}:{attempt} | String | Worker | At-most-once delivery (SETNX before webhook) |
Key Naming Convention:
scheduler:- Used only by Scheduler (internal coordination)queue:- Shared contract between Scheduler and Workersworker:*- Used only by Workers (internal state)
Order Matters Pattern (Sync Mitigation):
Write data first, queue second. This ensures a task is never queued without its data.
func (s *Scheduler) pushToRedis(ctx context.Context, task Task) error {
dataKey := "queue:task_data:" + task.ID
if err := s.rdb.HSet(ctx, dataKey, map[string]interface{}{
"webhook_url": task.WebhookURL,
"payload": task.Payload,
"retry_count": task.RetryCount,
}).Err(); err != nil {
return fmt.Errorf("failed to write task data: %w", err)
}
s.rdb.Expire(ctx, dataKey, 24*time.Hour)
if err := s.rdb.ZAdd(ctx, "queue:due_tasks", &redis.Z{
Score: float64(task.ScheduledAt.Unix()),
Member: task.ID,
}).Err(); err != nil {
s.rdb.Del(ctx, dataKey)
return fmt.Errorf("failed to queue task: %w", err)
}
return nil
}
Failure Modes:
| FailureStateImpactRecovery | |||
| HSET fails | Nothing written | None | Retry |
| ZADD fails | Data exists, not queued | Orphaned data | TTL cleanup (24h) |
| Crash between HSET and ZADD | Data exists, not queued | Orphaned data | TTL cleanup |
Worker Claims with Data:
func (w *Worker) claimTask(ctx context.Context) (*Task, error) {
taskID := w.claimTaskID(ctx)
if taskID == "" {
return nil, nil
}
data, err := w.rdb.HGetAll(ctx, "queue:task_data:"+taskID).Result()
if err != nil || len(data) == 0 {
log.Warn("task data missing", "task_id", taskID)
w.rdb.HDel(ctx, "queue:processing", taskID)
return nil, nil
}
return parseTask(taskID, data), nil
}
Note: Worker only publishes events. Status Updater handles:
- Recording execution in
task_executions - Calculating next run from
cron_expression - Updating
scheduled_atfor the next execution
Phase 6: Deep Dives (~15 min)
6.1 Resilience & Failure Handling
What happens if a Scheduler crashes?
- Scheduler's heartbeat expires (30s TTL)
- Other schedulers detect change in
scheduler:activeset - Partitions are automatically rebalanced among remaining schedulers
- Kubernetes restarts the failed scheduler
- New instance registers and triggers another rebalance
- Idempotent operations - re-polling same tasks is safe (status check prevents duplicates)
Dynamic Partition Assignment (Redis-based):
func (s *Scheduler) Run(ctx context.Context) {
// 1. Register in Redis with heartbeat
s.rdb.SAdd(ctx, "scheduler:active", s.podName)
go s.heartbeat(ctx) // Renew every 10s, TTL 30s
// 2. Watch for scheduler changes and rebalance
go s.watchAndRebalance(ctx)
// 3. Start polling assigned partitions
s.poll(ctx)
}
func (s *Scheduler) rebalance(schedulers []string) {
sort.Strings(schedulers)
myIndex := indexOf(schedulers, s.podName)
totalSchedulers := len(schedulers)
// Fixed 256 partitions, distributed by index
s.partitions = []int{}
for p := 0; p < 256; p++ {
if p % totalSchedulers == myIndex {
s.partitions = append(s.partitions, p)
}
}
}
What happens if Worker crashes mid-execution?
- Task remains in Redis
queue:processinghash with timeout - Visibility timeout: if not ACK'd in 30s, scheduler requeues the task
- At-least-once delivery, idempotency on webhook side
Circuit Breaker for External Webhooks:
- Track failure rate per webhook domain
- If >50% failures in 1 minute, open circuit
- Reject new executions for that domain for 30 seconds
- Half-open: allow 1 test request, close if successful
6.2 Consistency & At-Least-Once Execution
Problem: How do we ensure webhook delivery without losing tasks?
Solution: Redis-based At-Least-Once Delivery
Mark execution in Redis AFTER sending webhook successfully. If crash occurs before marking, we retry (prefer duplicate over loss).
How It Works:
| Eventexecution_idBehavior | ||
| First attempt | task-123:1 | Key not exists, send webhook, mark complete |
| Crash after send, before mark | task-123:1 | Key not exists, retry sends duplicate |
| Webhook fails | task-123:1 | No key set, same attempt retries |
| Success after retry | task-123:1 | Key set, future retries skip |
Trade-off:
| AspectAt-Least-Once (our choice)At-Most-Once | ||
| Duplicates | Possible (crash after send) | Prevented |
| Data loss | No | Possible |
| Redis dependency | Required | Required |
| Receiver requirement | Must be idempotent | No requirement |
Why At-Least-Once?
- Data loss is unacceptable (missed notifications, lost orders)
- Duplicates can be handled by idempotent receivers (using X-Idempotency-Key)
- Industry standard for critical webhook delivery
6.3 Scalability
Horizontal Scaling:
| ComponentScaling Strategy | |
| Task Scheduler Service (API + Schedulers + Status Updater) | Scale together as one deployment, Redis coordinates partition rebalancing |
| Workers | Scale independently, add more workers, Redis handles distribution |
| PostgreSQL | Primary for writes, read replicas optional |
| Redis | Cluster mode with partitioned queues if needed |
Trade-off: Coupled Scaling
Task API, Schedulers, and Status Updater scale together because they're in the same service:
| ProsCons | |
| Simpler deployment (one artifact) | Can't scale API independently from Schedulers |
| Shared PostgreSQL connection pool | If API needs 10 replicas but Scheduler needs 3, you get 10 of both |
| Same bounded context | Resource waste if workloads differ significantly |
Mitigation: For most workloads, this is acceptable. If API load vastly exceeds scheduling load, consider splitting into separate services later.
Scheduler Partition Query:
func (s Scheduler) poll(ctx context.Context) {
for _, partition := range s.partitions {
rows, _ := s.db.QueryContext(ctx, SELECT id, scheduled_at, payload FROM tasks WHERE status = 'SCHEDULED' AND scheduled_at < $1 AND partition_id = $2 LIMIT 100 , time.Now().Add(60time.Second), partition)
// Push to Redis...
}
}
Notes:
- Fixed 256 partitions (never changes, allows scaling to 256 schedulers)
partition_idis pre-calculated on task creation:hash(user_id) % 256- Each scheduler handles
partitions where p % activeSchedulers == myIndex
Benefits:
- No duplicate scheduling (each partition owned by exactly one scheduler)
- Fully dynamic scaling (no manual configuration)
- Automatic rebalancing on failure or scaling
- Linear throughput scaling (more schedulers = more capacity)
6.4 Observability
Key Metrics:
| MetricDescriptionAlert Threshold | ||
| tasks_scheduled_total | Tasks created | - |
| tasks_executed_total | Successful executions | - |
| tasks_failed_total | Failed executions | >1% failure rate |
| execution_delay_seconds | scheduled_at vs actual start | >5s p99 |
| queue_depth | Tasks waiting in Redis | >1000 |
| worker_utilization | % workers busy | >80% |
SLOs:
- 99.9% of tasks execute within 5 seconds of scheduled time
- 99.99% task execution success rate (excluding external failures)
- 99.9% API availability
Distributed Tracing:
- Trace ID propagated from API → Scheduler → Worker → Webhook
- Allows debugging end-to-end execution path
Dead Letter Queue (DLQ):
Tasks that fail all retries are moved to the DLQ for manual inspection and resolution.
DLQ Operations:
| OperationHow | |
| View failed | SELECT * FROM task_dlq WHERE resolved_at IS NULL |
| Retry | Re-insert into tasks with status=SCHEDULED, mark DLQ resolved |
| Archive | Set resolution='archived', resolved_at=NOW() |
Retry from DLQ: