Retry policy in CompletionService
Introduction
CompletionService is useful when you want to submit many tasks and consume results as they finish instead of waiting in submission order. Adding retries to that pattern is possible, but the retry logic should be designed explicitly rather than bolted onto Future.get() calls. The key is to treat retryable failure as part of task orchestration, not as an afterthought in the result-reading loop.
Core Sections
What CompletionService Actually Gives You
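CompletionService is an interface; its standard implementation, ExecutorCompletionService, combines an executor with a completion queue. You submit tasks, and each Future is placed on the queue when its task finishes, so take() and poll() return futures in completion order rather than submission order.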
Typical setup:
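A minimal sketch of such a setup, assuming a fixed pool of four threads and String-returning tasks. The class name CompletionSetupExample and the method runAll are illustrative, not from the original article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CompletionSetupExample {

    // Runs every task on a small pool and returns the results in completion order.
    static List<String> runAll(List<Callable<String>> tasks) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(4); // pool size is an assumption
        try {
            CompletionService<String> completionService =
                    new ExecutorCompletionService<>(executor);
            for (Callable<String> task : tasks) {
                completionService.submit(task);
            }
            List<String> results = new ArrayList<>();
            for (int i = 0; i < tasks.size(); i++) {
                // take() blocks until any submitted task has finished.
                Future<String> done = completionService.take();
                results.add(done.get()); // throws ExecutionException if the task failed
            }
            return results;
        } finally {
            executor.shutdown();
        }
    }
}
```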
This is useful for workloads where some tasks finish quickly and you want to start handling them immediately.
Where Retry Logic Should Live
The cleanest design is usually to wrap the task itself with retry behavior. That way each submitted unit either:
- succeeds within the retry policy
- fails permanently after exhausting retries
A simple retrying Callable:
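One way to write such a wrapper is sketched below. RetryingCallable is a hypothetical name, and this version retries every exception except InterruptedException, with no delay between attempts; a real policy would add classification and backoff.

```java
import java.util.concurrent.Callable;

// Hypothetical wrapper: runs the delegate up to maxAttempts times.
final class RetryingCallable<T> implements Callable<T> {
    private final Callable<T> delegate;
    private final int maxAttempts;

    RetryingCallable(Callable<T> delegate, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public T call() throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return delegate.call();
            } catch (InterruptedException e) {
                // Never retry an interruption: restore the flag and stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        // All attempts failed: surface the last failure to the Future.
        throw last;
    }
}
```

Submitting `new RetryingCallable<>(task, 3)` instead of `task` leaves the rest of the completion-service code unchanged.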
This keeps retry policy close to the work being retried.
Submitting Tasks and Reading Results
Use the completion service normally:
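The result loop might look like the following sketch. ResultLoopExample and drain are illustrative names, and the caller is assumed to know how many tasks were submitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class ResultLoopExample {

    // Reads `submitted` results in completion order and records each outcome as text.
    static List<String> drain(CompletionService<String> service, int submitted)
            throws InterruptedException {
        List<String> outcomes = new ArrayList<>();
        for (int i = 0; i < submitted; i++) {
            Future<String> future = service.take(); // next finished task, success or failure
            try {
                outcomes.add("ok: " + future.get());
            } catch (ExecutionException e) {
                // getCause() is the exception the task itself threw.
                outcomes.add("failed: " + e.getCause().getMessage());
            }
        }
        return outcomes;
    }
}
```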
Tasks that succeed after retries appear just like normal successful tasks. Tasks that exhaust retries surface as failed futures: get() throws an ExecutionException whose cause is the task's final failure.
Retry Outside the Task When Needed
Sometimes you want central orchestration rather than embedding retry logic in each task. For example, you may want to inspect the exception type and resubmit only certain failures.
A simplified idea:
- submit original callable
- consume completed future
- if failure is retryable and retry budget remains, submit a new callable
- otherwise record permanent failure
This approach is more flexible, but you must track outstanding task count carefully so the result loop knows when it is really done.
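The steps above can be sketched as follows. ResubmitOrchestrator, Attempt, and isRetryable are hypothetical names, and treating only IOException as retryable is an assumption made for illustration. The map of in-flight futures doubles as the outstanding task count.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class ResubmitOrchestrator {

    // Pairs an in-flight task with its attempt number (hypothetical helper type).
    private static final class Attempt {
        final Callable<String> task;
        final int number;

        Attempt(Callable<String> task, int number) {
            this.task = task;
            this.number = number;
        }
    }

    // Assumed classification: only IOException counts as transient here.
    static boolean isRetryable(Throwable cause) {
        return cause instanceof IOException;
    }

    // Runs all tasks, resubmitting retryable failures; returns the permanent failure count.
    static int runAll(ExecutorService executor, List<Callable<String>> tasks, int maxAttempts)
            throws InterruptedException {
        CompletionService<String> service = new ExecutorCompletionService<>(executor);
        Map<Future<String>, Attempt> inFlight = new HashMap<>();
        for (Callable<String> task : tasks) {
            inFlight.put(service.submit(task), new Attempt(task, 1));
        }

        int permanentFailures = 0;
        // The loop ends only when nothing is pending, including resubmitted work.
        while (!inFlight.isEmpty()) {
            Future<String> done = service.take();
            Attempt attempt = inFlight.remove(done);
            try {
                done.get();
            } catch (ExecutionException e) {
                if (isRetryable(e.getCause()) && attempt.number < maxAttempts) {
                    inFlight.put(service.submit(attempt.task),
                            new Attempt(attempt.task, attempt.number + 1));
                } else {
                    permanentFailures++;
                }
            }
        }
        return permanentFailures;
    }
}
```

Looping on a fixed count taken from the initial submission would exit before resubmitted tasks finish, which is why the loop condition is driven by the in-flight map.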
Distinguish Retryable and Permanent Failures
Not every exception should be retried. Good candidates:
- transient network timeout
- temporary remote throttling
- short-lived database connectivity issue
Bad candidates:
- validation errors
- malformed requests
- deterministic logic bugs
A retry policy without exception classification often amplifies failure instead of improving resilience.
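A classifier can be as small as one predicate. The sketch below is illustrative only: RetryClassifier is a hypothetical name, and the choice of exception types is an assumption that a real system would replace with the failures its own dependencies actually produce.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeoutException;

final class RetryClassifier {

    // Returns true only for failures this sketch assumes to be transient.
    static boolean isRetryable(Throwable failure) {
        // Future.get() wraps the task's exception, so inspect the cause when present.
        Throwable cause = (failure instanceof ExecutionException && failure.getCause() != null)
                ? failure.getCause()
                : failure;
        // Everything else, including IllegalArgumentException and logic bugs,
        // is treated as permanent.
        return cause instanceof SocketTimeoutException
                || cause instanceof ConnectException
                || cause instanceof TimeoutException;
    }
}
```

Defaulting to "not retryable" means an unrecognized exception fails fast instead of being retried by accident.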
Add Backoff and Limits
Retries without delay can overload the failing dependency. Add:
- maximum attempts
- delay or exponential backoff
- optional jitter
For example, backoff based on attempt number:
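A sketch of one possible calculation, assuming 1-based attempt numbers, a doubling delay, and a cap. The Backoff class and its method names are illustrative; the jittered variant picks a uniformly random delay up to the exponential value.

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {

    // Exponential delay: base, 2*base, 4*base, ... capped at maxMillis.
    static long exponentialDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // Jittered delay: uniformly random between 0 and the exponential delay, inclusive.
    static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = exponentialDelayMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a base of 100 ms and a cap of 10 seconds, attempts 1 through 4 wait at most 100, 200, 400, and 800 ms.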
In real systems, jitter helps avoid synchronized retry storms across many worker threads.
Cancellation and Shutdown Semantics
If the executor is shutting down or the calling thread is interrupted, honor interruption correctly. A retry loop that ignores interruption can hang application shutdown.
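That means stopping the retry loop as soon as an interruption is observed, and either propagating InterruptedException or restoring the thread's interrupt flag. Do not silently swallow interruptions inside retry code.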
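A retry loop that honors interruption might look like this sketch. InterruptAwareRetry and callWithRetry are hypothetical names, and the fixed delay between attempts is a simplification.

```java
import java.util.concurrent.Callable;

final class InterruptAwareRetry {

    // Retries the task up to maxAttempts times, but stops immediately on interruption.
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Check before each attempt so a pending interrupt is not ignored.
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("interrupted before attempt " + attempt);
            }
            try {
                return task.call();
            } catch (InterruptedException e) {
                throw e; // propagate instead of treating it as a retryable failure
            } catch (Exception e) {
                last = e;
            }
            if (attempt < maxAttempts) {
                // sleep() throws InterruptedException, which ends the loop.
                Thread.sleep(delayMillis);
            }
        }
        throw last;
    }
}
```

When this runs inside a pool thread, ExecutorService.shutdownNow() interrupts the worker, and the loop exits at the next check or sleep instead of continuing to retry.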
When a Library Is Better
If retry policy becomes complex, such as needing jitter, circuit breaking, metrics, and exception classification, consider using a resilience library instead of hand-writing everything. CompletionService still coordinates the concurrency, but resilience policy can be delegated to a library layer.
Common Pitfalls
- Retrying every exception instead of only retryable failures.
- Adding retries in the result loop without tracking resubmitted task counts correctly.
- Omitting backoff and creating retry storms under load.
- Ignoring interruption and making executor shutdown unreliable.
- Hiding permanent failures by retrying until the system times out elsewhere.
Summary
- CompletionService is good for consuming task results in completion order.
- The simplest retry design is often a retrying Callable submitted to the service.
- Retries need limits, backoff, and exception classification.
- Distinguish transient failures from permanent ones before resubmitting work.
- Treat shutdown and interruption as first-class parts of the retry design.

