Retry + Exponential Backoff
Failures in distributed systems are rarely permanent.
More often, they are momentary. A network packet is dropped. A database connection pool is briefly exhausted. A downstream service restarts and takes a few seconds to warm up.
In those moments, retrying is often the right move.
The problem begins when retries are immediate.
Imagine a service times out. The caller retries instantly. It times out again. The caller retries again. Multiply that behaviour across hundreds or thousands of instances and you have a retry storm. The downstream service, already struggling, is now flooded with repeated requests.
What started as a small outage becomes amplified by good intentions.
Retry with Exponential Backoff exists to make retries behave like patience instead of panic.
The Nature of Transient Failure
Transient failures are temporary by definition. They resolve with time. Retrying gives the system a chance to recover.
The question is not whether to retry. The question is how.
Immediate retries assume the system has already recovered. In many cases, that assumption is wrong. A service that is overloaded needs breathing room. A network partition needs time to heal.
Exponential backoff introduces increasing delays between attempts. The first retry may happen quickly. The second waits longer. The third longer still.
Time becomes part of the recovery strategy.
The Backoff Curve
A typical exponential backoff sequence might look like this:
- Retry 1: wait 100ms
- Retry 2: wait 200ms
- Retry 3: wait 400ms
- Retry 4: wait 800ms
Each delay doubles. The system spaces out its attempts instead of hammering the dependency.
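The doubling schedule above can be sketched as a simple function. The base delay and the cap are illustrative values, not part of the pattern itself; most real policies also cap the delay so it cannot grow without bound.

```python
def backoff_delay(retry: int, base_ms: int = 100, cap_ms: int = 30_000) -> int:
    """Delay before the given retry (1-indexed): base * 2^(retry-1), capped."""
    return min(base_ms * 2 ** (retry - 1), cap_ms)

# Reproduces the schedule above: [100, 200, 400, 800]
delays = [backoff_delay(n) for n in range(1, 5)]
```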
Adding jitter — a small random variation — prevents many clients from retrying at exactly the same intervals. Without jitter, synchronized retries can recreate load spikes in waves.
Backoff with jitter turns coordinated stress into staggered recovery.
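One common way to add jitter is "full jitter": instead of waiting exactly the exponential delay, each client draws a random delay between zero and that ceiling. A minimal sketch, with illustrative base and cap values:

```python
import random

def backoff_with_full_jitter(retry: int, base_ms: int = 100,
                             cap_ms: int = 30_000) -> float:
    """Full jitter: a uniform random delay between 0 and the exponential ceiling."""
    ceiling = min(base_ms * 2 ** (retry - 1), cap_ms)
    return random.uniform(0, ceiling)
```

Because each client picks its own point in the window, retries that would otherwise arrive in synchronized waves spread out across the whole interval.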
Protecting the Dependency
Exponential backoff is not only about being polite. It protects the caller as well.
If a dependency is unavailable, repeated immediate failures consume threads, sockets, and CPU time. Requests pile up. Latency increases. Resource usage climbs.
Spacing retries reduces contention. It limits how aggressively the system competes for a failing resource.
Over time, the retry pattern adapts to the shape of the outage.
Boundaries Matter
Retries should never be infinite.
Without a maximum retry limit, the system risks entering a prolonged loop that ties up resources indefinitely. A retry policy must define when to give up and surface an error.
Backoff increases latency. A request that might have succeeded instantly now waits progressively longer. That cost is acceptable when the alternative is collapse.
The goal is graceful degradation, not blind persistence.
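Putting the pieces together, a bounded retry loop might look like the sketch below. The exception types, the attempt limit, and the timing constants are all illustrative choices, not prescriptions:

```python
import random
import time

def call_with_retries(operation, max_attempts: int = 5,
                      base_ms: int = 100, cap_ms: int = 30_000):
    """Run `operation`, retrying transient failures with jittered
    exponential backoff, then give up and surface the last error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # bounded: surface the error instead of looping forever
            ceiling = min(base_ms * 2 ** (attempt - 1), cap_ms)
            time.sleep(random.uniform(0, ceiling) / 1000)
```

The `raise` on the final attempt is the boundary: the caller sees the real failure and can degrade gracefully rather than wait indefinitely.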
Where It Fits
Retry with Exponential Backoff is appropriate for network calls, API requests, database connections, and any operation that can fail transiently.
It is less appropriate for operations that are unlikely to succeed without intervention. Retrying a malformed request does not improve its chances.
Understanding the difference between transient and permanent failure remains essential.
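For HTTP-style calls, one hypothetical way to encode that distinction is a small classifier over status codes. The mapping below follows a common convention (retry timeouts, throttling, and server errors; never retry client errors), but it is an assumption, not a universal rule:

```python
# Common convention: these statuses usually signal transient conditions.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

def is_transient(status_code: int) -> bool:
    """True if the failure is plausibly temporary and worth retrying."""
    return status_code in RETRYABLE_STATUS

# A 400 Bad Request is permanent: the request itself is malformed,
# and no amount of waiting will fix it.
```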
A Measured Response
Distributed systems benefit from measured behaviour.
Exponential backoff teaches the system to pause before trying again. It transforms a reflex into a strategy.
Outages happen.
The question is whether your retries make them worse or help them heal.