Webhook Retries: Backoff & Jitter
If you are looking for an exponential backoff with jitter algorithm example for webhook retries, this guide gives you production-safe defaults. You will map status codes to retry behavior, apply backoff and jitter correctly, and harden your endpoint against duplicate deliveries.
The goal is simple: accept webhooks quickly, avoid retry storms, and keep downstream processing reliable even when dependencies are unstable.
Examples below use Laravel/PHP, but the retry and idempotency patterns are framework-agnostic.
SendPromptly delivery rules (what triggers retries)
2xx stops retries
A 2xx response means the delivery attempt succeeded for that endpoint, so retrying stops for that attempt chain. Keep this contract strict and predictable.
See How SendPromptly defines success and retries for the baseline behavior.
Non-2xx triggers retry attempts with exponential backoff
Any non-2xx status indicates the delivery did not complete successfully and should be retried on a schedule. This includes application errors and upstream outages.
Timeouts vs 5xx vs 4xx (recommended semantics)
Treat network timeouts and 5xx responses as transient by default. Treat most 4xx responses as contract or validation issues, unless the 4xx is intentionally temporary (for example, short-lived auth/rate-limit windows).
| Attempt result | Retry behavior | Endpoint recommendation |
|---|---|---|
| 2xx | Stop retries | Return immediately after enqueueing work |
| Timeout or 5xx | Retry with backoff + jitter | Return non-2xx only when you cannot safely accept |
| Permanent 4xx / payload contract issue | Usually avoid repeated retries | Return 2xx, store for manual review, and fix mapping |
See it live: Trigger an event from your Sample Project, then open Message Log to watch attempt counts and outcomes.
Backoff + jitter (why it exists)
For a practical exponential backoff with jitter algorithm for webhook retries, use exponential delays with a maximum cap, then randomize each wait window to avoid synchronized re-delivery spikes.
Exponential backoff basics
Exponential backoff increases retry delay after each failed attempt (for example, doubling each time) so unhealthy endpoints get time to recover.
Jitter to prevent thundering herd
Without jitter, many failed deliveries retry at the same second and overwhelm recovering systems. With jitter, each retry is spread across a window.
Full jitter vs equal jitter in practice:
Full jitter picks a random delay from 0..current_backoff, while equal jitter keeps half fixed and randomizes the rest. Full jitter usually smooths spikes better during large incident recoveries.
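As a sketch of the difference, assuming delays are tracked in whole seconds (the function names below are illustrative, not part of any SendPromptly SDK):

```php
<?php

// Full jitter: wait anywhere between 0 and the current backoff.
function fullJitter(int $currentBackoffSeconds): int
{
    return random_int(0, $currentBackoffSeconds);
}

// Equal jitter: keep half the backoff fixed, randomize the other half.
function equalJitter(int $currentBackoffSeconds): int
{
    $half = intdiv($currentBackoffSeconds, 2);

    return $half + random_int(0, $half);
}
```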
Example schedules (with caps)
A common webhook retry schedule is 1m, 2m, 4m, 8m with a cap: attempt 1 at ~1 minute, attempt 2 at ~2 minutes, attempt 3 at ~4 minutes, attempt 4 at ~8 minutes, then cap future delays (for example at 15 minutes) and apply jitter each time.
Suggested diagram/visual: a timeline with retries at 1m, 2m, 4m, 8m, then capped intervals, showing how jitter spreads each retry over a window instead of one exact timestamp.
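One way to generate such a schedule, as a sketch that assumes a 60-second base, a 15-minute cap, and full jitter on every attempt (not SendPromptly's exact internals):

```php
<?php

// Delay (in seconds) before a given retry attempt (1-based), capped and jittered.
function retryDelaySeconds(int $attempt, int $baseSeconds = 60, int $capSeconds = 900): int
{
    // Exponential growth: 1m, 2m, 4m, 8m, ... capped at 15 minutes.
    $backoff = min($capSeconds, $baseSeconds * (2 ** ($attempt - 1)));

    // Full jitter spreads each retry across the whole window instead of one exact timestamp.
    return random_int(0, $backoff);
}

// Example: print a jittered schedule for the first six attempts.
foreach (range(1, 6) as $attempt) {
    printf("attempt %d: wait %d s\n", $attempt, retryDelaySeconds($attempt));
}
```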
What to implement on your endpoint
“Ack fast, process async”
Return success quickly after authenticity checks, then move heavy work to a queue. Do not block the webhook response on DB-heavy or third-party calls.
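A minimal Laravel sketch of this pattern, assuming an illustrative `ProcessWebhookEvent` job and an HMAC signature header; route, class, and config names are placeholders for your own setup:

```php
<?php

namespace App\Http\Controllers;

use App\Jobs\ProcessWebhookEvent;
use Illuminate\Http\Request;

class WebhookController extends Controller
{
    public function handle(Request $request)
    {
        // Verify authenticity first, then acknowledge quickly.
        abort_unless($this->signatureIsValid($request), 401);

        // Heavy work (DB writes, third-party calls) goes to the queue.
        ProcessWebhookEvent::dispatch($request->all());

        // Respond immediately so the sender records a 2xx and stops retrying.
        return response()->json(['accepted' => true], 200);
    }

    private function signatureIsValid(Request $request): bool
    {
        // Compare the provider's signature header to our own HMAC of the raw body.
        $expected = hash_hmac('sha256', $request->getContent(), (string) config('services.webhooks.secret'));

        return hash_equals($expected, (string) $request->header('X-Signature', ''));
    }
}
```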
Idempotency keys / dedupe table
Retries can deliver the same event more than once, so enforce idempotency in storage and workers. Use a stable dedupe key (event ID or a deterministic payload hash), and make downstream writes idempotent.
For ingestion behavior, see Ingestion idempotency (24-hour TTL) and why it matters.
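One way to enforce dedupe in the queued worker, as a sketch that assumes the provider sends a stable event ID and uses Laravel's cache with a 24-hour TTL (your key and store may differ):

```php
<?php

use Illuminate\Support\Facades\Cache;

// Returns true the first time an event ID is seen within 24 hours,
// false for any duplicate delivery of the same event.
function claimEventOnce(string $eventId): bool
{
    // Cache::add only writes when the key is absent, so only the first delivery wins
    // (atomic on stores such as Redis).
    return Cache::add("webhook:seen:{$eventId}", true, now()->addDay());
}

// In the queued job:
// if (! claimEventOnce($payload['event_id'])) {
//     return; // duplicate delivery - side effects were already applied
// }
```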
Retry-aware logging (attempt number, latency)
Log attempt number, HTTP status, endpoint latency, and correlation identifiers so one delivery chain can be traced end-to-end without exposing secrets.
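A sketch of what those log lines could capture, assuming the provider passes event and attempt identifiers in headers (the header names here are illustrative):

```php
<?php

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;

function logWebhookAttempt(Request $request, int $status, float $startedAt): void
{
    Log::info('webhook.received', [
        // Correlation ID ties one delivery chain together across attempts.
        'event_id'   => $request->header('X-Event-Id'),
        // Attempt number shows whether failures cluster on first or later tries.
        'attempt'    => (int) $request->header('X-Attempt', 1),
        'status'     => $status,
        'latency_ms' => (int) round((microtime(true) - $startedAt) * 1000),
    ]);
}
```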
Optional transient classification helper:
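A sketch, assuming you classify by the HTTP status and exception type observed when calling your own downstream dependencies; `isTransientFailure` is an illustrative name:

```php
<?php

use Illuminate\Http\Client\ConnectionException;

// True when a failure is worth retrying later (timeouts, 5xx, 429),
// false when it looks like a permanent contract or validation problem.
function isTransientFailure(?int $status, ?Throwable $exception = null): bool
{
    if ($exception instanceof ConnectionException) {
        return true; // network timeout or connection refused
    }

    if ($status === null) {
        return true; // no response at all, so treat it as transient
    }

    return $status >= 500 || $status === 429;
}
```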
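To exercise the contract locally, a Laravel feature-test sketch (route path, payload fields, and class name are placeholders):

```php
<?php

namespace Tests\Feature;

use Tests\TestCase;

class WebhookAckTest extends TestCase
{
    public function test_webhook_endpoint_acks_fast(): void
    {
        // Add your signature header here if the endpoint verifies one.
        $response = $this->postJson('/webhooks/sendpromptly', [
            'event_id' => 'evt_sample_123',
            'type'     => 'example.event',
        ]);

        // The endpoint should acknowledge immediately after enqueueing work.
        $response->assertStatus(200)->assertJson(['accepted' => true]);
    }
}
```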
Expected response: 200 with {"accepted":true} (or similar).
Run one live verification with Send a test event and inspect created delivery runs to confirm retries stop after 2xx.
When you should intentionally return non-2xx
Transient dependency outage (DB down, queue down)
If your app cannot durably accept the event (for example, DB unavailable and queue unavailable), return non-2xx so the event is retried later.
Use 429 rate_limited and other status codes to align response semantics with operational intent.
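A sketch of that guard inside the webhook handler, reusing the illustrative `ProcessWebhookEvent` job; `Queue::push` fails immediately if the queue backend is unreachable:

```php
<?php

use App\Jobs\ProcessWebhookEvent;
use Illuminate\Support\Facades\Queue;

// Inside the webhook handler, after authenticity checks:
try {
    // The push itself is synchronous, so a dead queue backend surfaces as an exception here.
    Queue::push(new ProcessWebhookEvent($request->all()));
} catch (\Throwable $e) {
    // We could not durably accept the event, so ask the sender to retry later.
    return response()->json(['error' => 'temporarily_unavailable'], 503)
        ->header('Retry-After', '120'); // a hint; honored by some senders
}

return response()->json(['accepted' => true], 200);
```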
Permanent failure (bad payload contract) - return 2xx + store for manual review
If the payload is structurally valid but semantically unusable for your current contract, returning repeated 500 responses creates a retry loop with no chance of automatic recovery. A safer pattern is: acknowledge (2xx), store the payload with reason, alert, and resolve via replay after contract fixes.
Mini incident: A team returned 500 for a renamed field that would never parse in their old mapper. Retries kept firing for hours; switching to 2xx + manual review stopped the loop and protected queue capacity.
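A sketch of that pattern, assuming an illustrative `failed_webhook_payloads` table you can replay from after fixing the mapping:

```php
<?php

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Inside the webhook handler, when the mapper rejects the payload:
// acknowledge so retries stop, park the payload for review, and alert.
DB::table('failed_webhook_payloads')->insert([
    'event_id'   => $payload['event_id'] ?? null,
    'payload'    => json_encode($payload),
    'reason'     => 'unknown_field_mapping',
    'created_at' => now(),
]);

Log::warning('webhook.contract_mismatch', [
    'event_id' => $payload['event_id'] ?? null,
]);

return response()->json(['accepted' => true], 200); // stop the retry loop
```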
Troubleshooting “stuck in retries”
Use this section to troubleshoot a webhook delivery stuck in a retry loop when attempts keep climbing in Message Log.
Fast checklist
- Confirm your endpoint returns 2xx within a short timeout budget.
- Verify no synchronous downstream calls happen before the response is sent.
- Check whether 429 responses are self-inflicted by your own throttling rules.
- Confirm dedupe/idempotency is active so retries do not multiply side effects.
- Trace one event across attempts using a stable correlation ID.
Observability signals to add
Track p50/p95 endpoint latency, non-2xx rate, timeout rate, retry attempt distribution, dedupe hit ratio, and queue delay for webhook jobs. These signals reveal whether failures are transport-level, contract-level, or downstream-capacity issues.
Common failure modes
- Slow endpoint (you do DB writes + API calls before responding) causes timeouts and retries.
- Returning 500 for permanent payload issues causes endless retries for bad-contract data.
- No idempotency causes duplicates when retries happen, especially after timeouts.
- Retry storm after your service recovers (no jitter or no queue smoothing on your side).
- Rate limiting your own webhook consumer incorrectly (429) amplifies retries.
- Missing correlation IDs means you cannot tie together attempts across logs.
Key takeaways
- 2xx means done for delivery; non-2xx means retry path.
- Exponential backoff plus jitter protects both sender and receiver during incidents.
- Fast ack + async processing is the safest default for webhook receivers.
- Idempotency is mandatory because retries and duplicates are normal behavior.
- Distinguish transient failures from permanent contract failures to avoid retry loops.
Make retries harmless: Add idempotency + async processing, then re-test via Sample Project until Message Log shows clean 2xx success.