Webhook Retries: Backoff & Jitter
If you are looking for an exponential backoff with jitter algorithm example for webhook retries, this guide gives you production-safe defaults. You will map status codes to retry behavior, apply backoff and jitter correctly, and harden your endpoint against duplicate deliveries.
The goal is simple: accept webhooks quickly, avoid retry storms, and keep downstream processing reliable even when dependencies are unstable.
Examples below use Laravel/PHP, but the retry and idempotency patterns are framework-agnostic.
SendPromptly delivery rules (what triggers retries)
2xx stops retries
A 2xx response means the delivery attempt succeeded for that endpoint, so retrying stops for that attempt chain. Keep this contract strict and predictable.
See How SendPromptly defines success and retries for the baseline behavior.
Non-2xx triggers retry attempts with exponential backoff
Any non-2xx status indicates the delivery did not complete successfully and should be retried on a schedule. This includes application errors and upstream outages.
Timeouts vs 5xx vs 4xx (recommended semantics)
Treat network timeouts and 5xx responses as transient by default. Treat most 4xx responses as contract or validation issues, unless the 4xx is intentionally temporary (for example, short-lived auth/rate-limit windows).
| Attempt result | Retry behavior | Endpoint recommendation |
|---|---|---|
| 2xx | Stop retries | Return immediately after enqueueing work |
| Timeout or 5xx | Retry with backoff + jitter | Return non-2xx only when you cannot safely accept |
| Permanent 4xx / payload contract issue | Usually avoid repeated retries | Return 2xx, store for manual review, and fix mapping |
See it live: Trigger an event from your Sample Project, then open Message Log to watch attempt counts and outcomes.
Backoff + jitter (why it exists)
For a practical exponential backoff with jitter algorithm for webhook retries, use exponential delays with a maximum cap, then randomize each wait window to avoid synchronized re-delivery spikes.
Exponential backoff basics
Exponential backoff increases retry delay after each failed attempt (for example, doubling each time) so unhealthy endpoints get time to recover.
Jitter to prevent thundering herd
Without jitter, many failed deliveries retry at the same second and overwhelm recovering systems. With jitter, each retry is spread across a window.
Full jitter vs equal jitter in practice:
Full jitter picks a random delay from 0..current_backoff, while equal jitter keeps half fixed and randomizes the rest. Full jitter usually smooths spikes better during large incident recoveries.
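As a sketch of the difference, assuming delays are tracked in whole seconds (the function names below are illustrative, not part of any SendPromptly SDK):

```php
<?php

// Full jitter: wait anywhere between 0 and the current backoff.
function fullJitter(int $currentBackoffSeconds): int
{
    return random_int(0, $currentBackoffSeconds);
}

// Equal jitter: keep half the backoff fixed, randomize the other half.
function equalJitter(int $currentBackoffSeconds): int
{
    $half = intdiv($currentBackoffSeconds, 2);

    return $half + random_int(0, $half);
}
```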
Example schedules (with caps)
A common webhook retry schedule is 1m, 2m, 4m, 8m with a cap: attempt 1 at ~1 minute, attempt 2 at ~2 minutes, attempt 3 at ~4 minutes, attempt 4 at ~8 minutes, then cap future delays (for example at 15 minutes) and apply jitter each time.
Suggested diagram/visual: a timeline with retries at 1m, 2m, 4m, 8m, then capped intervals, showing how jitter spreads each retry over a window instead of one exact timestamp.
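One way to generate such a schedule, as a sketch that assumes a 60-second base, a 15-minute cap, and full jitter on every attempt (not SendPromptly's exact internals):

```php
<?php

// Delay (in seconds) before a given retry attempt (1-based), capped and jittered.
function retryDelaySeconds(int $attempt, int $baseSeconds = 60, int $capSeconds = 900): int
{
    // Exponential growth: 1m, 2m, 4m, 8m, ... capped at 15 minutes.
    $backoff = min($capSeconds, $baseSeconds * (2 ** ($attempt - 1)));

    // Full jitter spreads each retry across the whole window instead of one exact timestamp.
    return random_int(0, $backoff);
}

// Example: print a jittered schedule for the first six attempts.
foreach (range(1, 6) as $attempt) {
    printf("attempt %d: wait %d s\n", $attempt, retryDelaySeconds($attempt));
}
```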
What to implement on your endpoint
“Ack fast, process async”
Return success quickly after authenticity checks, then move heavy work to a queue. Do not block the webhook response on DB-heavy or third-party calls.
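A minimal Laravel sketch of this pattern, assuming an illustrative `ProcessWebhookEvent` job and an HMAC signature header; route, class, and config names are placeholders for your own setup:

```php
<?php

namespace App\Http\Controllers;

use App\Jobs\ProcessWebhookEvent;
use Illuminate\Http\Request;

class WebhookController extends Controller
{
    public function handle(Request $request)
    {
        // Verify authenticity first, then acknowledge quickly.
        abort_unless($this->signatureIsValid($request), 401);

        // Heavy work (DB writes, third-party calls) goes to the queue.
        ProcessWebhookEvent::dispatch($request->all());

        // Respond immediately so the sender records a 2xx and stops retrying.
        return response()->json(['accepted' => true], 200);
    }

    private function signatureIsValid(Request $request): bool
    {
        // Compare the provider's signature header to our own HMAC of the raw body.
        $expected = hash_hmac('sha256', $request->getContent(), (string) config('services.webhooks.secret'));

        return hash_equals($expected, (string) $request->header('X-Signature', ''));
    }
}
```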
Idempotency keys / dedupe table
Retries can deliver the same event more than once, so enforce idempotency in storage and workers. Use a stable dedupe key (event ID or a deterministic payload hash), and make downstream writes idempotent.
For ingestion behavior, see Ingestion idempotency (24-hour TTL) and why it matters.
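One way to enforce dedupe in the queued worker, as a sketch that assumes the provider sends a stable event ID and uses Laravel's cache with a 24-hour TTL (your key and store may differ):

```php
<?php

use Illuminate\Support\Facades\Cache;

// Returns true the first time an event ID is seen within 24 hours,
// false for any duplicate delivery of the same event.
function claimEventOnce(string $eventId): bool
{
    // Cache::add only writes when the key is absent, so only the first delivery wins
    // (atomic on stores such as Redis).
    return Cache::add("webhook:seen:{$eventId}", true, now()->addDay());
}

// In the queued job:
// if (! claimEventOnce($payload['event_id'])) {
//     return; // duplicate delivery - side effects were already applied
// }
```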
Retry-aware logging (attempt number, latency)
Log attempt number, HTTP status, endpoint latency, and correlation identifiers so one delivery chain can be traced end-to-end without exposing secrets.
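A sketch of what those log lines could capture, assuming the provider passes event and attempt identifiers in headers (the header names here are illustrative):

```php
<?php

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;

function logWebhookAttempt(Request $request, int $status, float $startedAt): void
{
    Log::info('webhook.received', [
        // Correlation ID ties one delivery chain together across attempts.
        'event_id'   => $request->header('X-Event-Id'),
        // Attempt number shows whether failures cluster on first or later tries.
        'attempt'    => (int) $request->header('X-Attempt', 1),
        'status'     => $status,
        'latency_ms' => (int) round((microtime(true) - $startedAt) * 1000),
    ]);
}
```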
Optional transient classification helper:
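A sketch, assuming you classify by the HTTP status and exception type observed when calling your own downstream dependencies; `isTransientFailure` is an illustrative name:

```php
<?php

use Illuminate\Http\Client\ConnectionException;

// True when a failure is worth retrying later (timeouts, 5xx, 429),
// false when it looks like a permanent contract or validation problem.
function isTransientFailure(?int $status, ?Throwable $exception = null): bool
{
    if ($exception instanceof ConnectionException) {
        return true; // network timeout or connection refused
    }

    if ($status === null) {
        return true; // no response at all, so treat it as transient
    }

    return $status >= 500 || $status === 429;
}
```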
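To exercise the contract locally, a Laravel feature-test sketch (route path, payload fields, and class name are placeholders):

```php
<?php

namespace Tests\Feature;

use Tests\TestCase;

class WebhookAckTest extends TestCase
{
    public function test_webhook_endpoint_acks_fast(): void
    {
        // Add your signature header here if the endpoint verifies one.
        $response = $this->postJson('/webhooks/sendpromptly', [
            'event_id' => 'evt_sample_123',
            'type'     => 'example.event',
        ]);

        // The endpoint should acknowledge immediately after enqueueing work.
        $response->assertStatus(200)->assertJson(['accepted' => true]);
    }
}
```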
Expected response: 200 with {"accepted":true} (or similar).
Run one live verification with Send a test event and inspect created delivery runs to confirm retries stop after 2xx.
When you should intentionally return non-2xx
Transient dependency outage (DB down, queue down)
If your app cannot durably accept the event (for example, DB unavailable and queue unavailable), return non-2xx so the event is retried later.
Use 429 rate_limited and other status codes to align response semantics with operational intent.
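A sketch of that guard inside the webhook handler, reusing the illustrative `ProcessWebhookEvent` job; `Queue::push` fails immediately if the queue backend is unreachable:

```php
<?php

use App\Jobs\ProcessWebhookEvent;
use Illuminate\Support\Facades\Queue;

// Inside the webhook handler, after authenticity checks:
try {
    // The push itself is synchronous, so a dead queue backend surfaces as an exception here.
    Queue::push(new ProcessWebhookEvent($request->all()));
} catch (\Throwable $e) {
    // We could not durably accept the event, so ask the sender to retry later.
    return response()->json(['error' => 'temporarily_unavailable'], 503)
        ->header('Retry-After', '120'); // a hint; honored by some senders
}

return response()->json(['accepted' => true], 200);
```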
Permanent failure (bad payload contract) - return 2xx + store for manual review
If the payload is structurally valid but semantically unusable for your current contract, returning repeated 500 responses creates a retry loop with no chance of automatic recovery. A safer pattern is: acknowledge (2xx), store the payload with reason, alert, and resolve via replay after contract fixes.
Mini incident: A team returned 500 for a renamed field that would never parse in their old mapper. Retries kept firing for hours; switching to 2xx + manual review stopped the loop and protected queue capacity.
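A sketch of that pattern, assuming an illustrative `failed_webhook_payloads` table you can replay from after fixing the mapping:

```php
<?php

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Inside the webhook handler, when the mapper rejects the payload:
// acknowledge so retries stop, park the payload for review, and alert.
DB::table('failed_webhook_payloads')->insert([
    'event_id'   => $payload['event_id'] ?? null,
    'payload'    => json_encode($payload),
    'reason'     => 'unknown_field_mapping',
    'created_at' => now(),
]);

Log::warning('webhook.contract_mismatch', [
    'event_id' => $payload['event_id'] ?? null,
]);

return response()->json(['accepted' => true], 200); // stop the retry loop
```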
Troubleshooting “stuck in retries”
Use this section to troubleshoot a webhook delivery stuck in a retry loop when attempts keep climbing in Message Log.
Fast checklist
- Confirm your endpoint returns 2xx within a short timeout budget.
- Verify no synchronous downstream calls happen before the response is sent.
- Check whether 429 responses are self-inflicted by your own throttling rules.
- Confirm dedupe/idempotency is active so retries do not multiply side effects.
- Trace one event across attempts using a stable correlation ID.
Observability signals to add
Track p50/p95 endpoint latency, non-2xx rate, timeout rate, retry attempt distribution, dedupe hit ratio, and queue delay for webhook jobs. These signals reveal whether failures are transport-level, contract-level, or downstream-capacity issues.
Common failure modes
- Slow endpoint (you do DB writes + API calls before responding) causes timeouts and retries.
- Returning 500 for permanent payload issues causes endless retries for bad-contract data.
- No idempotency causes duplicates when retries happen, especially after timeouts.
- Retry storm after your service recovers (no jitter or no queue smoothing on your side).
- Rate limiting your own webhook consumer incorrectly (429) amplifies retries.
- Missing correlation IDs means you cannot tie together attempts across logs.
Key takeaways
- 2xx means done for delivery; non-2xx means retry path.
- Exponential backoff plus jitter protects both sender and receiver during incidents.
- Fast ack + async processing is the safest default for webhook receivers.
- Idempotency is mandatory because retries and duplicates are normal behavior.
- Distinguish transient failures from permanent contract failures to avoid retry loops.
Make retries harmless: Add idempotency + async processing, then re-test via Sample Project until Message Log shows clean 2xx success.