Monitoring & Alerting for Webhook Deliveries (What to Alert On)
Alerting for webhook delivery should point engineers to action — not generate noise. This guide shows the signals to monitor, sane alert rules, and a Laravel implementation pattern you can ship today.
You’ll set up alerts on failure rate, retry backlog, latency/timeouts, and signature failures so on-call only sees actionable incidents.
The goal: alert only when action is needed
Avoid noisy alerts; surface sustained problems that require intervention.
Avoid noisy “single retry” alerts
Single retries are normal — alert on sustained or rate-based conditions.
Alert on sustained failure patterns
Alert when failures exceed a threshold for a sustained window or when retries backlog grows.
Common gotcha: Triggering global alerts for a single endpoint failure — scope alerts by endpoint.
Key signals to monitor
Monitor a small set of high-signal metrics and logs.
Failure rate per endpoint
Track 5xx and 4xx separately and alert on elevated 5xx.
Retry backlog (runs stuck in retrying)
Alert when the number of retrying runs exceeds a threshold or grows rapidly.
Latency / timeout rate (p95 / p99)
High p95/p99 latency or timeout spikes indicate downstream slowness.
4xx vs 5xx split (contract bug vs outage)
4xx → contract/credential issue. 5xx → receiver outage.
Signature verification failures (security signal)
Track `invalid_signature` to detect secret rotation or attack attempts.
Small comparison table:
| Signal | What it usually means | Pager? |
|---|---|---|
| Elevated 5xx | Receiver outage | Yes |
| Elevated 4xx | Contract/credential issue | No (assign to API owner) |
| Spike in timeouts | Network or app slowness | Yes |
| Signature failures | Secret rotation or spoofing | Yes |
Where to measure
Split responsibilities between consumer-side metrics and SendPromptly delivery logs.
In your consumer (first-party metrics)
Emit per-endpoint counters for accepted/deduped/errors/timeouts.
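A minimal sketch of what that can look like in a Laravel consumer (the route path, payload shape, and cache-based dedup are illustrative assumptions; the `Log::` calls mark the spots where you would also increment counters):

```php
<?php

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Route;

// Hypothetical consumer endpoint: classify every delivery into one outcome
// (accepted / deduped / error) and record it with an endpoint dimension.
Route::post('/webhooks/sendpromptly', function (Request $request) {
    $endpoint = 'sendpromptly';
    $eventId  = (string) $request->input('id');

    // Dedup on the event id so sender retries don't double-process.
    if ($eventId !== '' && cache()->has("webhook:seen:{$eventId}")) {
        Log::info('webhook.deduped', ['endpoint' => $endpoint, 'event_id' => $eventId]);
        return response()->noContent(); // 204 is still 2xx = delivered
    }

    try {
        // ... process the event ...

        if ($eventId !== '') {
            cache()->put("webhook:seen:{$eventId}", true, now()->addDay());
        }
        Log::info('webhook.accepted', ['endpoint' => $endpoint, 'event_id' => $eventId]);
        return response()->noContent();
    } catch (\Throwable $e) {
        Log::error('webhook.error', [
            'endpoint' => $endpoint,
            'event_id' => $eventId,
            'error'    => $e->getMessage(),
        ]);
        return response('error', 500); // non-2xx tells the sender to retry
    }
});
```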
In SendPromptly delivery logs (ground truth for attempts)
Use Message Log as the source of truth for retries and delivery attempts.
Micro checklist:
- Emit `webhook.delivery` logs/metrics locally
- Correlate delivery log events with your metrics
- Add an endpoint dimension to alerts
Suggested alert rules (sane defaults)
- Endpoint failure rate > 5% for 5 minutes → Alert
- Consecutive failed attempts >= 10 for the same run → Alert
- Timeout rate > 1% or p95 latency > threshold → Alert
- Sudden spike in 401/403 → Secret mismatch alert
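If those signals are exported as Prometheus metrics, the first rule might look like the sketch below; the metric name `webhook_delivery_total` and its `outcome`/`endpoint` labels are assumptions, not a documented schema:

```yaml
groups:
  - name: webhook-delivery
    rules:
      # >5% of deliveries failing over 5 minutes, scoped per endpoint.
      - alert: WebhookEndpointFailureRateHigh
        expr: |
          sum by (endpoint) (rate(webhook_delivery_total{outcome="error"}[5m]))
            / sum by (endpoint) (rate(webhook_delivery_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          # Keep a runbook link in every alert (placeholder URL).
          runbook: https://example.com/runbooks/webhook-delivery
```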
Use the Sample Project to simulate a failure and confirm the run shows retries in Message Log.
Implementation in Laravel
Emit structured logs
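A minimal sketch in Laravel (field names and the correlation header are illustrative):

```php
use Illuminate\Support\Facades\Log;

// One structured line per delivery outcome, carrying the dimensions the
// alert rules need: endpoint, outcome, status, and a correlation id.
Log::info('webhook.delivery', [
    'endpoint'       => 'billing-partner',
    'outcome'        => 'error',    // accepted | deduped | error | timeout
    'status'         => 503,
    'attempt'        => 4,
    'duration_ms'    => 1840,
    'correlation_id' => request()->header('X-Correlation-Id'),
]);
```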
Increment counters per outcome
Use a metrics client (Prometheus/StatsD) to increment `webhook.delivery.{outcome}` with `endpoint` and `correlation_id` tags.
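For example, via a thin abstraction bound to your real client (the `Metrics` interface below is hypothetical, not a specific package's API):

```php
// Hypothetical metrics abstraction; bind your real Prometheus/StatsD
// client to it in a service provider.
interface Metrics
{
    public function increment(string $name, array $tags = []): void;
}

// At each delivery outcome (here: a timeout on the billing endpoint):
app(Metrics::class)->increment('webhook.delivery.timeout', [
    'endpoint'       => 'billing-partner', // lets alerts scope per endpoint
    'correlation_id' => request()->header('X-Correlation-Id'), // hypothetical header
]);
```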
Common gotcha: Alerting on any retry — retries are normal; alert on sustained patterns instead.
Incident runbook
- Use filters to find the run → inspect last attempt.
- Decide: replay vs fix endpoint vs roll back change.
- If replaying, confirm idempotency and signature verification.
Keep a runbook link in every alert so responders immediately open Message Log and the relevant run.
Test steps
- Generate a failure to validate the alert pipeline
  - Temporarily return `500` from your consumer (see the sketch after this list).
  - Trigger one event.
  - Confirm retries appear (delivery attempts continue until success).
- Generate an invalid signature
  - Send a webhook request without valid signature headers.
  - Expected: `401 invalid_signature`
  - Confirm your log/metric pipeline captures `webhook.invalid_signature` and you can alert on spikes.
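A quick way to force the first failure in a Laravel consumer (a temporary test route, assuming the path from earlier; remove it after the drill):

```php
use Illuminate\Support\Facades\Route;

// Temporary route that always fails, so every delivery attempt gets a 500
// and you can watch the sender's retry/backoff path trip your alerts.
Route::post('/webhooks/sendpromptly', function () {
    return response('simulated outage', 500);
});
```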
Common failure modes
- Alerting on any retry (noisy; retries are normal).
- No separation of 4xx vs 5xx (different owners: contract vs ops).
- Not tracking latency/timeouts (timeouts look like random failures).
- No endpoint dimension (one partner endpoint is failing, but alerts look global).
- Ignoring signature failures (security regression or secret rotation mismatch).
- No runbook link in alerts (engineers waste time deciding what to do).
Related reading
Delivery fundamentals for alert design: webhook success is HTTP 2xx, non-2xx triggers retries via exponential backoff; webhooks are signed (`X-SP-Timestamp`, `X-SP-Signature`) and should be verified.
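As an illustration only, here is a verification sketch assuming an HMAC-SHA256 signature over `timestamp.body`; check the actual signing scheme in the provider docs before relying on this:

```php
use Illuminate\Http\Request;

// Illustrative only: the HMAC-SHA256-over-"timestamp.body" scheme and the
// five-minute replay tolerance are assumptions, not a documented contract.
function verifyWebhookSignature(Request $request, string $secret): bool
{
    $timestamp = $request->header('X-SP-Timestamp', '');
    $signature = $request->header('X-SP-Signature', '');

    // Reject stale timestamps to limit the replay window.
    if ($timestamp === '' || abs(time() - (int) $timestamp) > 300) {
        return false;
    }

    $expected = hash_hmac('sha256', $timestamp . '.' . $request->getContent(), $secret);

    // Constant-time comparison avoids timing side channels.
    return hash_equals($expected, $signature);
}
```

On failure, respond with `401 invalid_signature` and emit `webhook.invalid_signature` so the spike alert above can fire.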
- Delivery logs & debugging overview
- Retries, success rules, signature headers
- Find failing runs fast
- Replay after fix
- Trace alerts to the exact run
Key takeaways
- Alert on sustained failure patterns, not single retries.
- Separate 4xx (contract) from 5xx (outage) in alerts.
- Monitor retry backlog, latency (p95/p99), and signature failures.
- Add endpoint dimensions and a runbook link to every alert.
When an alert fires, go straight to Message Log → filter failed/retrying → inspect the last attempt → replay after fix.