Monitoring & Alerting for Webhook Deliveries (What to Alert On)
Alerting for webhook delivery should point engineers to action — not generate noise. This guide shows the signals to monitor, sane alert rules, and a Laravel implementation pattern you can ship today.
You’ll set up alerts on failure rate, retry backlog, latency/timeouts, and signature failures so on-call only sees actionable incidents.
The goal: alert only when action is needed
Avoid noisy alerts; surface sustained problems that require intervention.
Avoid noisy “single retry” alerts
Single retries are normal — alert on sustained or rate-based conditions.
Alert on sustained failure patterns
Alert when failures exceed a threshold for a sustained window or when retries backlog grows.
Common gotcha: Triggering global alerts for a single endpoint failure — scope alerts by endpoint.
Key signals to monitor
Monitor a small set of high-signal metrics and logs.
Failure rate per endpoint
Track 5xx and 4xx separately and alert on elevated 5xx.
Retry backlog (runs stuck in retrying)
Alert when the number of retrying runs exceeds a threshold or grows rapidly.
Latency / timeout rate (p95 / p99)
High p95/p99 latency or timeout spikes indicate downstream slowness.
4xx vs 5xx split (contract bug vs outage)
4xx → contract/credential issue. 5xx → receiver outage.
Signature verification failures (security signal)
Track `invalid_signature` to detect secret rotation or attack attempts.
Small comparison table:
| Signal | What it usually means | Pager? |
|---|---|---|
| Elevated 5xx | Receiver outage | Yes |
| Elevated 4xx | Contract/credential issue | No (assign to API owner) |
| Spike in timeouts | Network or app slowness | Yes |
| Signature failures | Secret rotation or spoofing | Yes |
Where to measure
Split responsibilities between consumer-side metrics and SendPromptly delivery logs.
In your consumer (first-party metrics)
Emit per-endpoint counters for accepted/deduped/errors/timeouts.
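A minimal sketch of what that can look like in a Laravel consumer (the route path, payload shape, and cache-based dedup are illustrative assumptions; the `Log::` calls mark the spots where you would also increment counters):

```php
<?php

use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Route;

// Hypothetical consumer endpoint: classify every delivery into one outcome
// (accepted / deduped / error) and record it with an endpoint dimension.
Route::post('/webhooks/sendpromptly', function (Request $request) {
    $endpoint = 'sendpromptly';
    $eventId  = (string) $request->input('id');

    // Dedup on the event id so sender retries don't double-process.
    if ($eventId !== '' && cache()->has("webhook:seen:{$eventId}")) {
        Log::info('webhook.deduped', ['endpoint' => $endpoint, 'event_id' => $eventId]);
        return response()->noContent(); // 204 is still 2xx = delivered
    }

    try {
        // ... process the event ...

        if ($eventId !== '') {
            cache()->put("webhook:seen:{$eventId}", true, now()->addDay());
        }
        Log::info('webhook.accepted', ['endpoint' => $endpoint, 'event_id' => $eventId]);
        return response()->noContent();
    } catch (\Throwable $e) {
        Log::error('webhook.error', [
            'endpoint' => $endpoint,
            'event_id' => $eventId,
            'error'    => $e->getMessage(),
        ]);
        return response('error', 500); // non-2xx tells the sender to retry
    }
});
```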
In SendPromptly delivery logs (ground truth for attempts)
Use Message Log as the source of truth for retries and delivery attempts.
Micro checklist:
- Emit `webhook.delivery` logs/metrics locally
- Correlate delivery log events with your metrics
- Add an endpoint dimension to alerts
Suggested alert rules (sane defaults)
- Endpoint failure rate > 5% for 5 minutes → Alert
- Consecutive failed attempts >= 10 for the same run → Alert
- Timeout rate > 1% or p95 latency > threshold → Alert
- Sudden spike in 401/403 → Secret mismatch alert
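If those signals are exported as Prometheus metrics, the first rule might look like the sketch below; the metric name `webhook_delivery_total` and its `outcome`/`endpoint` labels are assumptions, not a documented schema:

```yaml
groups:
  - name: webhook-delivery
    rules:
      # >5% of deliveries failing over 5 minutes, scoped per endpoint.
      - alert: WebhookEndpointFailureRateHigh
        expr: |
          sum by (endpoint) (rate(webhook_delivery_total{outcome="error"}[5m]))
            / sum by (endpoint) (rate(webhook_delivery_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          # Keep a runbook link in every alert (placeholder URL).
          runbook: https://example.com/runbooks/webhook-delivery
```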
Use the Sample Project to simulate a failure and confirm the run shows retries in Message Log.
Implementation in Laravel
Emit structured logs
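A minimal sketch in Laravel (field names and the correlation header are illustrative):

```php
use Illuminate\Support\Facades\Log;

// One structured line per delivery outcome, carrying the dimensions the
// alert rules need: endpoint, outcome, status, and a correlation id.
Log::info('webhook.delivery', [
    'endpoint'       => 'billing-partner',
    'outcome'        => 'error',    // accepted | deduped | error | timeout
    'status'         => 503,
    'attempt'        => 4,
    'duration_ms'    => 1840,
    'correlation_id' => request()->header('X-Correlation-Id'),
]);
```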
Increment counters per outcome
Use a metrics client (Prometheus/StatsD) to increment `webhook.delivery.{outcome}` with `endpoint` and `correlation_id` tags.
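For example, via a thin abstraction bound to your real client (the `Metrics` interface below is hypothetical, not a specific package's API):

```php
// Hypothetical metrics abstraction; bind your real Prometheus/StatsD
// client to it in a service provider.
interface Metrics
{
    public function increment(string $name, array $tags = []): void;
}

// At each delivery outcome (here: a timeout on the billing endpoint):
app(Metrics::class)->increment('webhook.delivery.timeout', [
    'endpoint'       => 'billing-partner', // lets alerts scope per endpoint
    'correlation_id' => request()->header('X-Correlation-Id'), // hypothetical header
]);
```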
Common gotcha: Alerting on any retry — retries are normal; alert on sustained patterns instead.
Incident runbook
- Use filters to find the run → inspect last attempt.
- Decide: replay vs fix endpoint vs roll back change.
- If replaying, confirm idempotency and signature verification.
Keep a runbook link in every alert so responders immediately open Message Log and the relevant run.
Test steps
- Generate a failure to validate the alert pipeline
  - Temporarily return `500` from your consumer (see the sketch after this list).
  - Trigger one event.
  - Confirm retries appear (delivery attempts continue until success).
- Generate an invalid signature
  - Send a webhook request without valid signature headers.
  - Expected: `401 invalid_signature`
  - Confirm your log/metric pipeline captures `webhook.invalid_signature` and you can alert on spikes.
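A quick way to force the first failure in a Laravel consumer (a temporary test route, assuming the path from earlier; remove it after the drill):

```php
use Illuminate\Support\Facades\Route;

// Temporary route that always fails, so every delivery attempt gets a 500
// and you can watch the sender's retry/backoff path trip your alerts.
Route::post('/webhooks/sendpromptly', function () {
    return response('simulated outage', 500);
});
```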
Common failure modes
- Alerting on any retry (noisy; retries are normal).
- No separation of 4xx vs 5xx (different owners: contract vs ops).
- Not tracking latency/timeouts (timeouts look like random failures).
- No endpoint dimension (one partner endpoint is failing, but alerts look global).
- Ignoring signature failures (security regression or secret rotation mismatch).
- No runbook link in alerts (engineers waste time deciding what to do).
Related reading
Delivery fundamentals for alert design: webhook success is HTTP 2xx, non-2xx triggers retries via exponential backoff; webhooks are signed (`X-SP-Timestamp`, `X-SP-Signature`) and should be verified.
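As an illustration only, here is a verification sketch assuming an HMAC-SHA256 signature over `timestamp.body`; check the actual signing scheme in the provider docs before relying on this:

```php
use Illuminate\Http\Request;

// Illustrative only: the HMAC-SHA256-over-"timestamp.body" scheme and the
// five-minute replay tolerance are assumptions, not a documented contract.
function verifyWebhookSignature(Request $request, string $secret): bool
{
    $timestamp = $request->header('X-SP-Timestamp', '');
    $signature = $request->header('X-SP-Signature', '');

    // Reject stale timestamps to limit the replay window.
    if ($timestamp === '' || abs(time() - (int) $timestamp) > 300) {
        return false;
    }

    $expected = hash_hmac('sha256', $timestamp . '.' . $request->getContent(), $secret);

    // Constant-time comparison avoids timing side channels.
    return hash_equals($expected, $signature);
}
```

On failure, respond with `401 invalid_signature` and emit `webhook.invalid_signature` so the spike alert above can fire.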
- Delivery logs & debugging overview
- Retries, success rules, signature headers
- Find failing runs fast
- Replay after fix
- Trace alerts to the exact run
Key takeaways
- Alert on sustained failure patterns, not single retries.
- Separate 4xx (contract) from 5xx (outage) in alerts.
- Monitor retry backlog, latency (p95/p99), and signature failures.
- Add endpoint dimensions and a runbook link to every alert.
When an alert fires, go straight to Message Log → filter failed/retrying → inspect the last attempt → replay after fix.