
Why we built our own webhook relay.

Signed payloads, exponential backoff, replay protection — and why the existing services didn't fit our latency budget.

FIG.08 — WEBHOOK RELAY ARCHITECTURE

Webhook delivery is one of those problems that looks trivial on a whiteboard and produces a postmortem the first time a customer's endpoint goes down for an afternoon. We wrote our own relay last quarter, and it has been the highest-leverage piece of infrastructure work we've done all year.

What we needed

  1. Signed payloads. HMAC-SHA256 with a per-tenant secret. Standard, but easy to get wrong if you sign the wrong bytes.
  2. Retries with exponential backoff. Customer endpoints fail. We retry with jitter, cap the backoff at four hours, and give up after 48 hours, at which point the event goes to a dead-letter queue.
  3. Replay protection. Every payload includes a timestamp and a nonce. Customers can reject anything older than five minutes.
  4. At-least-once delivery. Idempotency keys included in every payload. Customers de-duplicate on their side.
  5. Per-tenant isolation. A slow customer endpoint can't cause head-of-line blocking for other tenants. Each tenant has its own worker pool with bounded concurrency.
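
To make items 1 and 3 concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the relay's actual code: the function names, the `timestamp.nonce.body` signing layout, and the in-memory nonce set are all hypothetical. The point it shows is "sign the right bytes": the signature covers the exact body bytes sent on the wire, and the receiver verifies those same bytes before parsing anything.

```python
import hashlib
import hmac
from typing import Optional
import time

MAX_AGE_SECONDS = 5 * 60  # the five-minute replay window from item 3


def sign(secret: bytes, body: bytes, timestamp: int, nonce: str) -> str:
    """HMAC-SHA256 over timestamp, nonce, and the raw body bytes.

    Assumed layout: b"<timestamp>.<nonce>." + body. Signing the raw bytes
    (not a re-serialized JSON object) keeps sender and receiver in agreement.
    """
    message = f"{timestamp}.{nonce}.".encode() + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    body: bytes,
    timestamp: int,
    nonce: str,
    signature: str,
    seen_nonces: set,
    now: Optional[float] = None,
) -> bool:
    """Receiver-side check: freshness, then signature, then nonce reuse."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_AGE_SECONDS:
        return False  # too old (or too far in the future)
    expected = sign(secret, body, timestamp, nonce)
    if not hmac.compare_digest(expected, signature):
        return False  # wrong secret, or the bytes were altered
    if nonce in seen_nonces:
        return False  # valid signature, but already delivered once
    # Record the nonce only after the signature checks out, so forged
    # requests cannot fill the set.
    seen_nonces.add(nonce)
    return True
```

In a real receiver the nonce set would need to be shared storage with an expiry of at least the replay window; a plain `set` is used here only to keep the sketch self-contained.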
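
The retry policy in item 2 can be sketched the same way. The four-hour cap and the 48-hour give-up window come from the list above; the 30-second base delay, the function names, and the choice of "full jitter" (a uniformly random delay between zero and the capped exponential ceiling) are assumptions made for illustration, since the post does not specify them.

```python
import random
from typing import Optional

BASE_DELAY = 30.0                # seconds; hypothetical starting delay
MAX_DELAY = 4 * 60 * 60          # backoff cap: four hours
GIVE_UP_AFTER = 48 * 60 * 60     # stop retrying after 48 hours


def next_delay(attempt: int, rng=random) -> float:
    """Delay before retry number `attempt` (0-based), with full jitter."""
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** min(attempt, 20)))
    return rng.uniform(0, ceiling)


def schedule_retry(
    attempt: int, first_attempt_at: float, now: float, rng=random
) -> Optional[float]:
    """Return when to retry, or None if the event should be dead-lettered."""
    retry_at = now + next_delay(attempt, rng)
    if retry_at - first_attempt_at > GIVE_UP_AFTER:
        return None  # past the 48-hour window: send to the dead-letter queue
    return retry_at
```

Jitter matters because an endpoint that was down for an afternoon tends to come back to a synchronized burst of retries; randomizing the delay spreads that burst out.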

Why not a third-party service

We looked at hosted webhook-delivery services. They're good products, and we'd recommend them to teams that don't want to run their own relay. The reason we built ours in-house instead was the extra network hop: every additional service in the delivery path adds latency we can't get back, on what is fundamentally a fan-out of one HTTP request per submission.

There's also the supply-chain dimension. Webhook delivery is the most security-critical part of our pipeline besides authentication. We wanted to own the signing path end to end, and to keep custody of customer signing secrets within our own boundary.

Webhook delivery is too important to outsource and too boring to be exciting. That's a sign you should build it.

What's tricky

The hardest part isn't the delivery logic — it's the observability. When a customer reports 'I'm not getting webhooks,' the answer can be 'your endpoint is returning 500s,' 'your signature is failing,' 'you've been rate-limited,' or 'we're not sending them.' We built a delivery-log UI in the dashboard before we built the retry logic, because we knew we'd need it.
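
As a rough illustration of what a delivery log has to answer, the sketch below maps a tenant's recorded attempts onto the four answers above. Everything in it is hypothetical: the `Attempt` record, the `diagnose` function, and especially the status-code mapping (401/403 read as a signature rejection, 429 as rate limiting) are assumptions about how a customer endpoint might respond, not a description of the dashboard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Diagnosis(Enum):
    NOT_SENT = "we're not sending them"
    ENDPOINT_ERROR = "your endpoint is failing"
    SIGNATURE_REJECTED = "your signature check is failing"
    RATE_LIMITED = "you've been rate-limited"
    DELIVERED = "delivered"


@dataclass(frozen=True)
class Attempt:
    # HTTP status returned by the customer endpoint;
    # None means no response at all (timeout, connection refused).
    status_code: Optional[int]


def diagnose(attempts: Sequence[Attempt]) -> Diagnosis:
    """Classify a delivery by its most recent logged attempt."""
    if not attempts:
        return Diagnosis.NOT_SENT  # nothing logged: the problem is on our side
    code = attempts[-1].status_code
    if code is not None and 200 <= code < 300:
        return Diagnosis.DELIVERED
    if code == 429:
        return Diagnosis.RATE_LIMITED
    if code in (401, 403):
        return Diagnosis.SIGNATURE_REJECTED
    # 5xx, other unexpected statuses, and missing responses
    return Diagnosis.ENDPOINT_ERROR
```

The useful property is that an empty log is itself an answer: if no attempt was recorded, the fault is upstream of the customer's endpoint.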


If you're building a SaaS that ships webhooks, write your own relay. Or buy a third-party one. But pick deliberately. The implicit choice — 'we'll add retries later' — is the worst of both worlds.
