Designing Webhook Integrations That Don't Break in Production

Paul Obasi
April 5, 2020
APIs

Webhooks look simple on a whiteboard. Something happens, you send a message to a web address, and you are done. In production, webhooks turned out to be one of the most fragile parts of any integration we built, because they fail silently by default. A partner's address goes down for an hour, and unless you planned for that, nobody finds out until a customer complains weeks later.

The moment we decided a webhook mattered enough to build, we treated it as something worth logging completely. Every outbound call is saved with its payload, its destination, the response we got back, and a timestamp before we consider the job finished. We record successful calls as well as failed ones. This one habit turned a question like "did the partner get told about order 4471" from something nobody could answer into a five-second search through our own records.
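A minimal sketch of that habit in Python, using only the standard library. The table layout, the function names, and the injected `send` callable are all assumptions made for illustration; the post does not describe the actual schema or HTTP client.

```python
import json
import sqlite3
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical sender type: takes (url, body) and returns (status, response_body).
Sender = Callable[[str, str], Tuple[int, str]]


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open the delivery log and create the (assumed) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS deliveries (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               event_id TEXT NOT NULL,
               url TEXT NOT NULL,
               payload TEXT NOT NULL,
               status INTEGER,
               response_body TEXT,
               sent_at REAL NOT NULL
           )"""
    )
    return conn


def deliver_and_log(conn: sqlite3.Connection, send: Sender, event_id: str,
                    url: str, payload: dict) -> Optional[int]:
    """Send one webhook and record the outcome, whether it succeeded or not."""
    body = json.dumps(payload, sort_keys=True)
    sent_at = time.time()
    try:
        status, response_body = send(url, body)
    except Exception as exc:
        # No HTTP response at all (timeout, DNS failure): still write a row.
        status, response_body = None, repr(exc)
    conn.execute(
        "INSERT INTO deliveries (event_id, url, payload, status, response_body, sent_at)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (event_id, url, body, status, response_body, sent_at),
    )
    conn.commit()
    return status


def find_by_event(conn: sqlite3.Connection, event_id: str) -> List[tuple]:
    """Answer 'did the partner get told about this event?' from our own records."""
    return conn.execute(
        "SELECT url, status, response_body, sent_at FROM deliveries"
        " WHERE event_id = ? ORDER BY sent_at",
        (event_id,),
    ).fetchall()
```

Passing the sender in as a parameter keeps the logging logic independent of any particular HTTP library and makes it easy to exercise with a fake.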

A receiving address that is down for thirty seconds should not cause us to lose the message, but a careless retry loop can turn a short partner outage into a flood of retries that clogs our own queue. We use a backoff plan that waits longer between each retry, with a hard limit on the number of attempts. Once that limit is reached, the failure is shown to an actual person instead of being quietly dropped. Retrying forever and never retrying are both wrong choices. The system needs a clear point where it stops trying and asks for help.
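One way to sketch such a plan is exponential backoff with a capped delay and a fixed attempt limit. The specific numbers (1 second base, doubling, 300 second cap, 6 attempts) and the `attempt` and `escalate` callables are assumptions for illustration, not values from our system. Jitter is left out to keep the example short.

```python
import time
from typing import Callable, List


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 300.0, max_attempts: int = 6) -> List[float]:
    """Waits between attempts: one fewer delay than there are attempts."""
    return [min(cap, base * factor ** i) for i in range(max_attempts - 1)]


def deliver_with_retries(attempt: Callable[[], bool],
                         escalate: Callable[[int], None],
                         delays: List[float],
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try once, then retry after each delay; escalate when attempts run out."""
    attempts = 1
    if attempt():
        return True
    for delay in delays:
        sleep(delay)
        attempts += 1
        if attempt():
            return True
    # Hard limit reached: hand the failure to a person instead of dropping it.
    escalate(attempts)
    return False
```

With the defaults, `backoff_delays()` yields waits of 1, 2, 4, 8, and 16 seconds, so a delivery is attempted six times over roughly half a minute before `escalate` is called. A production queue would usually schedule the next attempt rather than sleep in-process.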

It used to be tempting to skip signing a webhook that felt low-stakes: an internal notice, say, or something with no money involved. We stopped doing that. The cost of adding a signed header is small. The cost of explaining to a partner's security team why our messages were never signed is not small at all. We made signing part of how our webhook system works by default, not a choice each developer makes on their own for each new integration.
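A signed header can be as small as an HMAC-SHA256 over the request body with a shared secret. The sketch below assumes that scheme; the header names `X-Webhook-Signature` and `X-Webhook-Timestamp`, the `sha256=` prefix, and the choice to sign the timestamp together with the body are illustrative conventions, not a description of our actual wire format.

```python
import hashlib
import hmac
from typing import Dict


def sign(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over 'timestamp.body' using the shared secret."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def signed_headers(secret: bytes, timestamp: str, body: bytes) -> Dict[str, str]:
    """Headers the sender attaches to every outbound webhook (names are hypothetical)."""
    return {
        "X-Webhook-Timestamp": timestamp,
        "X-Webhook-Signature": "sha256=" + sign(secret, timestamp, body),
    }


def verify(secret: bytes, headers: Dict[str, str], body: bytes) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    timestamp = headers.get("X-Webhook-Timestamp", "")
    received = headers.get("X-Webhook-Signature", "")
    expected = "sha256=" + sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

Signing the timestamp along with the body lets a receiver reject old messages that are replayed later, if it also checks that the timestamp is recent. That freshness check is not shown here.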

If our retry logic can send the same notice twice, and eventually it will, since networks are never perfectly reliable, the receiving side needs a way to spot a repeat. We include a stable event ID in every message, one that stays the same across retries, and we tell partners plainly that a repeated message is expected, not a bug on our side. This one field alone removed an entire category of "we got charged twice" support tickets.
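The idea can be sketched from both sides: the sender assigns the ID once when the event is created, and the receiver remembers which IDs it has already processed. The field name `event_id`, the UUID format, and the in-memory set are assumptions for illustration; a real receiver would more likely keep seen IDs in durable storage, for example behind a unique constraint.

```python
import uuid
from typing import Callable, Dict, Set


def new_event(event_type: str, data: dict) -> Dict[str, object]:
    """Sender side: the ID is assigned once and reused unchanged on every retry."""
    return {"event_id": str(uuid.uuid4()), "type": event_type, "data": data}


class DedupingReceiver:
    """Receiver side: run the handler at most once per event ID."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()

    def receive(self, message: dict) -> bool:
        """Return True if the message was processed, False if it was a repeat."""
        event_id = message["event_id"]
        if event_id in self._seen:
            return False  # Expected repeat from a retry; acknowledge and ignore.
        self._handler(message)
        # Mark as seen only after the handler succeeds, so a failed
        # handler run can still be retried later.
        self._seen.add(event_id)
        return True
```

A receiver built this way can safely answer a repeated delivery with a success status, which also stops the sender from retrying further.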

The question that always comes up during an actual incident is which webhooks failed in the last day, and for which customers. If answering that means searching through log files under pressure, the logging was built too late. We built a simple internal page that lists failed deliveries, how many times each one was retried, and a button to resend by hand. That page turned what used to be a two-in-the-morning emergency into a five-minute fix.
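Behind a page like that, the core is a single query over delivery records. The sketch below assumes a hypothetical `webhook_deliveries` table with one row per event that tracks the customer, the attempt count, and whether delivery ever succeeded; the real schema is not described in this post.

```python
import sqlite3
import time
from typing import List, Optional, Tuple

# Assumed schema: one row per event, updated after each delivery attempt.
SCHEMA = """CREATE TABLE IF NOT EXISTS webhook_deliveries (
    event_id TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL,
    url TEXT NOT NULL,
    attempts INTEGER NOT NULL,
    delivered INTEGER NOT NULL,       -- 1 once any attempt succeeded, else 0
    last_attempt_at REAL NOT NULL     -- Unix timestamp of the latest attempt
)"""


def failed_since(conn: sqlite3.Connection, since: float) -> List[Tuple[str, str, str, int]]:
    """Undelivered events attempted at or after `since`, grouped by customer."""
    return conn.execute(
        "SELECT customer_id, event_id, url, attempts FROM webhook_deliveries"
        " WHERE delivered = 0 AND last_attempt_at >= ?"
        " ORDER BY customer_id, last_attempt_at",
        (since,),
    ).fetchall()


def failed_last_day(conn: sqlite3.Connection,
                    now: Optional[float] = None) -> List[Tuple[str, str, str, int]]:
    """Which webhooks failed in the last 24 hours, and for which customers."""
    now = time.time() if now is None else now
    return failed_since(conn, now - 24 * 60 * 60)
```

Each returned row carries what the page needs to display: the customer, the event, the destination, and the retry count. A resend button would then re-queue the event by its `event_id`.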

Maybeach Tech builds integration layers that hold up under actual network conditions, not just demo conditions. Get in touch and let us review your webhook setup.
