Runbook: Bank webhook failure
Endpoint affected: POST /api/integration/bank-callback/:partner
Owner: integration team
Severity: HIGH (when sustained — a single rejection is expected for replay/scan traffic; a sustained spike means real disbursements aren't being acknowledged)
Symptom
Any one of:
- Alert `bank_webhook_signature_rejected_rate` exceeds 5% of accepted volume over a 5-minute window.
- Alert `bank_webhook_error_total` is non-zero (server crashed before the verifier ran).
- Customer-reported: an EWA / loan disbursement stuck in `SUBMITTED_TO_BANK` past the expected settlement window even though the partner says it cleared.
- `demozpay_bank_webhook_requests_total{result="signature-rejected"}` jumps without a matching jump in `result="accepted"`.
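To make the first alert condition concrete, here is a minimal sketch of the threshold check. The `WindowCounts` shape and the function name are hypothetical; the real alert is presumably defined in the monitoring stack, and how it treats a window with zero accepted requests is an assumption here.

```typescript
// Hypothetical shape: per-result request counts summed over one 5-minute window.
interface WindowCounts {
  accepted: number;
  signatureRejected: number;
}

// Threshold from the alert description: 5% of accepted volume.
const REJECTED_RATIO_THRESHOLD = 0.05;

// Returns true when rejections exceed 5% of accepted volume in the window.
function rejectionAlertFires(counts: WindowCounts): boolean {
  if (counts.accepted === 0) {
    // Assumption: with no accepted traffic, any rejection counts as a breach.
    return counts.signatureRejected > 0;
  }
  return counts.signatureRejected / counts.accepted > REJECTED_RATIO_THRESHOLD;
}
```

With 100 accepted requests, 5 rejections sits exactly on the threshold and does not fire; 6 does.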
Likely causes
In order of frequency in production fintech webhooks:
1. Clock skew between partner and our edge: the partner timestamp drifts ≥ 5 minutes from our `now()`. A container clock that is not NTP-synced is the usual culprit.
2. HMAC secret rotated on one side but not the other. Partner rotated their signing key; we didn't redeploy with the new `BANK_WEBHOOK_SIGNING_KEY`.
3. Body-mangling proxy in front of the API. Anything that re-serialises the JSON (some WAFs / API gateways do) breaks the signature even though headers look right.
4. Bypassed `rawBody: true` config in NestJS. A redeploy with `main.ts` modified can drop the option; the controller then returns 401 because `req.rawBody` is undefined.
5. Partner actually replaying old captures at us. Real attackers, but also common during bank-side QA reruns. The 5-minute skew window catches it; the signal is real.
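The body-mangling cause is easy to underestimate, so here is a small illustration of why a semantically identical body still fails verification. It assumes the signature is a hex HMAC-SHA256 over `<ts>.<nonce>.<body>`, as in the hand-computation under Diagnosis steps; the key, timestamp, nonce, and payload are made-up values.

```typescript
import { createHmac } from "node:crypto";

// Hex HMAC-SHA256 over "<ts>.<nonce>.<body>" (scheme assumed from the Diagnosis steps).
function sign(key: string, ts: string, nonce: string, body: string): string {
  return createHmac("sha256", key).update(`${ts}.${nonce}.${body}`).digest("hex");
}

const key = "example-key"; // illustrative value only
const ts = "2024-01-01T00:00:00Z";
const nonce = "abc123";

// Bytes as the partner sent them (note the spaces after the colon and comma).
const rawBody = '{"amount": 100, "ref": "TX-1"}';
// What a proxy produces if it parses and re-serialises the JSON.
const reserialised = JSON.stringify(JSON.parse(rawBody));

const partnerSignature = sign(key, ts, nonce, rawBody);
const recomputed = sign(key, ts, nonce, reserialised);

console.log(reserialised === rawBody);        // false: whitespace was dropped
console.log(partnerSignature === recomputed); // false: signature no longer verifies
```

The JSON parses to the same object either way, which is why the request "looks right" in logs that print parsed fields rather than raw bytes.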
Diagnosis steps
- Check the rejection reason in the logs. The controller logs `bank-callback signature rejected: partner=<x>, reason=<y>` at WARN level. `reason` is one of `missing-headers` / `invalid-timestamp` / `clock-skew` / `signature-mismatch`. Each pinpoints the cause:
  - `clock-skew` → cause 1.
  - `signature-mismatch` → causes 2, 3, or 5 (need more diagnosis below).
  - `missing-headers` → either a malformed partner request (rare in production) or a proxy stripping `X-Demoz-*` headers.
  - `invalid-timestamp` → partner sent something that wasn't RFC 3339; this is a partner-side bug.
- For `clock-skew`: SSH to the API host and run `date -u`. Compare against `https://time.is/UTC` or an NTP probe. If drift > 1 minute, the host's clock service is broken; restart `systemd-timesyncd` / `chronyd`.
- For `signature-mismatch`: tail the partner's raw request from your access log (if you have raw-body capture; if not, see Resolution below). Re-compute the HMAC by hand: `printf '%s' "${ts}.${nonce}.${body}" | openssl dgst -sha256 -hmac "${BANK_WEBHOOK_SIGNING_KEY}"`. The `'%s'` format stops `printf` from interpreting any `%` or backslash in the body. If your hand-computed signature matches what the partner sent → the secret in env is wrong. If it doesn't → the body bytes on the wire differ from what you logged (proxy mangling, cause 3).
- For `missing-headers` on a real partner: `curl -v` the partner's confirmed webhook URL with a known-good payload. If you don't see the `X-Demoz-*` headers in the request your API received, a proxy stripped them.
- For sustained `signature-rejected` with no matching `accepted`: the partner is misconfigured OR this is a credential-stuffing attempt. Check the `partner` label distribution; if it's all coming from one partner that previously worked, that partner rotated keys. If it's mixed and traffic is artificially high, treat as attack (see Escalation).
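To read the four `reason` values with confidence, it helps to see how a verifier might arrive at each one. The following is a sketch under stated assumptions, not the code in `hmac.ts`: the `SignedRequest` shape, the order of the checks, hex encoding of the signature, and the loose RFC 3339 pattern are all hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type RejectReason = "missing-headers" | "invalid-timestamp" | "clock-skew" | "signature-mismatch";

// Hypothetical input shape: values already pulled from the X-Demoz-* headers.
interface SignedRequest {
  timestamp?: string;
  nonce?: string;
  signature?: string; // assumed hex-encoded
  rawBody: string;
}

const MAX_SKEW_MS = 5 * 60 * 1000; // the 5-minute skew window

// Loose RFC 3339 shape check; Date.parse alone also accepts other formats.
const RFC3339 = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

// Returns null when the request verifies, otherwise the rejection reason.
function classify(req: SignedRequest, key: string, nowMs: number): RejectReason | null {
  if (!req.timestamp || !req.nonce || !req.signature) return "missing-headers";
  if (!RFC3339.test(req.timestamp)) return "invalid-timestamp";
  const sentMs = Date.parse(req.timestamp);
  if (Number.isNaN(sentMs)) return "invalid-timestamp";
  if (Math.abs(nowMs - sentMs) >= MAX_SKEW_MS) return "clock-skew";

  const expected = createHmac("sha256", key)
    .update(`${req.timestamp}.${req.nonce}.${req.rawBody}`)
    .digest();
  const received = Buffer.from(req.signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    return "signature-mismatch";
  }
  return null;
}
```

Under this ordering, a request with a stale timestamp is reported as `clock-skew` before its signature is ever checked, so `signature-mismatch` implies the headers were present and the timestamp was fresh.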
Mitigation
In order of immediacy:
- Clock skew (host): `systemctl restart chronyd` (Amazon Linux / RHEL) or `systemctl restart systemd-timesyncd` (Debian / Ubuntu). Confirm with `chronyc tracking` / `timedatectl status`.
- Wrong secret: roll `BANK_WEBHOOK_SIGNING_KEY` (Kubernetes Secret / SSM Parameter). Confirm the application picks it up: the value is read at boot, so a rolling restart of the `api` deployment is required.
- Proxy mangling: configure the ingress to pass body bytes through unmodified. For nginx in front of the API: ensure `proxy_pass_request_body on;` and remove any `proxy_set_header Content-Type` override that re-canonicalises Content-Type to a different charset.
- Don't disable signature verification to "unblock" unless the security lead explicitly signs off. With verification off, every accepted webhook becomes an unauthenticated state-change.
Resolution
- For cause 1 (clock): file a config bug on the host image to add an NTP probe to the boot health check. Add `chronyc tracking | grep "Last offset"` as an indicator alert.
- For cause 2 (rotation): write the secret-rotation procedure into the on-call rota. Set the next rotation as a calendar event aligned with the partner's published schedule.
- For cause 3 (proxy mangling): pin Content-Type handling in the ingress config. Add a smoke test that POSTs a known-good signed payload via the production ingress and asserts 200; run it in pre-deploy.
- For cause 4 (rawBody dropped): gate the deploy on `apps/api/src/main.ts` containing `rawBody: true`. Add a release check.
- For cause 5 (attack): confirm with the security lead that the rejected payloads are not part of a wider campaign. Add a rate limit at the ingress on the `bank-callback` route.
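The release check for cause 4 can start as a plain text match over the bootstrap file. The sketch below is one possible shape; the function name is hypothetical, and a text match only shows the option is still written in the file, not that it reaches the application factory (a commented-out line would also pass).

```typescript
// True when the source text still contains `rawBody: true`.
// This is a textual check only; it does not parse the TypeScript.
function hasRawBodyOption(source: string): boolean {
  return /\brawBody\s*:\s*true\b/.test(source);
}

// Illustrative inputs resembling a NestJS bootstrap call.
const withOption = "const app = await NestFactory.create(AppModule, { rawBody: true });";
const withoutOption = "const app = await NestFactory.create(AppModule);";

console.log(hasRawBodyOption(withOption));    // true
console.log(hasRawBodyOption(withoutOption)); // false
```

In a pipeline, the same function would be fed the contents of `apps/api/src/main.ts` and the job failed on a `false` result.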
Escalation
- 1 hour without diagnosis → page the platform on-call.
- Suspected attack (cause 5) → page security on-call IMMEDIATELY (do not wait for diagnosis).
- If the partner's settlement lifecycle is impacted because we're failing legitimate webhooks for > 10 minutes → page finance/ops; they may need to manually open a recovery channel with the partner.
Related
- The HMAC scheme is documented at `apps/api/src/money/integration/hmac.ts`.
- Settlement-poller is the fallback when webhooks fail entirely: confirm `SETTLEMENT_POLLER_ENABLED=true` on at least one API replica. The poller's behaviour is in `apps/api/src/money/integration/settlement-poller.service.ts`.
- Test harness: `services/bank-sandbox/test/verify-s3.sh` exercises this end-to-end. It's the closest thing to a production replay.