Runbook: Bank webhook failure

Endpoint affected: POST /api/integration/bank-callback/:partner
Owner: integration team
Severity: HIGH when sustained (a single rejection is expected for replay/scan traffic; a sustained spike means real disbursements aren't being acknowledged)

Symptom

Any one of:

Alert bank_webhook_signature_rejected_rate exceeds 5% of accepted volume over a 5-minute window.
Alert bank_webhook_error_total is non-zero (the server crashed before the verifier ran).
Customer-reported: an EWA / loan disbursement stuck in SUBMITTED_TO_BANK past the expected settlement window even though the partner says it cleared.
demozpay_bank_webhook_requests_total{result="signature-rejected"} jumps without a matching jump in result="accepted".
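
As a rough illustration of the first alert condition, the sketch below evaluates one 5-minute window of counts against the 5% threshold. The function and field names are hypothetical, and firing on rejections when nothing at all was accepted is an assumption; the real rule behind bank_webhook_signature_rejected_rate may be defined differently.

```typescript
// Counts observed over one 5-minute window, e.g. the per-`result` increase of
// demozpay_bank_webhook_requests_total. Field names are illustrative only.
interface WindowCounts {
  accepted: number;
  signatureRejected: number;
}

// 5% of accepted volume, as stated in the alert description.
const REJECTION_RATIO_THRESHOLD = 0.05;

function rejectionAlertFires(counts: WindowCounts): boolean {
  if (counts.signatureRejected === 0) return false;
  // Assumption: rejections with zero accepted traffic always fire,
  // because the ratio is unbounded.
  if (counts.accepted === 0) return true;
  return counts.signatureRejected / counts.accepted > REJECTION_RATIO_THRESHOLD;
}
```

This also covers the last symptom: a jump in signature-rejected with no matching jump in accepted pushes the ratio up quickly.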

Likely causes

In order of frequency in production fintech webhooks:

1. Clock skew between partner and our edge: the partner timestamp drifts ≥ 5 minutes from our now(). A container clock that is not NTP-synced is the usual culprit.
2. HMAC secret rotated on one side but not the other. The partner rotated their signing key and we didn't redeploy with the new BANK_WEBHOOK_SIGNING_KEY.
3. Body-mangling proxy in front of the API. Anything that re-serialises the JSON (some WAFs / API gateways do) breaks the signature even though the headers look right.
4. rawBody: true dropped from the NestJS config. A redeploy with a modified main.ts can drop the option; the controller then returns 401 because req.rawBody is undefined.
5. Old captures replayed at us. Sometimes these are real attackers, but replays are also common during bank-side QA reruns. The 5-minute skew window catches them; the signal is real.
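
To see why a re-serialising proxy breaks the signature, the sketch below signs the same JSON value twice: once as the exact bytes a partner might send and once after a parse/stringify round trip. It assumes HMAC-SHA256 over timestamp.nonce.body with hex output, matching the openssl command in the diagnosis steps; the key, timestamp, nonce, and payload fields are made up for illustration.

```typescript
import { createHmac } from "node:crypto";

// Signs `${ts}.${nonce}.${body}` with HMAC-SHA256 and returns lowercase hex.
function sign(key: string, ts: string, nonce: string, body: string): string {
  return createHmac("sha256", key).update(`${ts}.${nonce}.${body}`).digest("hex");
}

// Illustrative values only; none of these come from a real partner.
const key = "example-key-not-a-real-secret";
const ts = "2024-01-01T00:00:00Z";
const nonce = "n-123";

// Bytes exactly as sent on the wire (note the spaces after the colons).
const wireBody = '{"reference": "abc", "status": "SETTLED"}';
// What a proxy produces if it parses and re-serialises the JSON.
const reserialisedBody = JSON.stringify(JSON.parse(wireBody));

const partnerSignature = sign(key, ts, nonce, wireBody);
const afterProxy = sign(key, ts, nonce, reserialisedBody);

// Same JSON value, different bytes, so the signatures differ.
console.log(partnerSignature === afterProxy); // false
```

The same reasoning applies to cause 4: without the raw bytes, the controller has nothing byte-exact to verify against.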

Diagnosis steps

Check the rejection reason in the logs. The controller logs bank-callback signature rejected: partner=<x>, reason=<y> at WARN level. reason is one of missing-headers / invalid-timestamp / clock-skew / signature-mismatch. Each pinpoints the cause:
- clock-skew → cause 1.
- signature-mismatch → causes 2, 3, or 5 (need more diagnosis below).
- missing-headers → either a malformed partner request (rare on production) or a proxy stripping X-Demoz-* headers.
- invalid-timestamp → partner sent something that wasn't RFC 3339; this is a partner-side bug.
For clock-skew: SSH to the API host and run date -u. Compare against https://time.is/UTC or an NTP probe. If drift > 1 minute, the host's clock service is broken; restart systemd-timesyncd / chronyd.
For signature-mismatch: pull the partner's raw request from your access log (if you have raw-body capture; if not, see Resolution below). Re-compute the HMAC by hand, with BANK_WEBHOOK_SIGNING_KEY set in your shell to the key the partner confirms is current: printf '%s' "${ts}.${nonce}.${body}" | openssl dgst -sha256 -hmac "${BANK_WEBHOOK_SIGNING_KEY}". Pass the payload as an argument to printf '%s' rather than as the format string, so that % characters and backslash sequences in the JSON body are not interpreted. If your hand-computed signature matches what the partner sent → the secret in env is wrong. If it doesn't → the body bytes on the wire differ from what you logged (proxy mangling, cause 3).
For missing-headers on a real partner: send a known-good payload with curl -v to the webhook URL the partner has confirmed they call. If the X-Demoz-* headers are absent from the request your API received, a proxy stripped them.
For sustained signature-rejected with no matching accepted: either the partner is misconfigured or the endpoint is under attack (for example, forged or replayed requests). Check the partner label distribution; if it's all coming from one partner that previously worked, that partner rotated keys. If it's mixed and traffic is artificially high, treat it as an attack (see Escalation).
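
The four rejection reasons map onto a verification order roughly like the sketch below. This is not the contents of hmac.ts: the header names are hypothetical (the runbook only says they start with X-Demoz-), the hex encoding and the order of the checks are assumptions, and Date.parse is looser than a strict RFC 3339 parser.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type Verdict =
  | "accepted"
  | "missing-headers"
  | "invalid-timestamp"
  | "clock-skew"
  | "signature-mismatch";

// 5-minute window, as described under "Likely causes".
const MAX_SKEW_MS = 5 * 60 * 1000;

// Hypothetical header names, chosen for illustration only.
interface SignedHeaders {
  "x-demoz-timestamp"?: string;
  "x-demoz-nonce"?: string;
  "x-demoz-signature"?: string;
}

function classify(headers: SignedHeaders, rawBody: Buffer, key: string, now: Date): Verdict {
  const ts = headers["x-demoz-timestamp"];
  const nonce = headers["x-demoz-nonce"];
  const signature = headers["x-demoz-signature"];
  if (!ts || !nonce || !signature) return "missing-headers";

  // Date.parse accepts more than RFC 3339; a real verifier may be stricter.
  const sentAt = Date.parse(ts);
  if (Number.isNaN(sentAt)) return "invalid-timestamp";
  if (Math.abs(now.getTime() - sentAt) >= MAX_SKEW_MS) return "clock-skew";

  // HMAC-SHA256 over `${ts}.${nonce}.` followed by the raw body bytes.
  const expected = createHmac("sha256", key)
    .update(`${ts}.${nonce}.`)
    .update(rawBody)
    .digest();
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  const matches = received.length === expected.length && timingSafeEqual(received, expected);
  return matches ? "accepted" : "signature-mismatch";
}
```

Running a captured request through a function like this with the key you believe is current reproduces the hand computation above and tells you which branch the controller most likely took.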

Mitigation

In order of immediacy:

Clock skew (host): systemctl restart chronyd (Amazon Linux / RHEL) or systemctl restart systemd-timesyncd (Debian / Ubuntu). Confirm with chronyc tracking / timedatectl status.
Wrong secret: roll BANK_WEBHOOK_SIGNING_KEY (Kubernetes Secret / SSM Parameter). Confirm the application picks it up — the value is read at boot, so a rolling restart of the api deployment is required.
Proxy mangling: configure the ingress to pass body bytes through unmodified. For nginx in front of the API: ensure proxy_pass_request_body on; and remove any proxy_set_header Content-Type override that re-canonicalises Content-Type to a different charset.
Don't disable signature verification to "unblock" unless the security lead explicitly signs off — every accepted webhook becomes an unauthenticated state-change.

Resolution

For cause 1 (clock): file a config bug on the host image to add an NTP probe to the boot health check. Add chronyc tracking | grep "Last offset" as an indicator alert.
For cause 2 (rotation): write the secret-rotation procedure into the on-call rota. Set the next rotation as a calendar event aligned with the partner's published schedule.
For cause 3 (proxy mangling): pin Content-Type handling in the ingress config. Add a smoke test that POSTs a known-good signed payload via the production ingress and asserts 200; run it in pre-deploy.
For cause 4 (rawBody dropped): gate the deploy on apps/api/src/main.ts containing rawBody: true. Add a release check.
For cause 5 (attack): confirm with the security lead that the rejected payloads are not part of a wider campaign. Add a rate limit at the ingress on the bank-callback route.
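
The release check for cause 4 could be as small as the sketch below. It is a naive text match, not a TypeScript parse, so a commented-out option would still pass; the function name is hypothetical.

```typescript
// Returns true when the bootstrap source appears to pass `rawBody: true`,
// e.g. NestFactory.create(AppModule, { rawBody: true }).
// Naive by design: it only looks for the option as text.
function hasRawBodyEnabled(mainTsSource: string): boolean {
  return /\brawBody\s*:\s*true\b/.test(mainTsSource);
}
```

In a pipeline, one would read apps/api/src/main.ts, pass its contents to this function, and fail the release when it returns false.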

Escalation

1 hour without diagnosis → page the platform on-call.
Suspected attack (cause 5) → page security on-call IMMEDIATELY (do not wait for diagnosis).
If the partner's settlement lifecycle is impacted because we're failing legitimate webhooks for > 10 minutes → page finance/ops; they may need to manually open a recovery channel with the partner.

Related

The HMAC scheme is documented at apps/api/src/money/integration/hmac.ts.
Settlement-poller is the fallback when webhooks fail entirely — confirm SETTLEMENT_POLLER_ENABLED=true on at least one API replica. The poller's behaviour is in apps/api/src/money/integration/settlement-poller.service.ts.
Test harness: services/bank-sandbox/test/verify-s3.sh exercises this end-to-end. It's the closest thing to a production replay.
