Runbook: Bank reconciliation drift detected
Owner: finance/ops + ledger team
Severity: CRITICAL — non-zero drift means the ledger and the bank disagree on cash. Every release is gated on 7 consecutive days of drift = 0 (S4.6 acceptance). Treat this as a release-blocking finding until resolved.
Symptom
Any one of:
- A
ReconcileWithBankRPC call logged at ERROR with"NON-ZERO DRIFT vs bank statement". Fields:tenant_id,account_id,ledger_total,statement_total,drift(signed santim),currency,entries_count. - Alert
ledger_bank_drift_santim_total(sum of |drift| across accounts in the last 24h) > 0. - Dashboard panel "S4 drift heartbeat" shows non-zero for any tenant.
- Reconciliation matcher flagged a statement line with
value_date drift exceeds ...oramount mismatch ...(matcher-side drift, distinct from RPC-side).
The signed drift field tells you which side is heavier:
drift > 0→ ledger thinks more money moved than the bank does. We booked something the bank didn't.drift < 0→ bank thinks more money moved than the ledger does. The bank settled something we didn't book.
Likely causes
In rough order of likelihood, narrowed by the sign of drift:
Drift > 0 (ledger > bank)
- A
ConfirmSettlementfired on a transaction the bank later REJECTED out-of-band. The reverse webhook didn't make it; the row is POSTED but no cash actually moved. Most common cause. - Bank statement period misalignment. Our
period_start/period_endbrackets a transaction the bank reports on the next statement. Resolves on the next run; not a real bug. - A POSTED entry's accounting sign is wrong — should have been CREDIT, was inserted as DEBIT (or vice versa). This is a code bug in whatever use case posted the entry.
Drift < 0 (bank > ledger)
- The bank settled a transfer we never initiated. Either a duplicate at the bank (rare) or someone is using our bank account directly outside the platform (treat as security incident).
- A
ConfirmSettlementfailed (DB outage, crash mid-RPC) and the API state is wrong. The bank's webhook said COMPLETED; our ledger never moved. - A bank fee / adjustment line. The bank posted a fee against our account that we don't model in the ledger.
Diagnosis steps
- Capture the drift value. From the alert payload (or
kubectl logs deploy/ledger):tenant_id=... account_id=... ledger_total=X statement_total=Y drift=Z entries_count=N - Sign of
drift→ branch. Use the table above to narrow the cause. - Enumerate the POSTED entries the ledger summed.
-- Run against pg-ledger as a BYPASSRLS role, OR set app.tenant_id first.SELECT e.transaction_id, t.description, e.direction, e.amount_santim,e.created_at, t.status, t.metadataFROM ledger_entry eJOIN ledger_transaction t ON t.id = e.transaction_idWHERE e.tenant_id = '<tenant>'AND e.account_id = '<account>'AND t.status = 'POSTED'AND e.created_at >= '<period_start>'AND e.created_at <= '<period_end>'ORDER BY e.created_at;
- Enumerate the bank statement lines for the same period.
SELECT partner_reference, amount_santim, value_date, description,matched_disbursement_id, flagged, flagged_reasonFROM bank_statement_lineWHERE tenant_id = '<tenant>'AND partner = '<partner>'AND value_date >= '<period_start>'::dateAND value_date <= '<period_end>'::dateORDER BY value_date;
- Diff the two sets manually. A
drift > 0case will have an entry in step 3 with no matching row in step 4. Conversely fordrift < 0. Usepartner_referenceas the join key when possible (the matcher already does this; flagged rows tell you which lines couldn't be matched). - For each discrepancy, identify the originating use case by reading
ledger_transaction.metadata(which carriesewa_id/loan_id/disbursement_id). Read the audit log row for that aggregate to reconstruct the timeline.
Mitigation
DO NOT attempt to "correct" the ledger by inserting or updating rows. The ledger is append-only (ADR-009). All corrections are forward motion through legitimate state transitions.
- For cause 1 (ledger > bank, errant ConfirmSettlement):
- Call
Reverseagainst the transaction id (idempotent; usesLedger.ReverseRPC). This appends a compensating transaction; the original stays POSTED for audit. - File the EWA / loan as
BANK_REJECTEDvia a manual admin tool (DO NOT run UPDATE directly — go through the use case).
- Call
- For cause 2 (period misalignment): nothing to mitigate; just confirm by re-running the next day and watching drift fall to zero.
- For cause 3 (sign error): find every transaction posted with the broken sign rule (
ledger_transaction.metadata -> 'use_case'filter), reverse them all, ship a code fix, then re-post correctly. This is a code-bug-grade incident. - For cause 4 (bank settled something we didn't): STOP. This is a security incident. Lock the bank account at the partner side (call them), preserve every artifact, page security on-call.
- For cause 5 (failed ConfirmSettlement): the partner_reference and amount are known from the bank statement line. Call
ConfirmSettlementmanually with the original ledger tx id (thedisbursementtable at the gateway carries the link). Verify the entry now appears in the ledger total. - For cause 6 (bank fee): post the fee through the canonical "bank fee" use case (TODO: doesn't yet exist; track separately). Until it does, treat as "model gap" and document the residual drift instead of fixing it.
Resolution
- Cause 1: investigate the missed reverse-webhook. Is the partner sending it? Is the settlement-poller running? Is
BANK_WEBHOOK_SIGNING_KEYcorrect? Cross-reference with thewebhook-failure.mdrunbook. - Cause 2: harden the daily-recon period boundary to match the partner's own settlement cutover (e.g. T+1 at 06:00 UTC for Dashen). Update
period_endaccordingly in the cron. - Cause 3: code review the offending use case. Add a unit test for the sign of every account entry it posts.
- Cause 4: security incident — full forensic + bank-side investigation. Do not unblock the release until root-caused.
- Cause 5: improve
BankSettlementApplier's error handling so a mid-RPC failure surfaces an actionable alert rather than going silent. - Cause 6: model "bank fee" + "interest accrual" as first-class use cases with their own ledger account taxonomy. Until shipped, the residual drift will be exactly the un-modelled categories per period.
Escalation
- Drift > 0 with a clear
cause 1signature → on-call resolves it (Reverse + manual admin) and files a follow-up. No page. - Drift detected during the S4.6 soak window → reset the 7-day counter. Release is blocked. Notify engineering lead + finance lead.
- Cause 4 (potential fraud / external access) → page security on-call IMMEDIATELY.
- Drift sustained > 24h across multiple runs → page finance + engineering leads jointly.
Related
- The
ReconcileWithBankRPC + handler:services/ledger/internal/server/reconcile_with_bank.go,internal/store/postgres_store.go::ReconcileWithBank. - Matcher-side flagging (statement lines that don't match our
disbursementtable):services/integration-gateway/internal/reconciliation/matcher.go. Flagged rows carry their reason; that's another route to discovering drift before the ledger sum catches up. - Append-only ledger contract: ADR-009.