Skip to main content

Runbook: Bank reconciliation drift detected

Owner: finance/ops + ledger team Severity: CRITICAL — non-zero drift means the ledger and the bank disagree on cash. Every release is gated on 7 consecutive days of drift = 0 (S4.6 acceptance). Treat this as a release-blocking finding until resolved.

Symptom

Any one of:

  • A ReconcileWithBank RPC call logged at ERROR with "NON-ZERO DRIFT vs bank statement". Fields: tenant_id, account_id, ledger_total, statement_total, drift (signed santim), currency, entries_count.
  • Alert ledger_bank_drift_santim_total (sum of |drift| across accounts in the last 24h) > 0.
  • Dashboard panel "S4 drift heartbeat" shows non-zero for any tenant.
  • Reconciliation matcher flagged a statement line with value_date drift exceeds ... or amount mismatch ... (matcher-side drift, distinct from RPC-side).

The signed drift field tells you which side is heavier:

  • drift > 0 → ledger thinks more money moved than the bank does. We booked something the bank didn't.
  • drift < 0 → bank thinks more money moved than the ledger does. The bank settled something we didn't book.

Likely causes

In rough order of likelihood, narrowed by the sign of drift:

Drift > 0 (ledger > bank)

  1. A ConfirmSettlement fired on a transaction the bank later REJECTED out-of-band. The reverse webhook didn't make it; the row is POSTED but no cash actually moved. Most common cause.
  2. Bank statement period misalignment. Our period_start/period_end brackets a transaction the bank reports on the next statement. Resolves on the next run; not a real bug.
  3. A POSTED entry's accounting sign is wrong — should have been CREDIT, was inserted as DEBIT (or vice versa). This is a code bug in whatever use case posted the entry.

Drift < 0 (bank > ledger)

  1. The bank settled a transfer we never initiated. Either a duplicate at the bank (rare) or someone is using our bank account directly outside the platform (treat as security incident).
  2. A ConfirmSettlement failed (DB outage, crash mid-RPC) and the API state is wrong. The bank's webhook said COMPLETED; our ledger never moved.
  3. A bank fee / adjustment line. The bank posted a fee against our account that we don't model in the ledger.

Diagnosis steps

  1. Capture the drift value. From the alert payload (or kubectl logs deploy/ledger):
    tenant_id=... account_id=... ledger_total=X statement_total=Y drift=Z entries_count=N
  2. Sign of drift → branch. Use the table above to narrow the cause.
  3. Enumerate the POSTED entries the ledger summed.
    -- Run against pg-ledger as a BYPASSRLS role, OR set app.tenant_id first.
    SELECT e.transaction_id, t.description, e.direction, e.amount_santim,
    e.created_at, t.status, t.metadata
    FROM ledger_entry e
    JOIN ledger_transaction t ON t.id = e.transaction_id
    WHERE e.tenant_id = '<tenant>'
    AND e.account_id = '<account>'
    AND t.status = 'POSTED'
    AND e.created_at >= '<period_start>'
    AND e.created_at <= '<period_end>'
    ORDER BY e.created_at;
  4. Enumerate the bank statement lines for the same period.
    SELECT partner_reference, amount_santim, value_date, description,
    matched_disbursement_id, flagged, flagged_reason
    FROM bank_statement_line
    WHERE tenant_id = '<tenant>'
    AND partner = '<partner>'
    AND value_date >= '<period_start>'::date
    AND value_date <= '<period_end>'::date
    ORDER BY value_date;
  5. Diff the two sets manually. A drift > 0 case will have an entry in step 3 with no matching row in step 4. Conversely for drift < 0. Use partner_reference as the join key when possible (the matcher already does this; flagged rows tell you which lines couldn't be matched).
  6. For each discrepancy, identify the originating use case by reading ledger_transaction.metadata (which carries ewa_id / loan_id / disbursement_id). Read the audit log row for that aggregate to reconstruct the timeline.

Mitigation

DO NOT attempt to "correct" the ledger by inserting or updating rows. The ledger is append-only (ADR-009). All corrections are forward motion through legitimate state transitions.

  1. For cause 1 (ledger > bank, errant ConfirmSettlement):
    • Call Reverse against the transaction id (idempotent; uses Ledger.Reverse RPC). This appends a compensating transaction; the original stays POSTED for audit.
    • File the EWA / loan as BANK_REJECTED via a manual admin tool (DO NOT run UPDATE directly — go through the use case).
  2. For cause 2 (period misalignment): nothing to mitigate; just confirm by re-running the next day and watching drift fall to zero.
  3. For cause 3 (sign error): find every transaction posted with the broken sign rule (ledger_transaction.metadata -> 'use_case' filter), reverse them all, ship a code fix, then re-post correctly. This is a code-bug-grade incident.
  4. For cause 4 (bank settled something we didn't): STOP. This is a security incident. Lock the bank account at the partner side (call them), preserve every artifact, page security on-call.
  5. For cause 5 (failed ConfirmSettlement): the partner_reference and amount are known from the bank statement line. Call ConfirmSettlement manually with the original ledger tx id (the disbursement table at the gateway carries the link). Verify the entry now appears in the ledger total.
  6. For cause 6 (bank fee): post the fee through the canonical "bank fee" use case (TODO: doesn't yet exist; track separately). Until it does, treat as "model gap" and document the residual drift instead of fixing it.

Resolution

  • Cause 1: investigate the missed reverse-webhook. Is the partner sending it? Is the settlement-poller running? Is BANK_WEBHOOK_SIGNING_KEY correct? Cross-reference with the webhook-failure.md runbook.
  • Cause 2: harden the daily-recon period boundary to match the partner's own settlement cutover (e.g. T+1 at 06:00 UTC for Dashen). Update period_end accordingly in the cron.
  • Cause 3: code review the offending use case. Add a unit test for the sign of every account entry it posts.
  • Cause 4: security incident — full forensic + bank-side investigation. Do not unblock the release until root-caused.
  • Cause 5: improve BankSettlementApplier's error handling so a mid-RPC failure surfaces an actionable alert rather than going silent.
  • Cause 6: model "bank fee" + "interest accrual" as first-class use cases with their own ledger account taxonomy. Until shipped, the residual drift will be exactly the un-modelled categories per period.

Escalation

  • Drift > 0 with a clear cause 1 signature → on-call resolves it (Reverse + manual admin) and files a follow-up. No page.
  • Drift detected during the S4.6 soak window → reset the 7-day counter. Release is blocked. Notify engineering lead + finance lead.
  • Cause 4 (potential fraud / external access) → page security on-call IMMEDIATELY.
  • Drift sustained > 24h across multiple runs → page finance + engineering leads jointly.
  • The ReconcileWithBank RPC + handler: services/ledger/internal/server/reconcile_with_bank.go, internal/store/postgres_store.go::ReconcileWithBank.
  • Matcher-side flagging (statement lines that don't match our disbursement table): services/integration-gateway/internal/reconciliation/matcher.go. Flagged rows carry their reason; that's another route to discovering drift before the ledger sum catches up.
  • Append-only ledger contract: ADR-009.