Runbooks

One file per alert or failure mode. Every alert that pages on-call must link to its runbook here. An alert without a runbook is an alert that won't get diagnosed at 3 AM.
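The linking rule above lends itself to an automated check. The following is a minimal sketch, not an existing tool in this repository: the `Alert` type, its fields, and the sample alert names are hypothetical, and it assumes alert definitions can be loaded into a slice and runbook file names into a set.

```go
package main

import (
	"fmt"
	"sort"
)

// Alert is a hypothetical, simplified view of an alert rule.
type Alert struct {
	Name    string
	Pages   bool   // true if the alert pages on-call
	Runbook string // runbook file name in this directory; empty if unset
}

// missingRunbooks returns the sorted names of paging alerts whose runbook
// is unset or does not match any known runbook file.
func missingRunbooks(alerts []Alert, files map[string]bool) []string {
	var missing []string
	for _, a := range alerts {
		if !a.Pages {
			continue // non-paging alerts are not covered by the rule
		}
		if a.Runbook == "" || !files[a.Runbook] {
			missing = append(missing, a.Name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Sample data for illustration only.
	files := map[string]bool{"webhook-failure.md": true, "gateway-down.md": true}
	alerts := []Alert{
		{Name: "WebhookHMACRejects", Pages: true, Runbook: "webhook-failure.md"},
		{Name: "GatewayDown", Pages: true, Runbook: "gateway-down.md"},
		{Name: "ReplicaLag", Pages: true, Runbook: "db-replication-lag.md"},
		{Name: "DiskUsageInfo", Pages: false},
	}
	for _, name := range missingRunbooks(alerts, files) {
		fmt.Println("paging alert without runbook:", name)
	}
}
```

Run in CI, a check of this shape would fail the build when a paging alert is added before its runbook exists.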

Structure of a runbook

# Runbook: <alert name>

## Symptom
What the user / monitor sees.

## Likely causes
List in order of frequency.

## Diagnosis steps
1. Check X
2. Check Y
3. ...

## Mitigation
Immediate steps to stop the bleeding.

## Resolution
What to do to fix the root cause.

## Escalation
Who to wake up if mitigation doesn't work.
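
A new runbook can be checked against the structure above mechanically. This is a minimal sketch under stated assumptions: the function name `missingSections` is hypothetical, section headings are assumed to be written exactly as in the template with a `## ` prefix, and heading order and headings inside fenced code blocks are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections lists the level-2 headings from the runbook template.
var requiredSections = []string{
	"Symptom", "Likely causes", "Diagnosis steps",
	"Mitigation", "Resolution", "Escalation",
}

// missingSections returns the template sections that do not appear as
// "## <name>" headings in doc, in template order.
func missingSections(doc string) []string {
	seen := map[string]bool{}
	for _, line := range strings.Split(doc, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "## ") {
			seen[strings.TrimSpace(strings.TrimPrefix(line, "## "))] = true
		}
	}
	var missing []string
	for _, s := range requiredSections {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	// A deliberately incomplete draft, for illustration only.
	draft := "# Runbook: example\n\n## Symptom\nText.\n\n## Mitigation\nText.\n"
	for _, s := range missingSections(draft) {
		fmt.Println("missing section:", s)
	}
}
```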

Written

S4.6 acceptance gate — the four runbooks the go-live gate requires:

  • webhook-failure.md — bank-callback HMAC verifier rejects sustained traffic; diagnosis by reason label, mitigations per cause.
  • gateway-down.md — services/integration-gateway unhealthy / crashed; what survives (in-flight settlements) and what breaks (new disbursement attempts).
  • drift-detected.md — Ledger.ReconcileWithBank reports non-zero drift; signed-drift causal branching, append-only-safe remediation.
  • bank-statement-parse-failed.md — Dashen CSV ingester rejects a file or > 10% of its rows; format vs encoding vs data-quality split.
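
The "> 10% of its rows" condition in the last item can be read as a strict threshold on the rejected-row ratio. The sketch below illustrates that reading only; it is not the ingester's actual code. The names `maxRejectPercent` and `exceedsRejectThreshold` are hypothetical, and treating a file with zero rows as not exceeding the threshold is an assumption.

```go
package main

import "fmt"

// maxRejectPercent is the share of rejected rows above which the file is
// treated as failed (assumed strict: exactly 10% still passes).
const maxRejectPercent = 10

// exceedsRejectThreshold reports whether more than maxRejectPercent of the
// rows were rejected. Integer arithmetic avoids floating-point rounding at
// the boundary. A file with zero rows never exceeds the threshold here.
func exceedsRejectThreshold(rejected, total int) bool {
	if total <= 0 {
		return false
	}
	return rejected*100 > total*maxRejectPercent
}

func main() {
	fmt.Println(exceedsRejectThreshold(10, 100)) // exactly 10%: false
	fmt.Println(exceedsRejectThreshold(11, 100)) // 11%: true
}
```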

Reference + onboarding:

  • permission-matrix.md — Phase A / GL-01..03 canonical reference: every API endpoint's auth, tenant scope, actor source, role requirement, audit emission. Every PR adding an endpoint must update this file.

To be written (priority order, post-go-live)

  • ledger-balance-drift.md — drift between the Go-side independent sum and the derived view (internal correctness check, distinct from bank-side drift in drift-detected.md).
  • partner-payout-failed.md — payout to bank/wallet failed at the gateway adapter and won't recover.
  • auth-token-validation-failures.md — spike in JWT verification errors.
  • db-replication-lag.md — read replica falls behind primary.
  • kafka-consumer-lag.md — outbox events not being drained.