Runbooks
One file per alert or failure mode. Every alert that pages on-call must link to its runbook here. An alert without a runbook is an alert that won't get diagnosed at 3 AM.
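The linking rule above can be enforced mechanically. The following is a minimal sketch, not the project's actual tooling: the `Alert` type, its fields, and the alert names are hypothetical, and it assumes the set of runbook file names has already been read from this directory.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified view of an alert rule.
// The real alert definitions may look quite different.
type Alert struct {
	Name        string
	PagesOnCall bool
	Runbook     string // file name in this directory, e.g. "gateway-down.md"
}

// unlinkedAlerts returns the names of paging alerts whose Runbook field
// is empty or does not match a known runbook file.
func unlinkedAlerts(alerts []Alert, runbooks map[string]bool) []string {
	var bad []string
	for _, a := range alerts {
		if a.PagesOnCall && !runbooks[a.Runbook] {
			bad = append(bad, a.Name)
		}
	}
	return bad
}

func main() {
	// Assumed input: runbook files that exist today.
	runbooks := map[string]bool{
		"gateway-down.md":    true,
		"webhook-failure.md": true,
	}
	// Illustrative alerts only; the names are made up.
	alerts := []Alert{
		{Name: "GatewayDown", PagesOnCall: true, Runbook: "gateway-down.md"},
		{Name: "ReplicaLag", PagesOnCall: true, Runbook: "db-replication-lag.md"},
		{Name: "DiskWarning", PagesOnCall: false},
	}
	// Reports ReplicaLag: it pages, but its runbook is not written yet.
	fmt.Println(unlinkedAlerts(alerts, runbooks))
}
```

A check like this could run in CI so that a paging alert cannot be merged before its runbook exists; non-paging alerts are deliberately ignored.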
Structure of a runbook
# Runbook: <alert name>
## Symptom
What the user / monitor sees.
## Likely causes
List in order of frequency.
## Diagnosis steps
1. Check X
2. Check Y
3. ...
## Mitigation
Immediate steps to stop the bleeding.
## Resolution
What to do to fix the root cause.
## Escalation
Who to wake up if mitigation doesn't work.
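To keep new runbooks aligned with the template above, a small lint can report which sections a draft is missing. This is a sketch under assumptions: the function name `missingSections` is hypothetical, and it assumes each section appears as a `## ` heading spelled exactly as in the template.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections lists the level-2 headings from the runbook template, in order.
var requiredSections = []string{
	"Symptom",
	"Likely causes",
	"Diagnosis steps",
	"Mitigation",
	"Resolution",
	"Escalation",
}

// missingSections returns the template sections that do not appear as
// "## <name>" headings in the given runbook text.
func missingSections(runbook string) []string {
	found := make(map[string]bool)
	for _, line := range strings.Split(runbook, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "## ") {
			found[strings.TrimSpace(strings.TrimPrefix(line, "## "))] = true
		}
	}
	var missing []string
	for _, s := range requiredSections {
		if !found[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	// An incomplete draft with only two of the six sections.
	draft := "# Runbook: example-alert\n## Symptom\nPages fire.\n## Mitigation\nRestart.\n"
	// Prints the four sections the draft still needs.
	fmt.Println(missingSections(draft))
}
```

The check only looks at headings, so an empty section still passes; whether the content is useful at 3 AM remains a review question.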
Written
S4.6 acceptance gate — the four runbooks the go-live gate requires:
- `webhook-failure.md`: the bank-callback HMAC verifier rejects sustained traffic; diagnosis by `reason` label, mitigations per cause.
- `gateway-down.md`: `services/integration-gateway` is unhealthy or crashed; what survives (in-flight settlements) and what breaks (new disbursement attempts).
- `drift-detected.md`: `Ledger.ReconcileWithBank` reports non-zero drift; signed-drift causal branching, append-only-safe remediation.
- `bank-statement-parse-failed.md`: the Dashen CSV ingester rejects a file or more than 10% of its rows; format vs. encoding vs. data-quality split.
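The trigger described for the bank-statement parse failure (a rejected file, or more than 10% of rows rejected) can be written as a single predicate. This is an illustrative sketch only: `parseFailed` and its parameters are hypothetical names, and the authoritative condition is whatever the ingester and its runbook define.

```go
package main

import "fmt"

// parseFailed reports whether a statement import should be treated as failed:
// either the whole file was rejected, or strictly more than 10% of its rows were.
// Integer arithmetic avoids floating-point rounding at the boundary.
func parseFailed(fileRejected bool, rejectedRows, totalRows int) bool {
	if fileRejected {
		return true
	}
	return rejectedRows*10 > totalRows
}

func main() {
	fmt.Println(parseFailed(false, 10, 100)) // exactly 10%: does not trip
	fmt.Println(parseFailed(false, 11, 100)) // above 10%: trips
	fmt.Println(parseFailed(true, 0, 100))   // whole file rejected: trips
}
```

Note the strict inequality: a file with exactly 10% rejected rows does not count as failed under this reading of "> 10%".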
Reference + onboarding:
- `permission-matrix.md`: Phase A / GL-01..03 canonical reference covering every API endpoint's auth, tenant scope, actor source, role requirement, and audit emission. Every PR adding an endpoint must update this file.
To be written (priority order, post-go-live)
- `ledger-balance-drift.md`: drift between the Go-side independent sum and the derived view (an internal correctness check, distinct from the bank-side drift in `drift-detected.md`).
- `partner-payout-failed.md`: payout to bank/wallet failed at the gateway adapter and won't recover.
- `auth-token-validation-failures.md`: spike in JWT verification errors.
- `db-replication-lag.md`: read replica falls behind primary.
- `kafka-consumer-lag.md`: outbox events not being drained.