DemozPay — SLOs + Alerting Reference
Snapshot: 2026-05-29
Companion to: PRODUCTION_READINESS.md, RECONCILIATION_ARCHITECTURE.md, infra/prometheus/alerts.yml.
Phase D foundation deliverable. Codifies SLOs + alert thresholds so when the alerting backbone (Grafana + Alertmanager + paging provider — separate GL-08 deliverable) deploys, the rules are already authored and reviewable.
What this document is
The single source of truth for:
- Service-level objectives — what the platform commits to.
- Alert thresholds — when to page on-call vs investigate at business hours.
- Escalation paths — who owns what.
- Mapping to Prometheus rules — every SLO maps to one or more rules in
infra/prometheus/alerts.yml.
When a new metric ships, this doc + the alerts YAML get the same PR.
Status
| Layer | Phase D status |
|---|
| Metrics emitted (API + gateway + ledger + recon-runner) | LIVE |
| Alert rules YAML | LIVE as code; not yet loaded by a Prometheus instance. |
| Prometheus scrape config template | LIVE as code; reference for operator. |
| Grafana / Alertmanager / paging provider deployment | PLANNED — GL-08 backbone. |
| Statement-pull automation feeding the recon-runner | PLANNED — GL-10. |
| Push-gateway / per-(tenant, partner) last-success timestamp metric | PLANNED — recon-runner emits JSON today; operator wires Promtail/Loki OR adds pushgateway in a follow-up. |
SLOs
Money correctness
| SLO | Target | Error budget | Measurement |
|---|
| Drift = 0 per (tenant, account) per day | 99.9% of (tenant, account, day) tuples | 1 in 1000 days may have non-zero drift | demozpay_ledger_reconcile_drift_santim from the ReconcileWithBank RPC, after daily run. |
| Ledger RPC error rate | < 0.5% per RPC over 5min | 0.5% sustained | demozpay_ledger_rpc_requests_total{outcome=~"internal|failed_precondition"} / total |
| Ledger RPC latency | p99 < 500ms across all RPCs | < 1% of windows exceed | demozpay_ledger_rpc_latency_seconds_bucket p99 over 5min |
Settlement lifecycle
| SLO | Target | Error budget | Measurement |
|---|
| Webhook → ledger POSTED | 99% within 60s of webhook | 1% of webhooks may exceed 60s per month | demozpay_bank_settlement_apply_total{source="webhook",outcome="completed"} latency vs webhook receipt time. |
| Settlement (any source) → ledger POSTED | 99.9% within 4h of bank-side settlement | 0.1% per month | Composite of webhook + poller signals. |
| Settlement poller liveness | A tick every 30s ± 5s | No more than 5 missed ticks in 24h | demozpay_settlement_poller_ticks_total rate. |
| Webhook signature acceptance rate | > 99% of authentic webhook attempts | < 1% sustained rejection rate excluding hostile traffic | demozpay_bank_webhook_requests_total{result} breakdown. |
Account verification (Phase C)
| SLO | Target | Error budget | Measurement |
|---|
| LookupAccount latency | p95 < 1s per partner | < 1% of windows exceed | demozpay_integration_gateway_lookup_latency_seconds_bucket p95 over 5min |
| LookupAccount partner availability | PARTNER_UNAVAILABLE rate < 1% over 5min | 1% sustained | demozpay_integration_gateway_lookup_failure_total{reason="PARTNER_UNAVAILABLE"} rate vs total. |
Reconciliation cadence
| SLO | Target | Error budget | Measurement |
|---|
| Daily statement ingest success | ≥ 1 successful ingest per partner per day | 36h grace per month | Statement-pull automation success timestamp (PLANNED — GL-10). |
| Recon-runner completes daily | every (tenant, partner) pair processed within 24h | 36h max gap | recon-runner exit code + last-success timestamp (D2 emits JSON; pushgateway swap pending). |
| Flagged-line rate per partner | < 1% of statement lines per partner per day | 5% spike acceptable as warning | Tracked via recon-runner summary; metric demozpay_reconciliation_flagged_lines_total planned. |
Authentication / authorization (Phase A)
| SLO | Target | Error budget | Measurement |
|---|
| AuthGuard 401 rate | < 5% of authenticated routes | None — measured only | demozpay_http_requests_total{status_code="401"} |
| OrgRoleGuard 403 rate | < 1% of role-protected routes (sustained) | None — measured | demozpay_http_requests_total{status_code="403"} (plus guard's WARN-log enumeration). |
Dependency reachability
| SLO | Target | Error budget | Measurement |
|---|
| Postgres reachable | 99.99% uptime measured by /readyz | < 1 min/month | demozpay_dependency_up{dependency="postgres"} |
| Ledger gRPC reachable | 99.95% | < 22 min/month | demozpay_dependency_up{dependency="ledger"} |
| Integration gateway reachable | 99.95% | < 22 min/month | demozpay_dependency_up{dependency="integration-gateway"} |
| Redis reachable | 99.9% (cache miss tolerated) | 45 min/month | demozpay_dependency_up{dependency="redis"} |
| Redpanda reachable | 99.9% (outbox can tolerate brief outage) | 45 min/month | demozpay_dependency_up{dependency="kafka"} |
Alert severity matrix
| Severity | Behaviour | Examples |
|---|
| critical | Page on-call immediately. Money is at risk OR a core RPC is down. | drift > 0, ledger error rate spike, gateway down, settlement poller stalled, dependency down, partner unavailable on LookupAccount |
| warning | Investigate during business hours. SLO budget consumed faster than expected. | outbox stale, LookupAccount p95 high, ledger p99 latency, flagged-line spike |
| info | No paging; dashboards only. | OrgRoleGuard 403 rate (signal of misconfigured client) |
Mapping to alerts.yml
| Alert (in YAML) | Severity | Maps to SLO |
|---|
DemozpayBankWebhookSignatureFailureRate | critical | Settlement: signature acceptance |
DemozpayBankWebhookMissingHeaders | warning | Settlement: signature acceptance |
DemozpaySettlementPollerErrorRate | warning | Settlement: poller liveness |
DemozpaySettlementPollerStalled | critical | Settlement: poller liveness |
DemozpayOutboxStale | warning | Outbox publisher |
DemozpayOutboxBacklog | warning | Outbox publisher |
DemozpayDependencyDown | critical | Dependency reachability |
DemozpayLookupPartnerUnavailable | critical | LookupAccount partner availability |
DemozpayLookupPartnerTimeout | warning | LookupAccount latency |
DemozpayLookupP95Latency | warning | LookupAccount latency |
DemozpayLedgerErrorRate | critical | Ledger RPC error rate |
DemozpayLedgerP99Latency | warning | Ledger RPC latency |
DemozpayLedgerBankDriftNonZero | critical | Drift = 0 |
DemozpayReconciliationRunnerMissedDay | critical | Recon-runner completes daily |
DemozpayReconciliationFlaggedLineSpike | warning | Flagged-line rate |
Escalation paths
(Operational ownership — defined here for clarity, finalised in permission-matrix.md-style ops doc at pilot kick-off.)
| Severity | Primary on-call | Escalation 30min | Escalation 2h |
|---|
| critical | integration / platform on-call | engineering lead | CTO + finance lead |
| warning | integration / platform on-call | engineering lead | (business hours only) |
| info | dashboards | (no escalation) | (no escalation) |
Operational ownership
| Surface | Owner | Secondary |
|---|
| Bank webhook + applier metrics | integration team | platform |
| Ledger RPC metrics | platform / ledger team | integration |
| Settlement poller | integration team | platform |
| Outbox + Kafka | platform team | integration |
| LookupAccount | integration team | platform |
| Recon-runner / daily drift | finance/ops + integration | platform |
What is LIVE today
- Metrics emission on apps/api (Phase A baseline + Phase B settlement + Phase C lookup), gateway (Phase C lookup), ledger (Phase D — every RPC instrumented + RPC latency histogram + ledger-specific counters).
- Alert rules authored (
infra/prometheus/alerts.yml). Every rule has a runbook link (some runbooks are PLANNED).
- Scrape config template (
infra/prometheus/prometheus.yml).
- Recon-runner emits structured JSON suitable for Slack-webhook / Promtail ingest. Pushgateway-style counter emission is a Phase D continuation item.
What is PLANNED
| Item | Effort | Owner |
|---|
Deploy Prometheus + load alerts.yml | ~3 days | platform |
| Deploy Alertmanager + paging provider integration (PagerDuty / Opsgenie / Slack) | ~5 days | platform |
| Deploy Grafana + author dashboards from these SLOs | ~5 days | platform |
Add demozpay_reconciliation_* counters (recon-runner) and a pushgateway sidecar | ~2 days | integration |
Wire missing runbook stubs (outbox-publisher-stale.md, ledger-error-rate.md) | ~1 day | platform + ledger team |
| On-call rota + tabletop drills | 1 week elapsed | engineering lead |
| Statement-pull adapter feeding the recon-runner | ~1 week | integration |
What is intentionally NOT in scope
- Customer-facing status page — separate workstream.
- Synthetic monitoring (probes against partner APIs) — Q4 work.
- Per-tenant alerting (would explode cardinality) — operator-level dashboards only.
- Log-based alerts (only metrics-based for now) — Loki + log rules can be added once log shipping deploys.
Honest top-line
Phase D ships the alerting foundation as code:
- Every metric the platform exposes has at least one alert rule.
- Every alert rule has a runbook link (4 LIVE, 2 PLANNED).
- Every SLO has a measurement source.
What remains is deployment — a separate week's work for the platform team, not a code gap. The cost of deploying is well-bounded; the rules and SLOs are pre-reviewed.
Cross-references