DemozPay — SLOs + Alerting Reference

Snapshot: 2026-05-29

Companion to: PRODUCTION_READINESS.md, RECONCILIATION_ARCHITECTURE.md, infra/prometheus/alerts.yml.

Phase D foundation deliverable. Codifies SLOs + alert thresholds so that when the alerting backbone (Grafana + Alertmanager + paging provider, a separate GL-08 deliverable) deploys, the rules are already authored and reviewable.

What this document is

The single source of truth for:

  1. Service-level objectives — what the platform commits to.
  2. Alert thresholds — when to page on-call vs investigate at business hours.
  3. Escalation paths — who owns what.
  4. Mapping to Prometheus rules — every SLO maps to one or more rules in infra/prometheus/alerts.yml.

When a new metric ships, this doc + the alerts YAML get the same PR.

Status

| Layer | Phase D status |
| --- | --- |
| Metrics emitted (API + gateway + ledger + recon-runner) | LIVE |
| Alert rules YAML | LIVE as code; not yet loaded by a Prometheus instance. |
| Prometheus scrape config template | LIVE as code; reference for operator. |
| Grafana / Alertmanager / paging provider deployment | PLANNED (GL-08 backbone). |
| Statement-pull automation feeding the recon-runner | PLANNED (GL-10). |
| Push-gateway / per-(tenant, partner) last-success timestamp metric | PLANNED: recon-runner emits JSON today; operator wires Promtail/Loki OR adds pushgateway in a follow-up. |

SLOs

Money correctness

| SLO | Target | Error budget | Measurement |
| --- | --- | --- | --- |
| Drift = 0 per (tenant, account) per day | 99.9% of (tenant, account, day) tuples | 1 in 1,000 (tenant, account, day) tuples may have non-zero drift | demozpay_ledger_reconcile_drift_santim from the ReconcileWithBank RPC, after daily run. |
| Ledger RPC error rate | < 0.5% per RPC over 5min | 0.5% sustained | demozpay_ledger_rpc_requests_total{outcome=~"internal\|failed_precondition"} / total |
| Ledger RPC latency | p99 < 500ms across all RPCs | < 1% of windows exceed | demozpay_ledger_rpc_latency_seconds_bucket p99 over 5min |
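
The error-rate row above reduces to a ratio of counter increases over one 5-minute window. A minimal sketch of that check follows; the function names and the input shape (a dict of per-outcome increases) are assumptions for illustration, and only the two error outcomes and the 0.5% threshold come from this document.

```python
# Sketch only: evaluates the ledger RPC error-rate SLO for one RPC and one window.
# Function names and the dict-of-increases input are hypothetical.

ERROR_OUTCOMES = {"internal", "failed_precondition"}  # outcomes the SLO counts as errors
ERROR_RATE_THRESHOLD = 0.005  # 0.5% over a 5-minute window


def ledger_error_ratio(increase_by_outcome: dict[str, float]) -> float:
    """Return errors / total for one window; 0.0 when there was no traffic."""
    total = sum(increase_by_outcome.values())
    if total == 0:
        return 0.0
    errors = sum(v for k, v in increase_by_outcome.items() if k in ERROR_OUTCOMES)
    return errors / total


def breaches_error_slo(increase_by_outcome: dict[str, float]) -> bool:
    """True when the window's error ratio is at or above the 0.5% threshold."""
    return ledger_error_ratio(increase_by_outcome) >= ERROR_RATE_THRESHOLD
```

Treating an idle window as healthy is a design choice of this sketch; a real rule may prefer to emit no sample at all when the denominator is zero.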

Settlement lifecycle

| SLO | Target | Error budget | Measurement |
| --- | --- | --- | --- |
| Webhook → ledger POSTED | 99% within 60s of webhook | 1% of webhooks may exceed 60s per month | demozpay_bank_settlement_apply_total{source="webhook",outcome="completed"} latency vs webhook receipt time. |
| Settlement (any source) → ledger POSTED | 99.9% within 4h of bank-side settlement | 0.1% per month | Composite of webhook + poller signals. |
| Settlement poller liveness | A tick every 30s ± 5s | No more than 5 missed ticks in 24h | demozpay_settlement_poller_ticks_total rate. |
| Webhook signature acceptance rate | > 99% of authentic webhook attempts | < 1% sustained rejection rate excluding hostile traffic | demozpay_bank_webhook_requests_total{result} breakdown. |
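
The poller liveness budget can be made concrete with a little arithmetic: at one tick every 30s, a 24h window should contain 2,880 ticks, so the budget of 5 missed ticks allows a minimum of 2,875. A minimal sketch, assuming the observed tick count is the 24h increase of the ticks counter; the function names are illustrative, and the ± 5s jitter tolerance is not modelled here.

```python
# Sketch only: checks the "no more than 5 missed ticks in 24h" budget.
# Names are hypothetical; the 30s interval and the budget of 5 come from the SLO table.

TICK_INTERVAL_S = 30
MAX_MISSED_TICKS_PER_DAY = 5
DAY_S = 24 * 60 * 60


def expected_ticks(window_s: int = DAY_S) -> int:
    """Number of ticks a healthy poller produces in the window."""
    return window_s // TICK_INTERVAL_S


def missed_ticks(observed: int, window_s: int = DAY_S) -> int:
    """Shortfall against the expected count, never negative."""
    return max(0, expected_ticks(window_s) - observed)


def poller_within_budget(observed: int) -> bool:
    """True when the 24h shortfall stays within the error budget."""
    return missed_ticks(observed) <= MAX_MISSED_TICKS_PER_DAY
```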

Account verification (Phase C)

| SLO | Target | Error budget | Measurement |
| --- | --- | --- | --- |
| LookupAccount latency | p95 < 1s per partner | < 1% of windows exceed | demozpay_integration_gateway_lookup_latency_seconds_bucket p95 over 5min |
| LookupAccount partner availability | PARTNER_UNAVAILABLE rate < 1% over 5min | 1% sustained | demozpay_integration_gateway_lookup_failure_total{reason="PARTNER_UNAVAILABLE"} rate vs total. |
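
Both latency SLOs are measured as a quantile estimated from cumulative histogram buckets. The sketch below approximates the linear interpolation that PromQL's histogram_quantile performs, to show how a p95 falls out of bucket counts. The bucket boundaries and counts are invented for illustration; the real bucket layout of the lookup histogram is not stated in this document.

```python
import math

# Sketch only: quantile estimate from cumulative (upper_bound, count) buckets,
# sorted by upper bound, with the last bucket being +Inf.


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile lies in +Inf bucket: report highest finite bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical 5-minute window for one partner (bounds in seconds).
sample_buckets = [(0.25, 800), (0.5, 900), (1.0, 980), (2.5, 995), (math.inf, 1000)]
p95 = histogram_quantile(0.95, sample_buckets)  # 0.8125, under the 1s target
```

Because the estimate is interpolated, its accuracy depends on having a bucket boundary close to the SLO threshold (1s here, 0.5s for the ledger p99).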

Reconciliation cadence

| SLO | Target | Error budget | Measurement |
| --- | --- | --- | --- |
| Daily statement ingest success | ≥ 1 successful ingest per partner per day | 36h grace per month | Statement-pull automation success timestamp (PLANNED, GL-10). |
| Recon-runner completes daily | every (tenant, partner) pair processed within 24h | 36h max gap | recon-runner exit code + last-success timestamp (D2 emits JSON; pushgateway swap pending). |
| Flagged-line rate per partner | < 1% of statement lines per partner per day | 5% spike acceptable as warning | Tracked via recon-runner summary; metric demozpay_reconciliation_flagged_lines_total planned. |
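
One way to read the recon-runner row is as a three-way classification of the time since the last successful run for a (tenant, partner) pair: inside the 24h target, inside the 36h maximum gap, or missed. The sketch below encodes that reading. The status labels and function name are assumptions, and the last-success timestamp it consumes is the metric listed as PLANNED above.

```python
from datetime import datetime, timedelta, timezone

# Sketch only: classifies the gap since the last successful recon run.
# The 24h target and 36h max gap come from the SLO table; the labels are hypothetical.

TARGET_GAP = timedelta(hours=24)
MAX_GAP = timedelta(hours=36)


def recon_gap_status(last_success: datetime, now: datetime) -> str:
    gap = now - last_success
    if gap <= TARGET_GAP:
        return "on_target"
    if gap <= MAX_GAP:
        return "within_grace"  # late, but still inside the error budget
    return "missed"  # the condition a missed-day alert would fire on


now = datetime(2026, 5, 29, 12, 0, tzinfo=timezone.utc)
status = recon_gap_status(now - timedelta(hours=30), now)  # "within_grace"
```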

Authentication / authorization (Phase A)

| SLO | Target | Error budget | Measurement |
| --- | --- | --- | --- |
| AuthGuard 401 rate | < 5% of requests to authenticated routes | None (measured only) | demozpay_http_requests_total{status_code="401"} |
| OrgRoleGuard 403 rate | < 1% of requests to role-protected routes (sustained) | None (measured only) | demozpay_http_requests_total{status_code="403"} (plus guard's WARN-log enumeration). |

Dependency reachability

| SLO | Target | Error budget | Measurement |
| --- | --- | --- | --- |
| Postgres reachable | 99.99% uptime measured by /readyz | < 4.4 min/month | demozpay_dependency_up{dependency="postgres"} |
| Ledger gRPC reachable | 99.95% | < 22 min/month | demozpay_dependency_up{dependency="ledger"} |
| Integration gateway reachable | 99.95% | < 22 min/month | demozpay_dependency_up{dependency="integration-gateway"} |
| Redis reachable | 99.9% (cache miss tolerated) | ~43 min/month | demozpay_dependency_up{dependency="redis"} |
| Redpanda reachable | 99.9% (outbox can tolerate brief outage) | ~43 min/month | demozpay_dependency_up{dependency="kafka"} |
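
The monthly downtime implied by an availability target follows from simple arithmetic. A minimal sketch, assuming a 30-day month (43,200 minutes); the function name is illustrative.

```python
# Sketch only: converts an availability target into a monthly downtime budget.
# Assumes a 30-day month; a 31-day month yields slightly larger budgets.

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200


def downtime_budget_minutes(availability_pct: float,
                            minutes_in_month: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Minutes per month a dependency may be unreachable at the given target."""
    return (1 - availability_pct / 100) * minutes_in_month


budgets = {pct: round(downtime_budget_minutes(pct), 2) for pct in (99.99, 99.95, 99.9)}
# {99.99: 4.32, 99.95: 21.6, 99.9: 43.2}
```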

Alert severity matrix

| Severity | Behaviour | Examples |
| --- | --- | --- |
| critical | Page on-call immediately. Money is at risk OR a core RPC is down. | drift > 0, ledger error rate spike, gateway down, settlement poller stalled, dependency down, partner unavailable on LookupAccount |
| warning | Investigate during business hours. SLO budget consumed faster than expected. | outbox stale, LookupAccount p95 high, ledger p99 latency, flagged-line spike |
| info | No paging; dashboards only. | OrgRoleGuard 403 rate (signal of misconfigured client) |

Mapping to alerts.yml

| Alert (in YAML) | Severity | Maps to SLO |
| --- | --- | --- |
| DemozpayBankWebhookSignatureFailureRate | critical | Settlement: signature acceptance |
| DemozpayBankWebhookMissingHeaders | warning | Settlement: signature acceptance |
| DemozpaySettlementPollerErrorRate | warning | Settlement: poller liveness |
| DemozpaySettlementPollerStalled | critical | Settlement: poller liveness |
| DemozpayOutboxStale | warning | Outbox publisher |
| DemozpayOutboxBacklog | warning | Outbox publisher |
| DemozpayDependencyDown | critical | Dependency reachability |
| DemozpayLookupPartnerUnavailable | critical | LookupAccount partner availability |
| DemozpayLookupPartnerTimeout | warning | LookupAccount latency |
| DemozpayLookupP95Latency | warning | LookupAccount latency |
| DemozpayLedgerErrorRate | critical | Ledger RPC error rate |
| DemozpayLedgerP99Latency | warning | Ledger RPC latency |
| DemozpayLedgerBankDriftNonZero | critical | Drift = 0 |
| DemozpayReconciliationRunnerMissedDay | critical | Recon-runner completes daily |
| DemozpayReconciliationFlaggedLineSpike | warning | Flagged-line rate |

Escalation paths

(Operational ownership is defined here for clarity; it will be finalised in a permission-matrix.md-style ops doc at pilot kick-off.)

| Severity | Primary on-call | Escalation 30min | Escalation 2h |
| --- | --- | --- | --- |
| critical | integration / platform on-call | engineering lead | CTO + finance lead |
| warning | integration / platform on-call | engineering lead | (business hours only) |
| info | dashboards | (no escalation) | (no escalation) |
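
The escalation table can be read as a lookup from severity and time-since-firing to the set of people who should have been notified by then. The sketch below encodes that reading. The data structure and function name are assumptions, and it interprets "(business hours only)" as "no further escalation at 2h", which the ops doc may refine.

```python
# Sketch only: who should have been notified once an alert has been open N minutes.
# Party names come from the escalation table; everything else is hypothetical.

ESCALATION: dict[str, list[tuple[int, list[str]]]] = {
    "critical": [
        (0, ["integration / platform on-call"]),
        (30, ["engineering lead"]),
        (120, ["CTO", "finance lead"]),
    ],
    "warning": [
        (0, ["integration / platform on-call"]),
        (30, ["engineering lead"]),
    ],
    "info": [],  # dashboards only, nobody is notified
}


def notified_parties(severity: str, minutes_open: int) -> list[str]:
    """Return everyone who should have been notified after `minutes_open` minutes."""
    parties: list[str] = []
    for threshold_min, who in ESCALATION[severity]:
        if minutes_open >= threshold_min:
            parties.extend(who)
    return parties


notified_parties("critical", 45)
# ['integration / platform on-call', 'engineering lead']
```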

Operational ownership

| Surface | Owner | Secondary |
| --- | --- | --- |
| Bank webhook + applier metrics | integration team | platform |
| Ledger RPC metrics | platform / ledger team | integration |
| Settlement poller | integration team | platform |
| Outbox + Kafka | platform team | integration |
| LookupAccount | integration team | platform |
| Recon-runner / daily drift | finance/ops + integration | platform |

What is LIVE today

  • Metrics emission on apps/api (Phase A baseline + Phase B settlement + Phase C lookup), gateway (Phase C lookup), ledger (Phase D — every RPC instrumented + RPC latency histogram + ledger-specific counters).
  • Alert rules authored (infra/prometheus/alerts.yml). Every rule has a runbook link (some runbooks are PLANNED).
  • Scrape config template (infra/prometheus/prometheus.yml).
  • Recon-runner emits structured JSON suitable for Slack-webhook / Promtail ingest. Pushgateway-style counter emission is a Phase D continuation item.

What is PLANNED

| Item | Effort | Owner |
| --- | --- | --- |
| Deploy Prometheus + load alerts.yml | ~3 days | platform |
| Deploy Alertmanager + paging provider integration (PagerDuty / Opsgenie / Slack) | ~5 days | platform |
| Deploy Grafana + author dashboards from these SLOs | ~5 days | platform |
| Add demozpay_reconciliation_* counters (recon-runner) and a pushgateway sidecar | ~2 days | integration |
| Wire missing runbook stubs (outbox-publisher-stale.md, ledger-error-rate.md) | ~1 day | platform + ledger team |
| On-call rota + tabletop drills | 1 week elapsed | engineering lead |
| Statement-pull adapter feeding the recon-runner | ~1 week | integration |

What is intentionally NOT in scope

  • Customer-facing status page — separate workstream.
  • Synthetic monitoring (probes against partner APIs) — Q4 work.
  • Per-tenant alerting (would explode cardinality) — operator-level dashboards only.
  • Log-based alerts (only metrics-based for now) — Loki + log rules can be added once log shipping deploys.

Honest top-line

Phase D ships the alerting foundation as code:

  • Every metric the platform exposes has at least one alert rule.
  • Every alert rule has a runbook link (4 runbooks LIVE, 2 PLANNED).
  • Every SLO has a measurement source.

What remains is deployment — a separate week's work for the platform team, not a code gap. The cost of deploying is well-bounded; the rules and SLOs are pre-reviewed.

Cross-references