DemozPay — Production Readiness Audit
Snapshot: 2026-05-29
Author: Principal Architect / Acting CTO
Companion to: REAL_SYSTEM_STATE.md, GO_LIVE_BLOCKERS.md.
Scope: the operational surface (deploy, observe, recover, secure, support). Code correctness is covered elsewhere.
Scoring rubric
For each capability, four-bucket scoring:
| Score | Meaning |
|---|---|
| GREEN | Production-acceptable today. Real evidence; passes a vendor audit. |
| YELLOW | Works in dev / a single happy path. Will not survive partial failure, scale, or a real incident. |
| RED | Not implemented. A first-week production incident will surface this. |
| BLACK | Implemented in a way that creates incident risk. Worse than absent. |
The summary at the top of every section is the honest score; the table below it explains why.
§1. Observability — YELLOW (improved by Phase C + D)
Headline (updated Phase D, 2026-05-29): the metric surface is now complete across API + gateway + ledger. Alert rules and SLOs are codified. Deploying Prometheus + Alertmanager + a paging provider remains PLANNED; that is operational work, not a code gap.
| Capability | Score | Notes |
|---|---|---|
| Application metrics (Prometheus, NestJS) | GREEN | 10 cardinality-disciplined metrics. |
| Go-service metrics (ledger, gateway) | GREEN (Phase C + D) | gateway: lookup_* metrics from Phase C. ledger: rpc_requests_total{rpc,outcome} + rpc_latency_seconds{rpc} histogram + transaction-status-transitions + reconcile-drift gauge + entries-posted counter. Interceptor unit-tested. |
| Structured logging (Pino on TS, slog on Go) | GREEN | JSON, trace_id-injected. |
| PII redaction at logging layer | GREEN (Phase A3) | Pino redact config + slog ReplaceAttr. Correlation IDs preserved. |
| Request-ID propagation | YELLOW (improved Phase C) | Phase C added end-to-end correlation_id on LookupAccount. Generalising to every RPC is a follow-up. |
| OpenTelemetry tracing | YELLOW | SDK wired; no collector deployed. |
| Prometheus scrape config | GREEN (Phase D) | infra/prometheus/prometheus.yml reference template; loads alerts.yml. |
| Alert rules (Prometheus / Grafana) | YELLOW (Phase D) | infra/prometheus/alerts.yml LIVE as code with 15 rules covering every metric and 6 runbook links. Not yet loaded by a running Prometheus. |
| Dashboards (Grafana / Datadog / etc.) | RED | None in the repo. |
| SLOs and error budgets | GREEN (Phase D) | docs/architecture/SLOS_AND_ALERTING.md codifies every SLO + error budget + escalation path. |
| Long-term log retention | RED | No log shipper configured. |
| Trace retention | RED | No exporter destination. |
What changed in Phase D: the "metrics exist + nothing fires on them" gap is now "metrics exist + rules + SLOs are written, awaiting deployment". The deployment cost is well-bounded (~5-10 days of platform-team work) and pre-reviewed.
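To make the error-budget idea behind the SLO row concrete, the arithmetic can be sketched as below. The function names are hypothetical, and any target or window passed in is illustrative rather than a value from SLOS_AND_ALERTING.md.

```typescript
// Sketch only: hypothetical helpers. Targets and windows are supplied by the
// caller and are not taken from docs/architecture/SLOS_AND_ALERTING.md.

/** Fraction of requests allowed to fail under an availability SLO. */
function errorBudgetFraction(sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  return 1 - sloTarget;
}

/** Minutes of full outage the budget tolerates over the window. */
function allowedDowntimeMinutes(sloTarget: number, windowDays: number): number {
  return errorBudgetFraction(sloTarget) * windowDays * 24 * 60;
}

/** Share of the budget already consumed (1.0 means the budget is spent). */
function budgetConsumed(sloTarget: number, total: number, failed: number): number {
  if (total === 0) return 0;
  const allowedFailures = errorBudgetFraction(sloTarget) * total;
  return failed / allowedFailures;
}
```

For example, a 99.9% target over a 30-day window tolerates roughly 43 minutes of full outage; an alert rule would typically page on the rate at which `budgetConsumed` grows rather than on a single failure.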
§2. Deployment + Release Engineering — RED
Headline: there is no production deployment story. docker-compose.yml runs on a developer laptop. That's the deploy surface.
| Capability | Score | Notes |
|---|---|---|
| Container images | YELLOW | apps/api/Dockerfile rebuilt to work — pinned pnpm@8.15.9, tini, openssl. Go service images implicitly built by compose. No images in a registry. |
| Kubernetes manifests / Helm charts | RED | None in infra/. |
| Terraform / Pulumi / CDK | RED | None. |
| Service-mesh / ingress controller config | RED | None. |
| TLS at the edge | RED | Local dev is HTTP only. SECURITY_CONTROLS.md §A.1 lists TLS 1.3 as Planned. |
| Blue-green / canary strategy | RED | No deploy pipeline. |
| Rollback strategy | RED | Implicitly "redeploy previous image"; no documented procedure. |
| Migration safety (forward-compatible, never-down) | YELLOW | Migrations apply clean; ALTER TYPE ADD VALUE is forward-safe; no prisma migrate diff check in CI; no pg_repack or low-lock strategy documented. |
| Migration runner in CI | RED | Migrations run by hand. |
| Health-check based readiness gating | YELLOW | /healthz + /readyz exist; no k8s probe config because there's no k8s. |
| CI: lint + test + build + e2e | GREEN | .github/workflows/ci.yml. |
| CI: SAST + dep-vuln scanning | RED | No Snyk / Dependabot / OSV-scanner. |
| CI: secret-leak scanning | RED | No gitleaks / trufflehog. |
| CI: container image scanning | RED | No Trivy / grype. |
| Artifact signing (cosign / Sigstore) | RED | No signed releases. |
| SBOM generation | RED | None. |
Implication: going from "verify-s4-recon.sh passes" to "deployed to staging that mimics production" is a 4-to-6 week dedicated workstream with one platform engineer. Treat that as the schedule.
§3. Database + Storage — YELLOW
Headline: schemas are good. Operations around them are absent.
| Capability | Score | Notes |
|---|---|---|
| Schema correctness (RLS, FORCE RLS, verify-guard, NUMERIC(20,0), append-only triggers) | GREEN | ADR-013 + ADR-005 + ADR-009 — all proven at the migration layer. |
| Connection pooling | YELLOW | API uses Prisma defaults; Go services use pgxpool with default 25. No pgbouncer. |
| Read replicas + read-routing | RED | Single primary assumed everywhere. |
| Backups (point-in-time recovery) | RED | None documented. No wal-g / pgBackRest / cloud-native PITR. |
| Restore drills | RED | Never tested. |
| Disaster recovery (off-region replica) | RED | None. |
| Database upgrades (PG16 → PG17) | RED | No upgrade plan. |
| Migration safety (concurrent index, transactional DDL) | YELLOW | Migrations are forward-compatible by convention; no checker enforces. |
| Slow-query observability | RED | No pg_stat_statements config; no slow-query log shipping. |
| Locking + deadlock observability | RED | None. |
| Tenant DB-resource quotas | RED | None — one noisy tenant can drown others. |
| Idempotency-record TTL cleanup | BLACK | No DELETE policy. Table will reach hundreds of millions of rows. Index bloat will silently degrade every money-moving POST. |
| Audit-entry TTL / archival | RED | No policy. Same growth concern (smaller per-event but unbounded). |
Implication: the first real production database load will surface the absence of pooling, replicas, backups, and TTLs. Plan the operational database surface in parallel with the platform itself.
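The idempotency-record TTL gap above could be closed with a bounded, batched purge. The following is a minimal sketch under stated assumptions: `IdempotencyStore`, `PurgeOptions`, and `purgeExpired` are hypothetical names, not existing project code, and the retention period is whatever the business rule turns out to be.

```typescript
// Sketch only: IdempotencyStore is a hypothetical interface, not the project's
// data-access layer. A real implementation would issue a bounded DELETE
// against Postgres instead of filtering an array.

interface IdempotencyStore {
  /** Deletes up to `limit` records created before `cutoff`; returns the count deleted. */
  deleteExpired(cutoff: Date, limit: number): Promise<number>;
}

interface PurgeOptions {
  ttlMs: number;      // how long a record must be retained
  batchSize: number;  // rows per delete, kept small to bound lock time
  maxBatches: number; // safety cap per run
}

async function purgeExpired(
  store: IdempotencyStore,
  now: Date,
  opts: PurgeOptions,
): Promise<number> {
  const cutoff = new Date(now.getTime() - opts.ttlMs);
  let total = 0;
  for (let i = 0; i < opts.maxBatches; i++) {
    const deleted = await store.deleteExpired(cutoff, opts.batchSize);
    total += deleted;
    if (deleted < opts.batchSize) break; // a short batch means nothing is left
  }
  return total;
}

// In-memory stand-in used only to exercise the loop.
class InMemoryStore implements IdempotencyStore {
  constructor(public createdAt: Date[]) {}

  async deleteExpired(cutoff: Date, limit: number): Promise<number> {
    let deleted = 0;
    this.createdAt = this.createdAt.filter((t) => {
      if (deleted < limit && t.getTime() < cutoff.getTime()) {
        deleted++;
        return false; // drop this expired record
      }
      return true;
    });
    return deleted;
  }
}
```

Small batches with a per-run cap keep each delete short, so the purge does not itself become the source of lock contention on money-moving POSTs.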
§4. Secrets + Key Management — RED
Headline: every secret is an env var. There is no rotation story.
| Capability | Score | Notes |
|---|---|---|
| Secrets in env vars | YELLOW | LIVE in dev; production-unacceptable. |
| Vault / AWS SM / GCP SM / Azure KV | RED | None integrated. |
| Per-environment secret separation | RED | Single .env shape across dev/staging/prod assumed. |
| Bank HMAC signing key rotation | BLACK | No rotation runbook; no zero-downtime rotation (boot reads env once). |
| TLS cert lifecycle (request/renew/install) | RED | No certs in use. |
| Per-tenant data encryption keys (DEKs) | RED | SECURITY_CONTROLS.md §A.2 — Planned. |
| HSM-backed signing for ledger entries | RED | Planned. |
| Audit-log encryption | RED | Planned. |
| Pre-commit secret-leak hook | RED | None. |
Implication: the first quarterly key rotation will require an API restart and a coordinated handoff with the partner bank. Build a rotation runbook before the first key is shared.
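One common way to avoid a restart-and-handoff rotation is a key ring with an overlap window, where the old and new keys both verify for a period while only the newest key signs. The sketch below shows only the key-selection logic; the `SigningKey` shape and function names are hypothetical, and the MAC computation and the mechanism for reloading the ring at runtime are out of scope.

```typescript
// Sketch only: the key-ring shape and function names are hypothetical.
// Only key selection is shown; computing the HMAC is out of scope.

interface SigningKey {
  id: string;
  secret: string;
  activeFrom: Date; // first instant the key may be used
  retireAt?: Date;  // verification stops at this instant; undefined = not scheduled
}

/** Keys accepted for verifying signatures at `now` (old and new during overlap). */
function verificationKeys(ring: SigningKey[], now: Date): SigningKey[] {
  const t = now.getTime();
  return ring.filter(
    (k) =>
      k.activeFrom.getTime() <= t &&
      (k.retireAt === undefined || t < k.retireAt.getTime()),
  );
}

/** Key used for signing: the most recently activated key that is still valid. */
function signingKey(ring: SigningKey[], now: Date): SigningKey {
  const valid = verificationKeys(ring, now);
  if (valid.length === 0) throw new Error("no active signing key");
  return valid.reduce((a, b) =>
    b.activeFrom.getTime() > a.activeFrom.getTime() ? b : a,
  );
}
```

With this shape, a rotation runbook reduces to three steps: add the new key with a future `activeFrom`, let the overlap window pass, then set `retireAt` on the old key.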
§5. Authentication + Authorization — YELLOW (was BLACK pre-Phase A)
Headline: Phase A (2026-05-29) closed all three BLACK items. AuthZ now has real enforcement at three layers (guard + service + body force-set). Section moves from BLACK → YELLOW. Still YELLOW because: per-aggregate Employee ownership defers to Phase B; OrgRoleGuard does an uncached DB lookup per RBAC-protected route; platform-admin seeding is operator-driven; HTTP-E2E supertest harness pending.
| Capability | Score | Notes |
|---|---|---|
| better-auth email + password sign-in | GREEN | LIVE. |
| Email verification gate | YELLOW | Skipped in dev; enforced only when NODE_ENV=production. |
| Phone OTP plugin | YELLOW | Plugin wired; SMS sender is LoggingSmsSender — logs to stdout. In production this means OTP is never sent. |
| TOTP / WebAuthn 2FA | RED | twoFactor table exists; plugin not wired. |
| Session storage | GREEN | Prisma Session table. |
| Global AuthGuard | GREEN | APP_GUARD enforces req.user.id. |
| RBAC (role → permission → endpoint) | GREEN (Phase A2) | OrgRoleGuard enforces @RequireOrgRole/@RequirePlatformAdmin metadata via Member + AdminProfile lookups. RequestUser.role removed (single source of truth = Member.role). 11 unit tests cover every branch. |
| Per-aggregate ownership checks | YELLOW (Phase A1+A2) | x-actor-id header rejected with 400 by TenantContextMiddleware; actorId sourced from session. Audit log no longer poisonable. EWA/Lending temporarily restricted to admin/owner pending Employee.userId schema link (Phase B). |
| Tenant scoping on Business + Employee CRUD | GREEN (Phase A2) | Both controllers under SessionMiddleware + TenantContextMiddleware. Service-layer methods take tenantId as first param; EmployeeController force-sets businessId on writes; body mismatch → 403 TENANT_MISMATCH. |
| Tenant scoping on financial tables | GREEN | RLS + FORCE RLS + verify-guard. |
| Outbox publisher BYPASSRLS role | YELLOW | Gated on env OUTBOX_DATABASE_URL. Without it, drains zero rows silently (loud warning present). |
| Service-to-service auth (API → gateway, API → ledger) | RED | gRPC calls are plaintext, no mTLS, no token auth. |
| Webhook signature verification (HMAC) | GREEN | LIVE, 5-min clock skew, timing-safe equal. |
| Audit-log immutability | YELLOW | No UPDATE/DELETE triggers; app code respects it; not structurally guaranteed. |
| Audit-log poisonability | GREEN (Phase A1) | x-actor-id rejected with 400; actorId from session only. |
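Before Phase A this section needed the most urgent attention: the three BLACK items were 1-to-3-day fixes individually and were blocking pilot. Phase A closed all three (see the headline above); the remaining YELLOW items are listed there.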
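The "force-set businessId, reject on mismatch" behaviour described in the tenant-scoping row can be illustrated as a small pure function. This is a sketch of the pattern, not the actual EmployeeController code; `forceTenant`, `EmployeeWriteBody`, and `TenantMismatchError` are hypothetical names.

```typescript
// Sketch only: names are hypothetical; the real EmployeeController may differ.

class TenantMismatchError extends Error {
  readonly status = 403;
  readonly code = "TENANT_MISMATCH";

  constructor() {
    super("body businessId does not match the session tenant");
  }
}

interface EmployeeWriteBody {
  businessId?: string;
  fullName: string;
}

/**
 * Returns a copy of the body whose businessId is always the session tenant.
 * A body naming a different tenant is rejected rather than silently rewritten,
 * so a probing client gets an explicit 403 instead of a misleading success.
 */
function forceTenant(
  sessionTenantId: string,
  body: EmployeeWriteBody,
): EmployeeWriteBody & { businessId: string } {
  if (body.businessId !== undefined && body.businessId !== sessionTenantId) {
    throw new TenantMismatchError();
  }
  return { ...body, businessId: sessionTenantId };
}
```

The tenant identifier comes only from the session, never from the request body, which is the same principle Phase A applied to actorId.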
§6. Incident Response + Operations — RED
Headline: four incident runbooks exist. Almost everything else doesn't.
| Capability | Score | Notes |
|---|---|---|
| Go-live runbooks (webhook, gateway-down, drift, statement-parse) | GREEN | Written, opinionated, with diagnosis branches. |
| Post-go-live runbook backlog | RED | 5+ runbooks scoped in runbooks/README.md; not written. |
| Routine-process runbooks (daily recon triage, dispute intake, reversal procedure) | RED | None. Reconciliation arch doc enumerates 8+ missing runbooks. |
| Paging integration (PagerDuty / Opsgenie) | RED | None. |
| On-call rota documented | RED | None. |
| Incident commander / scribe roles | RED | Undefined. |
| Postmortem template + cadence | RED | None. |
| Status page (customer-facing) | RED | None. |
| Customer communication templates | RED | None. |
| Internal admin tooling (replay webhook, force-resync, view account) | RED | None. Operators run psql against three databases. |
| Support ticketing integration | RED | No issue tracker linked. |
| Fraud-response playbook | RED | None. |
| AML/sanctions-hit playbook | RED | None. |
Implication: the first real incident on a real rail will be triaged by someone improvising. Every minute of that improvisation is brand damage with the partner bank.
§7. Risk + Fraud + Compliance — RED
Headline: the platform is one regulator conversation away from a difficult question for which there is no answer.
| Capability | Score | Notes |
|---|---|---|
| KYC (identity verification, document capture, liveness) | RED | packages/kyc/ does not exist. |
| Sanctions screening pre-disbursement | RED | None. |
| AML transaction monitoring | RED | None. |
| Suspicious-pattern detection | RED | None. |
| Velocity limits per (employee, employer, period) | RED | EWA has eligibility math; no velocity ceiling. |
| Device fingerprinting | RED | None. |
| Behavioural analytics | RED | None. |
| Per-partner exposure limits | RED | None. |
| Daily-volume circuit breaker | RED | None. |
| Regulatory reporting (NBE monthly returns, SAR/STR) | RED | None. |
| Right-to-be-forgotten / GDPR-style data export | RED | None. |
| Data-residency controls | RED | None — PII goes wherever Postgres goes. |
| ADR-014 ("DemozPay is orchestrator, not custodian") | RED | Recommended in action plan; not written. |
Implication: pilot conversations with NBE or with a partner bank's compliance team will ask for each row above. "Planned" is a roadmap claim, not a satisfactory answer.
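As an illustration of the missing velocity ceiling per (employee, employer, period), a look-back window check might look like the sketch below. All names and limits are hypothetical; amounts are modelled as integer santim in a `number` for brevity, whereas production code would more likely use `bigint` to match the NUMERIC(20,0) columns.

```typescript
// Sketch only: names and limits are hypothetical. Amounts are integer santim;
// production code would likely use bigint to match NUMERIC(20,0).

interface Disbursement {
  employeeId: string;
  employerId: string;
  amountSantim: number;
  at: Date;
}

interface VelocityRule {
  windowMs: number;       // look-back period
  maxCount: number;       // max disbursements per (employee, employer) in the window
  maxTotalSantim: number; // max summed amount in the window
}

type Decision = { allowed: true } | { allowed: false; reason: "COUNT" | "AMOUNT" };

function checkVelocity(
  history: Disbursement[],
  candidate: Disbursement,
  rule: VelocityRule,
): Decision {
  const until = candidate.at.getTime();
  const since = until - rule.windowMs;
  // Only prior disbursements for the same pair inside the window count.
  const recent = history.filter(
    (d) =>
      d.employeeId === candidate.employeeId &&
      d.employerId === candidate.employerId &&
      d.at.getTime() > since &&
      d.at.getTime() <= until,
  );
  if (recent.length + 1 > rule.maxCount) {
    return { allowed: false, reason: "COUNT" };
  }
  const total = recent.reduce((sum, d) => sum + d.amountSantim, 0);
  if (total + candidate.amountSantim > rule.maxTotalSantim) {
    return { allowed: false, reason: "AMOUNT" };
  }
  return { allowed: true };
}
```

In a real deployment the check and the insert would need to happen atomically (for example inside one database transaction), otherwise two concurrent requests can both pass the ceiling.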
§8. Scale + Performance — UNKNOWN (untested)
We have no production traffic numbers, no load test, no benchmark.
| Capability | Score | Notes |
|---|---|---|
| Load test (sustained 10x expected pilot traffic) | RED | Never run. |
| Stress test (find the breaking point) | RED | Never run. |
| Spike test (10x in 60s) | RED | Never run. |
| Soak test (24h sustained) | RED | Reconciliation soak (S4.6) is a logic soak, not a load soak. |
| Postgres saturation profile | RED | Unknown. |
| gRPC connection pool tuning | RED | Defaults. |
| HTTP keep-alive / connection-reuse tuning | RED | Defaults. |
| CDN / edge caching | RED | None. |
Implication: the first 100-employer pilot may surprise us in either direction. Build a load test BEFORE the pilot, not during it.
§9. Service-by-service score card
| Service | Today's score | Why |
|---|---|---|
| services/ledger | YELLOW | LIVE primitive; RPC metrics added in Phase D (§1); no replicas, no PITR. |
| services/integration-gateway | YELLOW (improved from BLACK in §10 by Phase C) | LIVE happy path; LookupAccount LIVE for existence-check + typed-failure taxonomy + metrics surface (Phase C, 2026-05-29). Name-match deferred; circuit-breaker / retry policy missing. |
| services/notifications | RED | Stub: /health only. |
| services/bank-sandbox | GREEN (as a test harness) | Excellent for what it is. Not for production. |
| apps/api | YELLOW | LIVE for EWA + lending happy path; AuthZ promoted from BLACK to YELLOW by Phase A (§5). |
| apps/admin-web | RED | Mock-only. |
| apps/employer-web | RED | Mock-only. The highest-value frontend. |
| apps/employee-web | BLACK | Mock-only with localStorage-based fake auth. Looks real, is theatre. |
| apps/fi-web | YELLOW | Calls API; not end-to-end verified. |
| apps/merchant-web | YELLOW | Calls API; not end-to-end verified. |
| apps/docs-web | RED | Empty template. |
§10. The five-axis risk register
Ranked by likelihood × blast-radius, not by code surface.
Highest production risks
- AuthZ bypass via x-actor-id: any logged-in user can act as anyone else. 1-day fix; blocks pilot. BLACK
- Cross-tenant business/employee enumeration: any logged-in user reads every employer's roster. 1-day fix; blocks pilot. BLACK
- PII in logs: every error potentially logs national-IDs in plain text. 2-day fix; blocks pilot. BLACK
- LookupAccount stub: PARTIAL-CLOSED (Phase C, 2026-05-29). Existence-check ships LIVE: a disburse to a NON-EXISTENT account is now structurally impossible (use case rejects at 409 before any ledger / partner side effect). Name-match (different existing account) remains as Phase C continuation work: the adapter returns the resolved name; the use case doesn't compare it because the DTO carries no expected-name field. Blast radius reduced from unrecoverable to operator-actionable. YELLOW (was BLACK).
- No alerts on existing metrics: webhook failure, drift, gateway down all surface as silent counter bumps. Nobody is paged. 2-week fix (Grafana + Alertmanager + paging). RED
Highest operational risks
- No daily reconciliation cadence. Drift detection primitive exists; nothing runs it. RED
- No incident paging. Runbooks exist; nothing fires them. RED
- No admin tooling. Operators run psql against three databases. RED
- No partner key-rotation runbook. First rotation will be ad-hoc. RED
- No backups + no PITR. First disk failure is total. RED
Highest financial risks
- EWA cannot be repaid — disbursed obligations sit forever; month-end will surface millions of santim in unexplained receivables. BLACK
- Lending repayment is admin-driven — at scale a human cannot run thousands of installments. Without payroll, lending cannot grow past pilot. RED
- PayrollClearing → FI remittance does not happen — even when repayment is recorded, funds don't return to the FI. The phantom-asset will catch a regulator's eye. RED
- No velocity / daily ceiling per partner — a misconfiguration could initiate 10,000 disbursements before anyone notices. RED
- Idempotency-record TTL absent — quiet performance death over 6–12 months. YELLOW
Highest reconciliation risks
- No daily cadence + no dashboard — drift detection runs only when invoked manually. RED
- No statement-pull automation — humans drop files. A skipped day is invisible until drift accumulates. RED
- Partial-settlement event handler missing — bank deducts a correspondent fee; our reconciliation thinks settled-amount = initiated-amount; drift appears. RED
- Period-boundary timing rule unenforced — "don't finalize day N until day N+1's statement arrives" is a documented rule with no code. YELLOW
- Match-rate / flagged-rate not dashboarded — finance/ops fly blind. RED
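The period-boundary rule listed above ("don't finalize day N until day N+1's statement arrives") is small enough to sketch directly. The function names are hypothetical. Requiring day N's own statement as well is an assumption added here, not part of the documented rule, and days are treated as ISO dates in a single agreed timezone.

```typescript
// Sketch only: function names are hypothetical. Days are ISO calendar dates
// (YYYY-MM-DD) interpreted in one agreed timezone (UTC here).

/** Returns the ISO date one calendar day after `isoDay`. */
function nextDay(isoDay: string): string {
  const d = new Date(`${isoDay}T00:00:00Z`);
  if (Number.isNaN(d.getTime())) {
    throw new RangeError(`invalid day: ${isoDay}`);
  }
  d.setUTCDate(d.getUTCDate() + 1); // rolls over month and year boundaries
  return d.toISOString().slice(0, 10);
}

/**
 * Day N may be finalized only once the statement for day N+1 has been
 * ingested, so entries that post late against day N are not missed.
 * Also requiring day N's own statement is an assumption of this sketch.
 */
function canFinalize(day: string, statementDays: ReadonlySet<string>): boolean {
  return statementDays.has(day) && statementDays.has(nextDay(day));
}
```

A scheduled reconciliation job could call `canFinalize` before closing a period and leave the day open, with a visible flag, whenever it returns false.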
Highest fraud risks
- x-actor-id poisoning: see AuthZ. Audit logs lie. BLACK
- No sanctions screening: a partner-bank compliance hit on a transfer we made is a partnership-ending event. RED
- No velocity ceilings — single employee can extract maximum-eligible EWA in a single second. RED
- No device fingerprinting / no behavioural analytics — first credential-stuffing attack is undetectable. RED
- No fraud-response playbook — when fraud lands, response is improvised. RED
§11. The honest top-line
Updated 2026-05-29 after Phase A close-out:
Out of 8 readiness sections:
- GREEN: 0
- YELLOW: 3 (observability, DB, AuthN/Z; the last promoted from BLACK by Phase A)
- RED: 4 (deploy, secrets, incident, risk/fraud/compliance)
- BLACK: 0 — Phase A closed every BLACK item (x-actor-id, cross-tenant CRUD, PII redaction)
- UNKNOWN: 1 (scale)
The platform is less production-dangerous than before Phase A but still not production-ready. The next dial-moves are Phase B (money correctness — EWA repayment + FI remittance) and Phase D (operational visibility — alerting + recon cadence). 90_DAY_EXECUTION_PLAN.md sequences the rest.
§12. Cross-references
- What blocks go-live → GO_LIVE_BLOCKERS.md.
- How we fix this in 90 days → 90_DAY_EXECUTION_PLAN.md.
- Per-domain completeness → DOMAIN_COMPLETENESS_MATRIX.md.