DemozPay — Production Readiness Audit

Snapshot: 2026-05-29
Author: Principal Architect / Acting CTO
Companion to: REAL_SYSTEM_STATE.md, GO_LIVE_BLOCKERS.md.
Scope: the operational surface (deploy, observe, recover, secure, support). Code correctness is covered elsewhere.

Scoring rubric

For each capability, four-bucket scoring:

Score	Meaning
GREEN	Production-acceptable today. Real evidence; passes a vendor audit.
YELLOW	Works in dev / a single happy path. Will not survive partial failure, scale, or a real incident.
RED	Not implemented. A first-week production incident will surface this.
BLACK	Implemented in a way that creates incident risk. Worse than absent.

The summary at the top of every section is the honest score; the table below it explains why.

§1. Observability — YELLOW (improved by Phase C + D)

Headline (updated Phase D, 2026-05-29): the metric surface is now complete across API + gateway + ledger. Alert rules + SLOs are codified. Deployment of Prometheus + Alertmanager + a paging provider remains PLANNED; that is an operational gap, not a code gap.

Capability	Score	Notes
Application metrics (Prometheus, NestJS)	GREEN	10 cardinality-disciplined metrics.
Go-service metrics (ledger, gateway)	GREEN (Phase C + D)	gateway: `lookup_` metrics from Phase C. ledger: `rpc_requests_total{rpc,outcome}` + `rpc_latency_seconds{rpc}` histogram + transaction-status-transitions + reconcile-drift gauge + entries-posted counter. Interceptor unit-tested.
Structured logging (Pino on TS, slog on Go)	GREEN	JSON, trace_id-injected.
PII redaction at logging layer	GREEN (Phase A3)	Pino `redact` config + slog `ReplaceAttr`. Correlation IDs preserved.
Request-ID propagation	YELLOW (improved Phase C)	Phase C added end-to-end correlation_id on LookupAccount. Generalising to every RPC is a follow-up.
OpenTelemetry tracing	YELLOW	SDK wired; no collector deployed.
Prometheus scrape config	GREEN (Phase D)	`infra/prometheus/prometheus.yml` reference template; loads `alerts.yml`.
Alert rules (Prometheus / Grafana)	YELLOW (Phase D)	`infra/prometheus/alerts.yml` LIVE as code with 15 rules covering every metric and 6 runbook links. Not yet loaded by a running Prometheus.
Dashboards (Grafana / Datadog / etc.)	RED	None in the repo.
SLOs and error budgets	GREEN (Phase D)	`docs/architecture/SLOS_AND_ALERTING.md` codifies every SLO + error budget + escalation path.
Long-term log retention	RED	No log shipper configured.
Trace retention	RED	No exporter destination.

What changed in Phase D: the "metrics exist + nothing fires on them" gap is now "metrics exist + rules + SLOs are written, awaiting deployment". The deployment cost is well-bounded (~5-10 days of platform-team work) and pre-reviewed.
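The error-budget arithmetic behind an availability SLO can be sketched as follows. This is a minimal illustration, not the contents of `SLOS_AND_ALERTING.md`: the function names are hypothetical, and the 99.9% target and 30-day window are example inputs rather than DemozPay's codified values.

```typescript
// Hypothetical helpers. The SLO figures used with them are illustrative only.

function assertValidTarget(sloTarget: number): void {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
}

/** Minutes of full unavailability that an availability SLO tolerates over a window. */
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  assertValidTarget(sloTarget);
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

/**
 * Fraction of a request-based error budget that is still unspent.
 * 1 means untouched, 0 means exhausted, negative means overspent.
 */
function budgetRemaining(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number,
): number {
  assertValidTarget(sloTarget);
  if (totalRequests <= 0) return 1; // no traffic, so nothing has been spent
  const allowedFailures = (1 - sloTarget) * totalRequests;
  return 1 - failedRequests / allowedFailures;
}
```

With these definitions, a 99.9% target over 30 days works out to roughly 43.2 minutes of budget, and 250 failures in 1,000,000 requests spend about a quarter of it. An alert rule would typically page on how fast that remainder is shrinking rather than on the raw counter.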

§2. Deployment + Release Engineering — RED

Headline: there is no production deployment story. docker-compose.yml runs on a developer laptop. That's the deploy surface.

Capability	Score	Notes
Container images	YELLOW	`apps/api/Dockerfile` rebuilt into a working image (pinned pnpm@8.15.9, tini, openssl). Go service images are built implicitly by compose. No images in a registry.
Kubernetes manifests / Helm charts	RED	None in `infra/`.
Terraform / Pulumi / CDK	RED	None.
Service-mesh / ingress controller config	RED	None.
TLS at the edge	RED	Local dev is HTTP only. `SECURITY_CONTROLS.md` §A.1 lists TLS 1.3 as `Planned`.
Blue-green / canary strategy	RED	No deploy pipeline.
Rollback strategy	RED	Implicitly "redeploy previous image"; no documented procedure.
Migration safety (forward-compatible, never-down)	YELLOW	Migrations apply clean; `ALTER TYPE ADD VALUE` is forward-safe; no `prisma migrate diff` check in CI; no `pg_repack` or low-lock strategy documented.
Migration runner in CI	RED	Migrations run by hand.
Health-check based readiness gating	YELLOW	`/healthz` + `/readyz` exist; no k8s probe config because there's no k8s.
CI: lint + test + build + e2e	GREEN	`.github/workflows/ci.yml`.
CI: SAST + dep-vuln scanning	RED	No Snyk / Dependabot / OSV-scanner.
CI: secret-leak scanning	RED	No gitleaks / trufflehog.
CI: container image scanning	RED	No Trivy / grype.
Artifact signing (cosign / Sigstore)	RED	No signed releases.
SBOM generation	RED	None.

Implication: going from "verify-s4-recon.sh passes" to "deployed to staging that mimics production" is a 4-to-6 week dedicated workstream with one platform engineer. Treat that as the schedule.
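The "Migration safety" row notes that nothing in CI checks migrations. One possible shape for such a check is a small lint over the migration SQL, sketched below. The rule names and patterns are assumptions chosen for illustration, not an existing DemozPay tool, and the statement splitting is deliberately naive (it would mis-split semicolons inside string literals or function bodies).

```typescript
// Illustrative rule set. Names and patterns are assumptions, not project policy.
interface LintFinding {
  rule: string;
  statement: string;
}

const MIGRATION_RULES: { rule: string; matches: (sql: string) => boolean }[] = [
  {
    // A plain CREATE INDEX blocks writes to the table while it builds.
    rule: "index-without-concurrently",
    matches: (sql) =>
      /^CREATE\s+(UNIQUE\s+)?INDEX\b/i.test(sql) && !/\bCONCURRENTLY\b/i.test(sql),
  },
  // Dropping a column or table breaks the previous app version during a rollout.
  { rule: "drop-column", matches: (sql) => /\bDROP\s+COLUMN\b/i.test(sql) },
  { rule: "drop-table", matches: (sql) => /^DROP\s+TABLE\b/i.test(sql) },
  // Changing a column type can force a full table rewrite.
  { rule: "column-type-change", matches: (sql) => /\bALTER\s+COLUMN\b[\s\S]*\bTYPE\b/i.test(sql) },
];

function lintMigration(sql: string): LintFinding[] {
  const statements = sql
    .replace(/--.*$/gm, "") // strip line comments
    .split(";") // naive split, see the caveat above
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const findings: LintFinding[] = [];
  for (const statement of statements) {
    for (const { rule, matches } of MIGRATION_RULES) {
      if (matches(statement)) findings.push({ rule, statement });
    }
  }
  return findings;
}
```

A CI step could fail the build whenever `lintMigration` returns a non-empty list, with an explicit override for reviewed exceptions. `ALTER TYPE ... ADD VALUE`, which the table calls forward-safe, passes this rule set unflagged.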

§3. Database + Storage — YELLOW

Headline: schemas are good. Operations around them are absent.

Capability	Score	Notes
Schema correctness (RLS, FORCE RLS, verify-guard, NUMERIC(20,0), append-only triggers)	GREEN	ADR-013 + ADR-005 + ADR-009 — all proven at the migration layer.
Connection pooling	YELLOW	API uses Prisma defaults; Go services use pgxpool with default 25. No pgbouncer.
Read replicas + read-routing	RED	Single primary assumed everywhere.
Backups (point-in-time recovery)	RED	None documented. No `wal-g` / `pgBackRest` / cloud-native PITR.
Restore drills	RED	Never tested.
Disaster recovery (off-region replica)	RED	None.
Database upgrades (PG16 → PG17)	RED	No upgrade plan.
Migration safety (concurrent index, transactional DDL)	YELLOW	Migrations are forward-compatible by convention; no checker enforces.
Slow-query observability	RED	No `pg_stat_statements` config; no slow-query log shipping.
Locking + deadlock observability	RED	None.
Tenant DB-resource quotas	RED	None — one noisy tenant can drown others.
Idempotency-record TTL cleanup	BLACK	No DELETE policy. Table will reach hundreds of millions of rows. Index bloat will silently degrade every money-moving POST.
Audit-entry TTL / archival	RED	No policy. Same growth concern (smaller per-event but unbounded).

Implication: the first real production database load will surface the absence of pooling, replicas, backups, and TTLs. Plan the operational database surface in parallel with the platform itself.
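For the idempotency-record TTL gap, a cleanup job usually deletes in bounded batches so that no single statement holds locks for long. The sketch below shows only that batching loop. The record shape, the store interface, and the retention period are hypothetical; a real store would issue a batched SQL DELETE and would be asynchronous.

```typescript
// Hypothetical shapes. The real table, columns, and retention period are not in this audit.
interface IdempotencyRecord {
  key: string;
  createdAt: Date;
}

/** Minimal store contract for the sketch. */
interface IdempotencyStore {
  /** Deletes up to `limit` records older than `cutoff` and returns how many were removed. */
  deleteOlderThan(cutoff: Date, limit: number): number;
}

/** In-memory stand-in so the loop can be exercised without a database. */
class InMemoryStore implements IdempotencyStore {
  records: IdempotencyRecord[];

  constructor(records: IdempotencyRecord[]) {
    this.records = [...records];
  }

  deleteOlderThan(cutoff: Date, limit: number): number {
    let removed = 0;
    this.records = this.records.filter((r) => {
      if (removed < limit && r.createdAt.getTime() < cutoff.getTime()) {
        removed += 1;
        return false;
      }
      return true;
    });
    return removed;
  }
}

/** Purges records older than `ttlMs` in batches of `batchSize`; returns the total removed. */
function purgeExpired(
  store: IdempotencyStore,
  now: Date,
  ttlMs: number,
  batchSize: number,
): number {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  const cutoff = new Date(now.getTime() - ttlMs);
  let total = 0;
  for (;;) {
    const removed = store.deleteOlderThan(cutoff, batchSize);
    total += removed;
    if (removed < batchSize) return total; // a short batch means nothing older remains
  }
}
```

The TTL has to exceed the longest window in which a client may legitimately retry a money-moving POST, otherwise the cleanup itself reintroduces duplicate-execution risk.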

§4. Secrets + Key Management — RED

Headline: every secret is an env var. There is no rotation story.

Capability	Score	Notes
Secrets in env vars	YELLOW	LIVE in dev; production-unacceptable.
Vault / AWS SM / GCP SM / Azure KV	RED	None integrated.
Per-environment secret separation	RED	Single `.env` shape across dev/staging/prod assumed.
Bank HMAC signing key rotation	BLACK	No rotation runbook; no zero-downtime rotation (boot reads env once).
TLS cert lifecycle (request/renew/install)	RED	No certs in use.
Per-tenant data encryption keys (DEKs)	RED	`SECURITY_CONTROLS.md` §A.2 — `Planned`.
HSM-backed signing for ledger entries	RED	`Planned`.
Audit-log encryption	RED	`Planned`.
Pre-commit secret-leak hook	RED	None.

Implication: the first quarterly key rotation will require an API restart and a coordinated handoff with the partner bank. Build a rotation runbook before the first key is shared.
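Zero-downtime rotation generally needs two things the current "boot reads env once" model lacks: a key set that can be swapped at runtime, and an overlap window in which both the old and new key are accepted. A minimal sketch of that idea follows. The types, key IDs, and validity windows are assumptions, and the sketch deliberately leaves out the HMAC computation and the secret-store integration.

```typescript
// Hypothetical types. Key ids, validity windows, and the reload trigger are assumptions.
interface SigningKey {
  kid: string;
  secret: string;
  notBefore: Date; // first instant the key may be used
  notAfter: Date; // last instant the key is still accepted
}

class KeyRing {
  private keys: SigningKey[] = [];

  /** Swaps the key set at runtime (for example after polling a secret store), with no restart. */
  reload(keys: SigningKey[]): void {
    this.keys = [...keys];
  }

  /** Keys acceptable for verifying inbound signatures at `now`. */
  verificationKeys(now: Date): SigningKey[] {
    const t = now.getTime();
    return this.keys.filter(
      (k) => k.notBefore.getTime() <= t && t <= k.notAfter.getTime(),
    );
  }

  /** The newest valid key signs outbound requests; undefined if no key is valid. */
  signingKey(now: Date): SigningKey | undefined {
    return this.verificationKeys(now).sort(
      (a, b) => b.notBefore.getTime() - a.notBefore.getTime(),
    )[0];
  }
}
```

During the overlap, a verifier tries each key returned by `verificationKeys` and accepts the message if any one matches, which lets DemozPay and the partner bank switch keys at different moments without rejecting each other's traffic.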

§5. Authentication + Authorization — YELLOW (was BLACK pre-Phase A)

Headline: Phase A (2026-05-29) closed all three BLACK items. AuthZ now has real enforcement at three layers (guard + service + body force-set). Section moves from BLACK → YELLOW. Still YELLOW because: per-aggregate Employee ownership defers to Phase B; OrgRoleGuard does an uncached DB lookup per RBAC-protected route; platform-admin seeding is operator-driven; HTTP-E2E supertest harness pending.

Capability	Score	Notes
better-auth email + password sign-in	GREEN	LIVE.
Email verification gate	YELLOW	Skipped in dev; enforced only when `NODE_ENV=production`.
Phone OTP plugin	YELLOW	Plugin wired; SMS sender is `LoggingSmsSender` — logs to stdout. In production this means OTP is never sent.
TOTP / WebAuthn 2FA	RED	`twoFactor` table exists; plugin not wired.
Session storage	GREEN	Prisma Session table.
Global AuthGuard	GREEN	`APP_GUARD` enforces `req.user.id`.
RBAC (role → permission → endpoint)	GREEN (Phase A2)	`OrgRoleGuard` enforces `@RequireOrgRole`/`@RequirePlatformAdmin` metadata via Member + AdminProfile lookups. `RequestUser.role` removed (single source of truth = Member.role). 11 unit tests cover every branch.
Per-aggregate ownership checks	YELLOW (Phase A1+A2)	`x-actor-id` header rejected with 400 by `TenantContextMiddleware`; `actorId` sourced from session. Audit log no longer poisonable. EWA/Lending temporarily restricted to admin/owner pending `Employee.userId` schema link (Phase B).
Tenant scoping on Business + Employee CRUD	GREEN (Phase A2)	Both controllers under `SessionMiddleware + TenantContextMiddleware`. Service-layer methods take `tenantId` as first param; `EmployeeController` force-sets `businessId` on writes; body mismatch → 403 `TENANT_MISMATCH`.
Tenant scoping on financial tables	GREEN	RLS + FORCE RLS + verify-guard.
Outbox publisher BYPASSRLS role	YELLOW	Gated on env `OUTBOX_DATABASE_URL`. Without it, the publisher drains zero rows and does not fail; only a loud warning signals the problem.
Service-to-service auth (API → gateway, API → ledger)	RED	gRPC calls are plaintext, no mTLS, no token auth.
Webhook signature verification (HMAC)	GREEN	LIVE, 5-min clock skew, timing-safe equal.
Audit-log immutability	YELLOW	No triggers block UPDATE/DELETE; app code respects immutability by convention, so it is not structurally guaranteed.
Audit-log poisonability	GREEN (Phase A1)	`x-actor-id` rejected with 400; actorId from session only.

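Before Phase A this section needed the most urgent attention: the three BLACK items were 1-to-3-day fixes individually and were blocking the pilot. Phase A closed them; the remaining YELLOW items are listed in the headline above.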
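The "force-set `businessId` on writes; body mismatch → 403 `TENANT_MISMATCH`" behaviour in the table can be illustrated with a small pure function. This is a sketch under assumptions, not the `EmployeeController` code: the function and type names are hypothetical, and it assumes the session's tenant ID and the employee's `businessId` are the same identifier.

```typescript
// Hypothetical names. Only the 403 / TENANT_MISMATCH outcome comes from the audit table.
interface EmployeeWriteBody {
  businessId?: string;
  fullName: string;
}

type ScopedWrite =
  | { ok: true; body: EmployeeWriteBody & { businessId: string } }
  | { ok: false; status: 403; code: "TENANT_MISMATCH" };

/**
 * Rejects a body that names a different tenant, and otherwise overwrites
 * `businessId` with the session's tenant so the client value is never trusted.
 */
function scopeToTenant(sessionTenantId: string, body: EmployeeWriteBody): ScopedWrite {
  if (body.businessId !== undefined && body.businessId !== sessionTenantId) {
    return { ok: false, status: 403, code: "TENANT_MISMATCH" };
  }
  return { ok: true, body: { ...body, businessId: sessionTenantId } };
}
```

Rejecting the mismatch explicitly, rather than silently overwriting it, gives operators a signal when a client is probing other tenants.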

§6. Incident Response + Operations — RED

Headline: four incident runbooks exist. Almost everything else doesn't.

Capability	Score	Notes
Go-live runbooks (webhook, gateway-down, drift, statement-parse)	GREEN	Written, opinionated, with diagnosis branches.
Post-go-live runbook backlog	RED	5+ runbooks scoped in `runbooks/README.md`; not written.
Routine-process runbooks (daily recon triage, dispute intake, reversal procedure)	RED	None. Reconciliation arch doc enumerates 8+ missing runbooks.
Paging integration (PagerDuty / Opsgenie)	RED	None.
On-call rota documented	RED	None.
Incident commander / scribe roles	RED	Undefined.
Postmortem template + cadence	RED	None.
Status page (customer-facing)	RED	None.
Customer communication templates	RED	None.
Internal admin tooling (replay webhook, force-resync, view account)	RED	None. Operators run psql against three databases.
Support ticketing integration	RED	No issue tracker linked.
Fraud-response playbook	RED	None.
AML/sanctions-hit playbook	RED	None.

Implication: the first real incident on a real rail will be triaged by someone improvising. Every minute of that improvisation is brand damage with the partner bank.

§7. Risk + Fraud + Compliance — RED

Headline: the platform is one regulator conversation away from a difficult question for which there is no answer.

Capability	Score	Notes
KYC (identity verification, document capture, liveness)	RED	`packages/kyc/` does not exist.
Sanctions screening pre-disbursement	RED	None.
AML transaction monitoring	RED	None.
Suspicious-pattern detection	RED	None.
Velocity limits per (employee, employer, period)	RED	EWA has eligibility math; no velocity ceiling.
Device fingerprinting	RED	None.
Behavioural analytics	RED	None.
Per-partner exposure limits	RED	None.
Daily-volume circuit breaker	RED	None.
Regulatory reporting (NBE monthly returns, SAR/STR)	RED	None.
Right-to-be-forgotten / GDPR-style data export	RED	None.
Data-residency controls	RED	None — PII goes wherever Postgres goes.
ADR-014 ("DemozPay is orchestrator, not custodian")	RED	Recommended in action plan; not written.

Implication: pilot conversations with NBE or with a partner bank's compliance team will ask for each row above. "Planned" is a roadmap claim, not a satisfactory answer.
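As an illustration of the missing velocity ceiling per (employee, employer, period), the sketch below keeps a sliding window of recent disbursements and refuses a request that would exceed either a count or an amount limit. Everything here is an assumption for illustration: the policy fields, the limits, and the in-memory history (a real implementation would need shared, durable state). Amounts are integer santim.

```typescript
// Hypothetical policy. Limits, window length, and identifiers are illustrative only.
interface VelocityPolicy {
  windowMs: number; // length of the sliding window
  maxCount: number; // max disbursements per (employee, employer) within the window
  maxAmountSantim: number; // max total amount within the window, integer minor units
}

interface PastDisbursement {
  atMs: number;
  amountSantim: number;
}

class VelocityLimiter {
  private history = new Map<string, PastDisbursement[]>();
  private policy: VelocityPolicy;

  constructor(policy: VelocityPolicy) {
    this.policy = policy;
  }

  /** Records and allows the request, or returns false without recording it. */
  tryDisburse(
    employeeId: string,
    employerId: string,
    amountSantim: number,
    nowMs: number,
  ): boolean {
    const key = `${employerId}:${employeeId}`;
    // Keep only disbursements that are still inside the window.
    const recent = (this.history.get(key) ?? []).filter(
      (d) => nowMs - d.atMs < this.policy.windowMs,
    );
    const spent = recent.reduce((sum, d) => sum + d.amountSantim, 0);
    const allowed =
      recent.length < this.policy.maxCount &&
      spent + amountSantim <= this.policy.maxAmountSantim;
    if (allowed) recent.push({ atMs: nowMs, amountSantim });
    this.history.set(key, recent);
    return allowed;
  }
}
```

A check like this sits alongside the existing EWA eligibility math rather than replacing it: eligibility answers how much an employee may draw, while the velocity limit bounds how fast.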

§8. Scale + Performance — UNKNOWN (untested)

We have no production traffic numbers, no load test, no benchmark.

Capability	Score	Notes
Load test (sustained 10x expected pilot traffic)	RED	Never run.
Stress test (find the breaking point)	RED	Never run.
Spike test (10x in 60s)	RED	Never run.
Soak test (24h sustained)	RED	Reconciliation soak (S4.6) is a logic soak, not a load soak.
Postgres saturation profile	RED	Unknown.
gRPC connection pool tuning	RED	Defaults.
HTTP keep-alive / connection-reuse tuning	RED	Defaults.
CDN / edge caching	RED	None.

Implication: the first 100-employer pilot may surprise us in either direction. Build a load test BEFORE the pilot, not during it.

§9. Service-by-service score card

Service	Today's score	Why
`services/ledger`	YELLOW	LIVE primitive; metrics added in Phase D (see §1); no replicas, no PITR.
`services/integration-gateway`	YELLOW (improved from BLACK in §10 by Phase C)	LIVE happy path; `LookupAccount` LIVE for existence-check + typed-failure taxonomy + metrics surface (Phase C, 2026-05-29). Name-match deferred; circuit-breaker / retry policy missing.
`services/notifications`	RED	Stub — `/health` only.
`services/bank-sandbox`	GREEN (as a test harness)	Excellent for what it is. Not for production.
`apps/api`	YELLOW	LIVE for EWA + lending happy path; AuthZ moved from BLACK to YELLOW in Phase A (see §5).
`apps/admin-web`	RED	Mock-only.
`apps/employer-web`	RED	Mock-only. The highest-value frontend.
`apps/employee-web`	BLACK	Mock-only with `localStorage`-based fake auth. Looks real, is theatre.
`apps/fi-web`	YELLOW	Calls API; not end-to-end verified.
`apps/merchant-web`	YELLOW	Calls API; not end-to-end verified.
`apps/docs-web`	RED	Empty template.

§10. The five-axis risk register

Ranked by likelihood × blast-radius, not by code surface.

Highest production risks

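~~AuthZ bypass via x-actor-id~~ CLOSED (Phase A, 2026-05-29): previously any logged-in user could act as anyone else. `x-actor-id` is now rejected with 400 and `actorId` comes from the session only (§5). Was BLACK.
~~Cross-tenant business/employee enumeration~~ CLOSED (Phase A, 2026-05-29): previously any logged-in user could read every employer's roster. Business + Employee CRUD is now tenant-scoped (§5). Was BLACK.
~~PII in logs~~ CLOSED (Phase A, 2026-05-29): previously every error could log national IDs in plain text. Redaction now happens at the logging layer (§1). Was BLACK.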
~~LookupAccount stub~~ PARTIAL-CLOSED (Phase C, 2026-05-29): the existence check ships LIVE, so a disburse to a NON-EXISTENT account is now structurally impossible (the use case rejects with 409 before any ledger / partner side effect). Name-match (a different but existing account) remains Phase C continuation work: the adapter returns the resolved name, but the use case doesn't compare it because the DTO carries no expected-name field. Blast radius reduced from unrecoverable to operator-actionable. YELLOW (was BLACK).
No alerts on existing metrics — webhook failure, drift, gateway down all surface as silent counter bumps. Nobody is paged. 2-week fix (Grafana + Alertmanager + paging). RED

Highest operational risks

No daily reconciliation cadence. Drift detection primitive exists; nothing runs it. RED
No incident paging. Runbooks exist; nothing fires them. RED
No admin tooling. Operators run psql against three databases. RED
No partner key-rotation runbook. First rotation will be ad-hoc. RED
No backups + no PITR. First disk failure is total. RED

Highest financial risks

EWA cannot be repaid — disbursed obligations sit forever; month-end will surface millions of santim in unexplained receivables. BLACK
Lending repayment is admin-driven — at scale a human cannot run thousands of installments. Without payroll, lending cannot grow past pilot. RED
PayrollClearing → FI remittance does not happen — even when repayment is recorded, funds don't return to the FI. The phantom-asset will catch a regulator's eye. RED
No velocity / daily ceiling per partner — a misconfiguration could initiate 10,000 disbursements before anyone notices. RED
Idempotency-record TTL absent — quiet performance death over 6–12 months. YELLOW

Highest reconciliation risks

No daily cadence + no dashboard — drift detection runs only when invoked manually. RED
No statement-pull automation — humans drop files. A skipped day is invisible until drift accumulates. RED
Partial-settlement event handler missing — bank deducts a correspondent fee; our reconciliation thinks settled-amount = initiated-amount; drift appears. RED
Period-boundary timing rule unenforced — "don't finalize day N until day N+1's statement arrives" is a documented rule with no code. YELLOW
Match-rate / flagged-rate not dashboarded — finance/ops fly blind. RED
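The partial-settlement gap above comes down to a missing third outcome between "matched" and "drift". A minimal sketch of that classification follows. The type names and the fee tolerance are hypothetical, and amounts are integer santim; the real handler would also need to post the fee to the ledger.

```typescript
// Hypothetical classification. The fee tolerance and field names are assumptions.
type SettlementMatch =
  | { kind: "exact" }
  | { kind: "partial"; feeSantim: number }
  | { kind: "drift"; deltaSantim: number };

/**
 * Compares the amount we initiated with the amount the statement says settled.
 * A shortfall within `maxFeeSantim` is treated as a correspondent fee, not drift.
 * Anything else (a larger shortfall, or an over-settlement) is flagged as drift.
 */
function classifySettlement(
  initiatedSantim: number,
  settledSantim: number,
  maxFeeSantim: number,
): SettlementMatch {
  const delta = initiatedSantim - settledSantim;
  if (delta === 0) return { kind: "exact" };
  if (delta > 0 && delta <= maxFeeSantim) {
    return { kind: "partial", feeSantim: delta };
  }
  return { kind: "drift", deltaSantim: delta };
}
```

Keeping the fee as an explicit `partial` result, rather than widening the match tolerance, means fees stay visible to finance while genuine drift still raises a flag.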

Highest fraud risks

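~~x-actor-id poisoning~~ CLOSED (Phase A, 2026-05-29): see AuthZ (§5). `actorId` now comes from the session only, so the header can no longer poison audit logs. Was BLACK.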
No sanctions screening — a partner-bank compliance hit on a transfer we made is a partnership-ending event. RED
No velocity ceilings — single employee can extract maximum-eligible EWA in a single second. RED
No device fingerprinting / no behavioural analytics — first credential-stuffing attack is undetectable. RED
No fraud-response playbook — when fraud lands, response is improvised. RED

§11. The honest top-line

Updated 2026-05-29 after Phase A close-out:

Out of 8 readiness sections:

GREEN: 0
YELLOW: 3 (observability, DB, AuthN/Z; the last promoted from BLACK by Phase A)
RED: 4 (deploy, secrets, incident, risk/fraud/compliance)
BLACK: 0 at section level. Phase A closed the three BLACK items it targeted (x-actor-id, cross-tenant CRUD, PII redaction); individual BLACK rows remain in §3, §4, §9 and §10.
UNKNOWN: 1 (scale)

The platform is less production-dangerous than before Phase A but still not production-ready. The next dial-moves are Phase B (money correctness — EWA repayment + FI remittance) and Phase D (operational visibility — alerting + recon cadence). 90_DAY_EXECUTION_PLAN.md sequences the rest.

§12. Cross-references

What blocks go-live → GO_LIVE_BLOCKERS.md.
How we fix this in 90 days → 90_DAY_EXECUTION_PLAN.md.
Per-domain completeness → DOMAIN_COMPLETENESS_MATRIX.md.
