DemozPay — Production Readiness Audit

Snapshot: 2026-05-29
Author: Principal Architect / Acting CTO
Companion to: REAL_SYSTEM_STATE.md, GO_LIVE_BLOCKERS.md
Scope: the operational surface (deploy, observe, recover, secure, support). Code correctness is covered elsewhere.

Scoring rubric

For each capability, four-bucket scoring:

| Score | Meaning |
| --- | --- |
| GREEN | Production-acceptable today. Real evidence; passes a vendor audit. |
| YELLOW | Works in dev / a single happy path. Will not survive partial failure, scale, or a real incident. |
| RED | Not implemented. A first-week production incident will surface this. |
| BLACK | Implemented in a way that creates incident risk. Worse than absent. |

The summary at the top of every section is the honest score; the table below it explains why.


§1. Observability — YELLOW (improved by Phase C + D)

Headline (updated Phase D, 2026-05-29): the metric surface is now complete across API, gateway, and ledger. Alert rules and SLOs are codified. Deployment of Prometheus, Alertmanager, and a paging provider remains PLANNED; that gap is operational, not a code gap.

| Capability | Score | Notes |
| --- | --- | --- |
| Application metrics (Prometheus, NestJS) | GREEN | 10 cardinality-disciplined metrics. |
| Go-service metrics (ledger, gateway) | GREEN (Phase C + D) | gateway: lookup_* metrics from Phase C. ledger: rpc_requests_total{rpc,outcome} + rpc_latency_seconds{rpc} histogram + transaction-status-transitions + reconcile-drift gauge + entries-posted counter. Interceptor unit-tested. |
| Structured logging (Pino on TS, slog on Go) | GREEN | JSON, trace_id-injected. |
| PII redaction at logging layer | GREEN (Phase A3) | Pino redact config + slog ReplaceAttr. Correlation IDs preserved. |
| Request-ID propagation | YELLOW (improved Phase C) | Phase C added end-to-end correlation_id on LookupAccount. Generalising to every RPC is a follow-up. |
| OpenTelemetry tracing | YELLOW | SDK wired; no collector deployed. |
| Prometheus scrape config | GREEN (Phase D) | infra/prometheus/prometheus.yml reference template; loads alerts.yml. |
| Alert rules (Prometheus / Grafana) | YELLOW (Phase D) | infra/prometheus/alerts.yml LIVE as code with 15 rules covering every metric and 6 runbook links. Not yet loaded by a running Prometheus. |
| Dashboards (Grafana / Datadog / etc.) | RED | None in the repo. |
| SLOs and error budgets | GREEN (Phase D) | docs/architecture/SLOS_AND_ALERTING.md codifies every SLO + error budget + escalation path. |
| Long-term log retention | RED | No log shipper configured. |
| Trace retention | RED | No exporter destination. |

What changed in Phase D: the gap moved from "metrics exist, nothing fires on them" to "metrics, rules, and SLOs are written, awaiting deployment". The deployment cost is well-bounded (~5-10 days of platform-team work) and pre-reviewed.
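The error-budget idea behind the SLO row reduces to simple arithmetic. The following is a minimal sketch only: the 99.9% target and the `SloWindow` shape are assumptions chosen for illustration, not values taken from docs/architecture/SLOS_AND_ALERTING.md.

```typescript
// Hypothetical SLO arithmetic. The shape and any target passed in are
// illustrative assumptions, not DemozPay's codified SLOs.
interface SloWindow {
  target: number;         // e.g. 0.999 means 99.9% of requests must succeed
  totalRequests: number;  // requests observed in the window
  failedRequests: number; // failed requests observed in the window
}

// Number of failures the window can absorb before the SLO is breached.
// The small epsilon guards against floating-point results like 999.9999999.
function allowedFailures(w: SloWindow): number {
  return Math.floor(w.totalRequests * (1 - w.target) + 1e-9);
}

// Fraction of the error budget still unspent.
// 1 means untouched, 0 means exhausted, negative means the SLO is breached.
function budgetRemaining(w: SloWindow): number {
  const allowed = allowedFailures(w);
  if (allowed === 0) {
    return w.failedRequests === 0 ? 1 : -Infinity;
  }
  return 1 - w.failedRequests / allowed;
}
```

An alert rule would typically fire on how fast this fraction shrinks rather than on its absolute value, which is why the rules need a running Prometheus to be useful.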


§2. Deployment + Release Engineering — RED

Headline: there is no production deployment story. docker-compose.yml runs on a developer laptop. That's the deploy surface.

| Capability | Score | Notes |
| --- | --- | --- |
| Container images | YELLOW | apps/api/Dockerfile rebuilt to work: pinned pnpm@8.15.9, tini, openssl. Go service images implicitly built by compose. No images in a registry. |
| Kubernetes manifests / Helm charts | RED | None in infra/. |
| Terraform / Pulumi / CDK | RED | None. |
| Service-mesh / ingress controller config | RED | None. |
| TLS at the edge | RED | Local dev is HTTP only. SECURITY_CONTROLS.md §A.1 lists TLS 1.3 as Planned. |
| Blue-green / canary strategy | RED | No deploy pipeline. |
| Rollback strategy | RED | Implicitly "redeploy previous image"; no documented procedure. |
| Migration safety (forward-compatible, never-down) | YELLOW | Migrations apply clean; ALTER TYPE ADD VALUE is forward-safe; no prisma migrate diff check in CI; no pg_repack or low-lock strategy documented. |
| Migration runner in CI | RED | Migrations run by hand. |
| Health-check based readiness gating | YELLOW | /healthz + /readyz exist; no k8s probe config because there's no k8s. |
| CI: lint + test + build + e2e | GREEN | .github/workflows/ci.yml. |
| CI: SAST + dep-vuln scanning | RED | No Snyk / Dependabot / OSV-scanner. |
| CI: secret-leak scanning | RED | No gitleaks / trufflehog. |
| CI: container image scanning | RED | No Trivy / grype. |
| Artifact signing (cosign / Sigstore) | RED | No signed releases. |
| SBOM generation | RED | None. |

Implication: going from "verify-s4-recon.sh passes" to "deployed to staging that mimics production" is a 4-to-6 week dedicated workstream with one platform engineer. Treat that as the schedule.
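The migration-safety row notes that nothing in CI checks migrations for risky statements. A minimal sketch of what such a check could look like is below. The rule set, rule names, and the `lintMigration` helper are all hypothetical, and the naive split on `;` would mis-handle semicolons inside string literals or function bodies.

```typescript
// Hypothetical migration lint: flags statements that commonly take long locks
// on PostgreSQL or break the previously deployed release.
interface Finding {
  rule: string;
  statement: string;
}

const RULES: { rule: string; matches: (sql: string) => boolean }[] = [
  {
    // A plain CREATE INDEX blocks writes to the table while it builds.
    rule: "index-not-concurrent",
    matches: (s) =>
      /^create\s+(unique\s+)?index\b/i.test(s) && !/\bconcurrently\b/i.test(s),
  },
  // Dropping a column or table breaks any older release that still reads it.
  { rule: "drop-column", matches: (s) => /\bdrop\s+column\b/i.test(s) },
  { rule: "drop-table", matches: (s) => /^drop\s+table\b/i.test(s) },
];

// Splits the migration into statements and returns every rule violation found.
function lintMigration(sql: string): Finding[] {
  const statements = sql
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const findings: Finding[] = [];
  for (const statement of statements) {
    for (const { rule, matches } of RULES) {
      if (matches(statement)) {
        findings.push({ rule, statement });
      }
    }
  }
  return findings;
}
```

A CI step could fail the build whenever `lintMigration` returns a non-empty list, with an explicit override for reviewed exceptions.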


§3. Database + Storage — YELLOW

Headline: schemas are good. Operations around them are absent.

| Capability | Score | Notes |
| --- | --- | --- |
| Schema correctness (RLS, FORCE RLS, verify-guard, NUMERIC(20,0), append-only triggers) | GREEN | ADR-013 + ADR-005 + ADR-009; all proven at the migration layer. |
| Connection pooling | YELLOW | API uses Prisma defaults; Go services use pgxpool with default 25. No pgbouncer. |
| Read replicas + read-routing | RED | Single primary assumed everywhere. |
| Backups (point-in-time recovery) | RED | None documented. No wal-g / pgBackRest / cloud-native PITR. |
| Restore drills | RED | Never tested. |
| Disaster recovery (off-region replica) | RED | None. |
| Database upgrades (PG16 → PG17) | RED | No upgrade plan. |
| Migration safety (concurrent index, transactional DDL) | YELLOW | Migrations are forward-compatible by convention; no checker enforces this. |
| Slow-query observability | RED | No pg_stat_statements config; no slow-query log shipping. |
| Locking + deadlock observability | RED | None. |
| Tenant DB-resource quotas | RED | None; one noisy tenant can drown others. |
| Idempotency-record TTL cleanup | BLACK | No DELETE policy. Table will reach hundreds of millions of rows. Index bloat will silently degrade every money-moving POST. |
| Audit-entry TTL / archival | RED | No policy. Same growth concern (smaller per-event but unbounded). |

Implication: the first real production database load will surface the absence of pooling, replicas, backups, and TTLs. Plan the operational database surface in parallel with the platform itself.
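For the idempotency-record row, a TTL policy usually means deleting expired records in small batches so the cleanup itself never holds long locks. The sketch below shows only the selection logic over an in-memory list. The record shape, the TTL, and the batch size are assumptions; a real job would push the same filter into a bounded SQL DELETE.

```typescript
// Hypothetical record shape; the real table and column names are not shown in this audit.
interface IdempotencyRecord {
  key: string;
  createdAt: Date;
}

// Returns at most batchSize keys whose age exceeds ttlMs, oldest first.
// The input array is not mutated (filter returns a fresh array before sort).
function selectExpired(
  records: IdempotencyRecord[],
  now: Date,
  ttlMs: number,
  batchSize: number,
): string[] {
  const cutoff = now.getTime() - ttlMs;
  return records
    .filter((r) => r.createdAt.getTime() < cutoff)
    .sort((a, b) => a.createdAt.getTime() - b.createdAt.getTime())
    .slice(0, batchSize)
    .map((r) => r.key);
}
```

Running this repeatedly until it returns an empty list drains the backlog in bounded steps. The TTL has to be longer than the longest window in which a client may legitimately retry a request.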


§4. Secrets + Key Management — RED

Headline: every secret is an env var. There is no rotation story.

| Capability | Score | Notes |
| --- | --- | --- |
| Secrets in env vars | YELLOW | LIVE in dev; production-unacceptable. |
| Vault / AWS SM / GCP SM / Azure KV | RED | None integrated. |
| Per-environment secret separation | RED | Single .env shape across dev/staging/prod assumed. |
| Bank HMAC signing key rotation | BLACK | No rotation runbook; no zero-downtime rotation (boot reads env once). |
| TLS cert lifecycle (request/renew/install) | RED | No certs in use. |
| Per-tenant data encryption keys (DEKs) | RED | SECURITY_CONTROLS.md §A.2: Planned. |
| HSM-backed signing for ledger entries | RED | Planned. |
| Audit-log encryption | RED | Planned. |
| Pre-commit secret-leak hook | RED | None. |

Implication: the first quarterly key rotation will require an API restart and a coordinated handoff with the partner bank. Build a rotation runbook before the first key is shared.
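Zero-downtime rotation generally needs two things the current boot-time env read does not provide: more than one key valid at once, and a way to swap keys without a restart. A minimal sketch of that shape follows. `Keyring`, `KeyVersion`, and the injected `mac` function are hypothetical names; `mac` stands in for HMAC-SHA256 so the example stays dependency-free.

```typescript
// mac is injected to keep the sketch dependency-free; in practice it would be HMAC-SHA256.
type Mac = (secret: string, payload: string) => string;

interface KeyVersion {
  id: string;
  secret: string;
  notBefore: number; // epoch ms, inclusive
  notAfter: number;  // epoch ms, exclusive
}

class Keyring {
  private keys: KeyVersion[];
  private readonly mac: Mac;

  constructor(keys: KeyVersion[], mac: Mac) {
    this.keys = keys;
    this.mac = mac;
  }

  // Called by a reload hook (timer, signal, secret-store watch) instead of only at boot.
  replace(keys: KeyVersion[]): void {
    this.keys = keys;
  }

  private active(now: number): KeyVersion[] {
    return this.keys.filter((k) => k.notBefore <= now && now < k.notAfter);
  }

  // Sign with the newest key that is currently valid.
  sign(payload: string, now: number): { keyId: string; signature: string } {
    const newest = this.active(now).sort((a, b) => b.notBefore - a.notBefore)[0];
    if (!newest) throw new Error("no active signing key");
    return { keyId: newest.id, signature: this.mac(newest.secret, payload) };
  }

  // Accept any currently valid key, so old and new keys overlap during rotation.
  // Real code should compare signatures in constant time rather than with ===.
  verify(payload: string, signature: string, now: number): boolean {
    return this.active(now).some((k) => this.mac(k.secret, payload) === signature);
  }
}
```

The overlap window is what lets the partner bank and the API switch keys at different moments, which is the coordination a rotation runbook would need to spell out.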


§5. Authentication + Authorization — YELLOW (was BLACK pre-Phase A)

Headline: Phase A (2026-05-29) closed all three BLACK items. AuthZ now has real enforcement at three layers (guard + service + body force-set). Section moves from BLACK → YELLOW. Still YELLOW because: per-aggregate Employee ownership defers to Phase B; OrgRoleGuard does an uncached DB lookup per RBAC-protected route; platform-admin seeding is operator-driven; HTTP-E2E supertest harness pending.

| Capability | Score | Notes |
| --- | --- | --- |
| better-auth email + password sign-in | GREEN | LIVE. |
| Email verification gate | YELLOW | Skipped in dev; enforced only when NODE_ENV=production. |
| Phone OTP plugin | YELLOW | Plugin wired; SMS sender is LoggingSmsSender, which logs to stdout. In production this means OTP is never sent. |
| TOTP / WebAuthn 2FA | RED | twoFactor table exists; plugin not wired. |
| Session storage | GREEN | Prisma Session table. |
| Global AuthGuard | GREEN | APP_GUARD enforces req.user.id. |
| RBAC (role → permission → endpoint) | GREEN (Phase A2) | OrgRoleGuard enforces @RequireOrgRole/@RequirePlatformAdmin metadata via Member + AdminProfile lookups. RequestUser.role removed (single source of truth = Member.role). 11 unit tests cover every branch. |
| Per-aggregate ownership checks | YELLOW (Phase A1+A2) | x-actor-id header rejected with 400 by TenantContextMiddleware; actorId sourced from session. Audit log no longer poisonable. EWA/Lending temporarily restricted to admin/owner pending Employee.userId schema link (Phase B). |
| Tenant scoping on Business + Employee CRUD | GREEN (Phase A2) | Both controllers under SessionMiddleware + TenantContextMiddleware. Service-layer methods take tenantId as first param; EmployeeController force-sets businessId on writes; body mismatch → 403 TENANT_MISMATCH. |
| Tenant scoping on financial tables | GREEN | RLS + FORCE RLS + verify-guard. |
| Outbox publisher BYPASSRLS role | YELLOW | Gated on env OUTBOX_DATABASE_URL. Without it, the publisher drains zero rows without failing (a loud warning is logged). |
| Service-to-service auth (API → gateway, API → ledger) | RED | gRPC calls are plaintext, no mTLS, no token auth. |
| Webhook signature verification (HMAC) | GREEN | LIVE, 5-min clock skew, timing-safe equal. |
| Audit-log immutability | YELLOW | No UPDATE/DELETE triggers; app code respects it; not structurally guaranteed. |
| Audit-log poisonability | GREEN (Phase A1) | x-actor-id rejected with 400; actorId from session only. |

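Before Phase A this section needed the most urgent attention: the three BLACK items were 1-to-3-day fixes individually and were blocking pilot. They are now closed; the remaining work is the YELLOW items named in the headline.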
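The force-set behaviour described for EmployeeController can be sketched as a small pure function. This is an illustration under assumptions, not the controller's actual code: `forceTenant`, `EmployeeWriteBody`, and `TenantMismatchError` are hypothetical names, and the sketch assumes the session tenant id is the business id.

```typescript
// Hypothetical error type carrying the 403 TENANT_MISMATCH outcome named in the table.
class TenantMismatchError extends Error {
  readonly status = 403;
  readonly code = "TENANT_MISMATCH";

  constructor() {
    super("businessId in body does not match the session tenant");
  }
}

// Hypothetical write DTO; only businessId matters for this sketch.
interface EmployeeWriteBody {
  businessId?: string;
  fullName: string;
}

// tenantId comes from the session, never from the request body or headers.
// A conflicting businessId is rejected; a missing one is filled in.
function forceTenant(
  tenantId: string,
  body: EmployeeWriteBody,
): EmployeeWriteBody & { businessId: string } {
  if (body.businessId !== undefined && body.businessId !== tenantId) {
    throw new TenantMismatchError();
  }
  return { ...body, businessId: tenantId };
}
```

Rejecting a mismatch instead of silently overwriting it makes a misbehaving or malicious client visible in logs and metrics.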


§6. Incident Response + Operations — RED

Headline: four incident runbooks exist. Almost everything else doesn't.

| Capability | Score | Notes |
| --- | --- | --- |
| Go-live runbooks (webhook, gateway-down, drift, statement-parse) | GREEN | Written, opinionated, with diagnosis branches. |
| Post-go-live runbook backlog | RED | 5+ runbooks scoped in runbooks/README.md; not written. |
| Routine-process runbooks (daily recon triage, dispute intake, reversal procedure) | RED | None. Reconciliation arch doc enumerates 8+ missing runbooks. |
| Paging integration (PagerDuty / Opsgenie) | RED | None. |
| On-call rota documented | RED | None. |
| Incident commander / scribe roles | RED | Undefined. |
| Postmortem template + cadence | RED | None. |
| Status page (customer-facing) | RED | None. |
| Customer communication templates | RED | None. |
| Internal admin tooling (replay webhook, force-resync, view account) | RED | None. Operators run psql against three databases. |
| Support ticketing integration | RED | No issue tracker linked. |
| Fraud-response playbook | RED | None. |
| AML/sanctions-hit playbook | RED | None. |

Implication: the first real incident on a real rail will be triaged by someone improvising. Every minute of that improvisation is brand damage with the partner bank.


§7. Risk + Fraud + Compliance — RED

Headline: the platform is one regulator conversation away from a difficult question for which there is no answer.

| Capability | Score | Notes |
| --- | --- | --- |
| KYC (identity verification, document capture, liveness) | RED | packages/kyc/ does not exist. |
| Sanctions screening pre-disbursement | RED | None. |
| AML transaction monitoring | RED | None. |
| Suspicious-pattern detection | RED | None. |
| Velocity limits per (employee, employer, period) | RED | EWA has eligibility math; no velocity ceiling. |
| Device fingerprinting | RED | None. |
| Behavioural analytics | RED | None. |
| Per-partner exposure limits | RED | None. |
| Daily-volume circuit breaker | RED | None. |
| Regulatory reporting (NBE monthly returns, SAR/STR) | RED | None. |
| Right-to-be-forgotten / GDPR-style data export | RED | None. |
| Data-residency controls | RED | None; PII goes wherever Postgres goes. |
| ADR-014 ("DemozPay is orchestrator, not custodian") | RED | Recommended in action plan; not written. |

Implication: pilot conversations with NBE or with a partner bank's compliance team will ask for each row above. "Planned" is a roadmap claim, not a satisfactory answer.
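Of the rows above, the velocity limit per (employee, employer, period) is the most mechanical to illustrate. A minimal sliding-window sketch follows. `VelocityLimiter` and `VelocityRule` are hypothetical names, the limits are placeholders, and the in-memory map would have to become shared storage once the API runs on more than one instance.

```typescript
// Hypothetical rule: at most maxRequests per rolling windowMs for one employee at one employer.
interface VelocityRule {
  maxRequests: number;
  windowMs: number;
}

class VelocityLimiter {
  // key -> timestamps (epoch ms) of accepted requests still inside the window
  private readonly seen = new Map<string, number[]>();
  private readonly rule: VelocityRule;

  constructor(rule: VelocityRule) {
    this.rule = rule;
  }

  // Returns true and records the request if it is within the ceiling;
  // returns false without recording it otherwise.
  tryAcquire(employeeId: string, employerId: string, now: number): boolean {
    const key = `${employerId}:${employeeId}`;
    const recent = (this.seen.get(key) ?? []).filter(
      (t) => now - t < this.rule.windowMs,
    );
    if (recent.length >= this.rule.maxRequests) {
      this.seen.set(key, recent);
      return false;
    }
    recent.push(now);
    this.seen.set(key, recent);
    return true;
  }
}
```

The check would sit in front of the existing EWA eligibility math: eligibility answers "how much", while the limiter answers "how often".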


§8. Scale + Performance — UNKNOWN (untested)

We have no production traffic numbers, no load test, no benchmark.

| Capability | Score | Notes |
| --- | --- | --- |
| Load test (sustained 10x expected pilot traffic) | RED | Never run. |
| Stress test (find the breaking point) | RED | Never run. |
| Spike test (10x in 60s) | RED | Never run. |
| Soak test (24h sustained) | RED | Reconciliation soak (S4.6) is a logic soak, not a load soak. |
| Postgres saturation profile | RED | Unknown. |
| gRPC connection pool tuning | RED | Defaults. |
| HTTP keep-alive / connection-reuse tuning | RED | Defaults. |
| CDN / edge caching | RED | None. |

Implication: the first 100-employer pilot may surprise us in either direction. Build a load test BEFORE the pilot, not during it.


§9. Service-by-service score card

| Service | Today's score | Why |
| --- | --- | --- |
| services/ledger | YELLOW | LIVE primitive; metrics added in Phase D (see §1); no replicas, no PITR. |
| services/integration-gateway | YELLOW (improved from BLACK in §10 by Phase C) | LIVE happy path; LookupAccount LIVE for existence-check + typed-failure taxonomy + metrics surface (Phase C, 2026-05-29). Name-match deferred; circuit-breaker / retry policy missing. |
| services/notifications | RED | Stub; /health only. |
| services/bank-sandbox | GREEN (as a test harness) | Excellent for what it is. Not for production. |
| apps/api | YELLOW | LIVE for EWA + lending happy path; AuthZ moved from BLACK to YELLOW in Phase A (see §5). |
| apps/admin-web | RED | Mock-only. |
| apps/employer-web | RED | Mock-only. The highest-value frontend. |
| apps/employee-web | BLACK | Mock-only with localStorage-based fake auth. Looks real, is theatre. |
| apps/fi-web | YELLOW | Calls API; not end-to-end verified. |
| apps/merchant-web | YELLOW | Calls API; not end-to-end verified. |
| apps/docs-web | RED | Empty template. |

§10. The five-axis risk register

Ranked by likelihood × blast-radius, not by code surface.

Highest production risks

  1. AuthZ bypass via x-actor-id: CLOSED (Phase A1, 2026-05-29). Previously any logged-in user could act as anyone else; the header is now rejected with 400 and actorId comes from the session only. GREEN (was BLACK).
  2. Cross-tenant business/employee enumeration: CLOSED (Phase A2, 2026-05-29). Previously any logged-in user could read every employer's roster; both controllers are now tenant-scoped. GREEN (was BLACK).
  3. PII in logs: CLOSED (Phase A3, 2026-05-29). Previously any error could log national IDs in plain text; redaction now runs at the logging layer. GREEN (was BLACK).
  4. LookupAccount stub: PARTIAL-CLOSED (Phase C, 2026-05-29). The existence check ships LIVE: a disburse to a NON-EXISTENT account is now structurally impossible (the use case rejects with 409 before any ledger / partner side effect). Name-match (a different existing account) remains as Phase C continuation work: the adapter returns the resolved name, but the use case doesn't compare it because the DTO carries no expected-name field. Blast radius reduced from unrecoverable to operator-actionable. YELLOW (was BLACK).
  5. No alerts firing on existing metrics: webhook failure, drift, and gateway down all surface as silent counter bumps, and nobody is paged. Phase D wrote the alert rules and SLOs as code, but no running Prometheus loads them yet. 2-week fix (Grafana + Alertmanager + paging). RED
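For item 4, the missing name-match step could look roughly like the sketch below. It is purely illustrative: the source states that the DTO carries no expected-name field today, so `expectedName` is a hypothetical input, and the strict token-set comparison is an assumption that a real rollout would likely need to relax for transliteration and spelling variants.

```typescript
// Lowercases, strips common punctuation, and collapses whitespace.
function normalizeName(name: string): string {
  return name
    .normalize("NFKC")
    .toLowerCase()
    .replace(/[.,'\-]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

type NameMatch = "MATCH" | "MISMATCH";

// expectedName is hypothetical (not in today's DTO); resolvedName is what the adapter returns.
// Token order is ignored so "Kebede, Abebe" matches "Abebe Kebede".
function compareAccountName(expectedName: string, resolvedName: string): NameMatch {
  const expected = normalizeName(expectedName).split(" ").sort().join(" ");
  const resolved = normalizeName(resolvedName).split(" ").sort().join(" ");
  if (expected.length === 0 || resolved.length === 0) {
    return "MISMATCH"; // never treat two empty names as a match
  }
  return expected === resolved ? "MATCH" : "MISMATCH";
}
```

A MISMATCH would most naturally route to the same operator-actionable path the existence check already uses, rather than failing silently.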

Highest operational risks

  1. No daily reconciliation cadence. Drift detection primitive exists; nothing runs it. RED
  2. No incident paging. Runbooks exist; nothing fires them. RED
  3. No admin tooling. Operators run psql against three databases. RED
  4. No partner key-rotation runbook. First rotation will be ad-hoc. RED
  5. No backups + no PITR. First disk failure is total. RED

Highest financial risks

  1. EWA cannot be repaid — disbursed obligations sit forever; month-end will surface millions of santim in unexplained receivables. BLACK
  2. Lending repayment is admin-driven — at scale a human cannot run thousands of installments. Without payroll, lending cannot grow past pilot. RED
  3. PayrollClearing → FI remittance does not happen — even when repayment is recorded, funds don't return to the FI. The phantom-asset will catch a regulator's eye. RED
  4. No velocity / daily ceiling per partner — a misconfiguration could initiate 10,000 disbursements before anyone notices. RED
  5. Idempotency-record TTL absent — quiet performance death over 6–12 months. YELLOW

Highest reconciliation risks

  1. No daily cadence + no dashboard — drift detection runs only when invoked manually. RED
  2. No statement-pull automation — humans drop files. A skipped day is invisible until drift accumulates. RED
  3. Partial-settlement event handler missing — bank deducts a correspondent fee; our reconciliation thinks settled-amount = initiated-amount; drift appears. RED
  4. Period-boundary timing rule unenforced — "don't finalize day N until day N+1's statement arrives" is a documented rule with no code. YELLOW
  5. Match-rate / flagged-rate not dashboarded — finance/ops fly blind. RED
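The period-boundary rule in item 4 is small enough to state as code. The sketch below assumes days are ISO dates (YYYY-MM-DD) in one agreed timezone and that "statement arrived" can be represented as a set of dates; both are assumptions. Requiring day N's own statement as well as day N+1's is an addition beyond the documented rule.

```typescript
// Returns the calendar day after isoDay, computed in UTC.
function nextDay(isoDay: string): string {
  const d = new Date(`${isoDay}T00:00:00Z`);
  d.setUTCDate(d.getUTCDate() + 1);
  return d.toISOString().slice(0, 10);
}

// Day N may be finalized only once the statements for both N and N+1 have arrived,
// so late-posted entries for day N that show up on the next statement are not missed.
function canFinalize(day: string, statementsReceived: ReadonlySet<string>): boolean {
  return statementsReceived.has(day) && statementsReceived.has(nextDay(day));
}
```

The same predicate also gives item 2 a detection hook: any day that stays un-finalizable past a threshold points to a skipped statement drop.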

Highest fraud risks

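  1. x-actor-id poisoning: CLOSED (Phase A1, 2026-05-29). The header is rejected with 400 and actorId comes from the session only, so audit logs can no longer be made to lie this way. GREEN (was BLACK).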
  2. No sanctions screening — a partner-bank compliance hit on a transfer we made is a partnership-ending event. RED
  3. No velocity ceilings — single employee can extract maximum-eligible EWA in a single second. RED
  4. No device fingerprinting / no behavioural analytics — first credential-stuffing attack is undetectable. RED
  5. No fraud-response playbook — when fraud lands, response is improvised. RED

§11. The honest top-line

Updated 2026-05-29 after Phase A close-out:

Out of 8 readiness sections:

  • GREEN: 0
  • YELLOW: 3 (observability, DB, AuthN/Z; the last promoted from BLACK by Phase A)
  • RED: 4 (deploy, secrets, incident, risk/fraud/compliance)
  • BLACK: 0 at section level. Phase A closed the three pilot-blocking BLACK items (x-actor-id, cross-tenant CRUD, PII redaction); individual BLACK rows remain in §3, §4, §9, and §10.
  • UNKNOWN: 1 (scale)

The platform is less production-dangerous than before Phase A but still not production-ready. The next dial-moves are Phase B (money correctness — EWA repayment + FI remittance) and Phase D (operational visibility — alerting + recon cadence). 90_DAY_EXECUTION_PLAN.md sequences the rest.

§12. Cross-references