
DemozPay — Go-Live Blockers

Snapshot: 2026-05-29

Companion to: REAL_SYSTEM_STATE.md, PRODUCTION_READINESS.md, 90_DAY_EXECUTION_PLAN.md.

Purpose: the hard list of every item that MUST be true before a real bank rail opens to real users. Every item is a gate, not a nice-to-have.

How this document is used

If any item on this list is open, the platform does not go live. No pilot. No "small employer." No "5 employees." No.

The S4.6 gate (7-day drift-clean soak) was a code-completion gate. It is necessary but not sufficient. This list adds the rest.

The list is deliberately short. It catalogues blockers, not aspirations.

Categories

  • §1 — Authorization / authentication. These are 1-to-3-day fixes. They are pilot-blocking because they create regulatory + privacy + brand risk in the first week.
  • §2 — Money correctness. Without these the ledger lies.
  • §3 — Operational visibility. Without these no one knows when reality and record disagree.
  • §4 — Compliance + identity. Without these regulators don't sign off.
  • §5 — Deploy + safety. Without these the deployment itself is a risk.
  • §6 — Operational human-process. Without these incidents don't get triaged.

Each entry has:

  • A unique ID GL-NN.
  • A 1-line description.
  • The acceptance criterion.
  • An effort estimate.
  • The blast radius if shipped without it.

§1. Authorization / authentication (3 items)

GL-01 — Validate x-actor-id against req.user.id — CLOSED (Phase A1, 2026-05-29)

Status: LIVE. Verified by apps/api/src/identity/tenant/tenant-context.middleware.spec.ts (6 tests incl. spoof rejection with matching and non-matching values) and full test suite green (102/102). The deprecated header is now REJECTED with 400 + warn log; actorId is sourced from req.user.id via AsyncLocalStorage; system call-sites use namespaced 'system:<name>' literals.

Description: the EWA + lending controllers read x-actor-id from a request header and pass it through to use cases as the actorId for audit. Nothing validates that the header matches the authenticated principal.

Acceptance: any request where x-actor-id !== req.user.id returns 403 Forbidden. Audit row's actorId is always req.user.id, never a header value. Unit test covers the spoof attempt.
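A minimal sketch of the acceptance rule, not the shipped middleware. The request shape, the helper name resolveActorId, and the error code ACTOR_MISMATCH are hypothetical. The sketch also assumes an absent header is allowed; the Status line above records that the shipped fix goes further and rejects the header outright.

```typescript
// Hypothetical, simplified request shape; the real middleware types are not shown here.
interface AuthedRequest {
  user: { id: string };
  headers: Record<string, string | undefined>;
}

type ActorResult =
  | { ok: true; actorId: string }
  | { ok: false; status: 403; error: "ACTOR_MISMATCH" }; // error code is an assumption

// Returns the actor to record in the audit row, or a 403 when the header
// disagrees with the authenticated principal.
function resolveActorId(req: AuthedRequest): ActorResult {
  const claimed = req.headers["x-actor-id"];
  if (claimed !== undefined && claimed !== req.user.id) {
    return { ok: false, status: 403, error: "ACTOR_MISMATCH" };
  }
  // The audit actor always comes from the session, never from the header.
  return { ok: true, actorId: req.user.id };
}

// Spoof attempt: session user u1 claims to be u2.
const spoofed = resolveActorId({ user: { id: "u1" }, headers: { "x-actor-id": "u2" } });
// No header at all: the session identity is used.
const honest = resolveActorId({ user: { id: "u1" }, headers: {} });
```

The design point is that the header can only ever cause a rejection; it is never a source of identity.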

Effort: 0.5 day.

Blast radius if not fixed: any authenticated user can disburse another user's EWA and have audit logs blame the spoofed actor. Pilot-blocking.

Files: packages/ewa/backend/presentation/ewa.controller.ts:70, packages/lending/backend/presentation/lending.controller.ts:72.


GL-02 — Tenant-scope the Business + Employee controllers — CLOSED (Phase A2, 2026-05-29)

Status: LIVE at the unit + DI level. Three layers of defense: (1) OrgRoleGuard rejects cross-tenant access (11 unit tests cover every branch); (2) service-layer queries take tenantId as the first param — cross-tenant lookups impossible; (3) EmployeeController force-sets businessId from session and returns 403 TENANT_MISMATCH on body manipulation. EWA + Lending temporarily restricted to admin/owner per D3 pending Employee.userId schema link (Phase B). HTTP-E2E supertest harness for V3/V4 is deferred to Phase B as operational tooling.

Description: BusinessController + EmployeeController are not registered under TenantContextMiddleware. Business has no tenantId (it IS the tenant). Any authenticated user can enumerate every employer and every employer's employees.

Acceptance: GET /api/business returns only the businesses the authenticated user is a member of (via better-auth Member table). GET /api/employees returns only the employees of the user's active business. POST /api/employees rejects if businessId doesn't match the active business. End-to-end test from a second tenant proves zero-row visibility.
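The scoping rules above can be sketched as two pure functions. The Employee and Session shapes and the function names are hypothetical stand-ins, not the real Prisma models or controllers; TENANT_MISMATCH is the code named in the Status line.

```typescript
// Hypothetical shapes for illustration; not the real Prisma models.
interface Employee { id: string; businessId: string; name: string }
interface Session { userId: string; activeBusinessId: string }

type CreateResult =
  | { status: 201; employee: Omit<Employee, "id"> }
  | { status: 403; error: "TENANT_MISMATCH" };

// GET /api/employees: only rows belonging to the session's active business.
function visibleEmployees(all: Employee[], session: Session): Employee[] {
  return all.filter((e) => e.businessId === session.activeBusinessId);
}

// POST /api/employees: reject a body whose businessId targets another tenant.
function checkCreate(body: { businessId?: string; name: string }, session: Session): CreateResult {
  if (body.businessId !== undefined && body.businessId !== session.activeBusinessId) {
    return { status: 403, error: "TENANT_MISMATCH" };
  }
  // businessId is force-set from the session, never trusted from the body.
  return { status: 201, employee: { businessId: session.activeBusinessId, name: body.name } };
}

const rows: Employee[] = [
  { id: "e1", businessId: "biz-a", name: "Employee One" },
  { id: "e2", businessId: "biz-b", name: "Employee Two" },
];
const tenantB: Session = { userId: "u9", activeBusinessId: "biz-b" };

// Tenant B sees none of tenant A's rows.
const seenByB = visibleEmployees(rows, tenantB);
// Tenant B tries to create an employee under tenant A.
const crossTenant = checkCreate({ businessId: "biz-a", name: "Employee Three" }, tenantB);
```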

Effort: 1 day.

Blast radius: privacy incident + regulator incident. Pilot-blocking.

Files: apps/api/src/app/app.module.ts:54-64, apps/api/src/business/business.controller.ts, apps/api/src/workforce/employee/employee.controller.ts.


GL-03 — Strip PII from logs — CLOSED (Phase A3, 2026-05-29)

Status: LIVE. Pino redact config on apps/api (21 explicit paths + 6 wildcards); slog ReplaceAttr filter on services/ledger + services/integration-gateway. Critically: correlation IDs (tx_id, partner_reference, idempotency_key, tenant_id, account_id, disbursement_id) explicitly pass through — positively tested. Webhook raw_body redacted at gateway. 17 TS tests + 29 Go test cases all green. maskEmail/maskPhone/maskNationalId/maskLastN helpers shipped in @demoz-pay/shared-logging.

Description: every error path that throws with a payload may end up in stdout via Pino. No field-level scrubber. National-ID, phone, email, idempotency keys, raw HMAC signatures may all be logged.

Acceptance: Pino redaction config covers: req.headers.authorization, req.headers["x-demoz-signature"], req.body.password, *.nationalId, *.phone, *.email (last-4 only), *.idempotencyKey. Unit test passes a payload containing each field and asserts they're masked. Same applied to Go services via slog handler attribute filter.
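A sketch of the masking behaviour the unit test would assert, written as a plain recursive scrubber rather than the actual Pino redact config. The key sets, the keepLastN helper, and the "[REDACTED]" placeholder are assumptions for illustration and cover only a subset of the fields listed above.

```typescript
// Keys whose values are removed entirely (assumed subset of the acceptance list).
const FULLY_REDACTED = new Set(["authorization", "x-demoz-signature", "password"]);
// Keys whose values keep only their last four characters (assumed interpretation).
const LAST_FOUR_ONLY = new Set(["nationalId", "phone", "email"]);

// Hypothetical helper: mask everything except the last n characters.
function keepLastN(value: string, n: number): string {
  if (value.length <= n) return "*".repeat(value.length);
  return "*".repeat(value.length - n) + value.slice(-n);
}

// Walks any JSON-like value and masks sensitive fields by key name.
function scrub(value: unknown, key = ""): unknown {
  if (FULLY_REDACTED.has(key)) return "[REDACTED]";
  if (typeof value === "string") {
    return LAST_FOUR_ONLY.has(key) ? keepLastN(value, 4) : value;
  }
  if (Array.isArray(value)) return value.map((item) => scrub(item, key));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value)) out[k] = scrub(v, k);
    return out;
  }
  return value;
}

const sample = {
  req: { headers: { authorization: "Bearer abc" } },
  employee: { phone: "+251911234567", tx_id: "tx-1" },
};
// Sensitive fields are masked; the tx_id correlation ID passes through untouched.
const scrubbedJson = JSON.stringify(scrub(sample));
```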

Effort: 2 days.

Blast radius: every error log is a PII leak. Regulator finding. Pilot-blocking under DPA / Ethiopia data-protection requirements (forthcoming).


§2. Money correctness (4 items)

GL-04 — Build LookupAccount adapter primitive — PARTIAL (Phase C, 2026-05-29): existence-check LIVE, name-match PLANNED

Status (existence-check, LIVE): Phase C ships the full fail-closed gate. Real gRPC LookupAccount RPC; Dashen adapter HTTP+HMAC real (against bank-sandbox); mock adapter prefix-driven; bank-sandbox GET /api/v1/accounts/:account real; EWA + Lending disburse use cases reject with 409 *_DESTINATION_ACCOUNT_INVALID when partner cannot confirm account exists. No ledger entry, no InitiateDisbursement, no outbox event on lookup-fail. Audit row written. Correlation ids round-trip. Prometheus metrics exposed (lookup_success_total / lookup_failure_total{reason} / lookup_latency_seconds). 7/7 E2E tests via verify-c-lookup.sh. S3 regression unchanged. See PHASE_C_LOOKUP_ACCOUNT.md.

Status (name-match, PLANNED — Phase C continuation): the adapter returns resolved_holder_name, but no use case compares it against an expected name because the DTO + aggregate carry no "expected holder name" field. Adding it is a Phase C continuation. Pilot impact: an account that exists but is registered to a different holder will pass the gate. Operator-side reconciliation against bank statement holder names is the manual mitigation.

Description: services/integration-gateway/internal/server/lookup_and_health.go:24 returns codes.Unimplemented. Production disbursement must verify the destination account exists at the partner bank and is named what the employer claims it is, BEFORE money moves.

Acceptance: Dashen adapter implements LookupAccount(account_id) → (exists, resolved_name, partner_health). EWA + lending disburse use cases call LookupAccount before InitiateDisbursement; mismatch (account doesn't exist OR resolved-name doesn't fuzzy-match employer-provided name) returns 409 ACCOUNT_MISMATCH. Integration test against bank-sandbox covers both happy + mismatch.
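A sketch of the fail-closed decision, assuming the lookup result has already been fetched. The type names, the lookupGate function, and the reason field are hypothetical. "Fuzzy match" is approximated here by case- and whitespace-insensitive equality, which is an assumption; the acceptance criterion does not define the matching algorithm.

```typescript
// Hypothetical projection of the adapter's lookup response.
interface LookupResult { exists: boolean; resolvedName: string }

type GateDecision =
  | { proceed: true }
  | { proceed: false; status: 409; error: "ACCOUNT_MISMATCH"; reason: "not-found" | "name-mismatch" };

// Assumed stand-in for fuzzy matching: ignore case and collapse whitespace.
function normalizeHolderName(holderName: string): string {
  return holderName.trim().toLowerCase().replace(/\s+/g, " ");
}

// Runs before InitiateDisbursement; any failure stops the flow.
function lookupGate(lookup: LookupResult, expectedHolderName: string): GateDecision {
  if (!lookup.exists) {
    return { proceed: false, status: 409, error: "ACCOUNT_MISMATCH", reason: "not-found" };
  }
  if (normalizeHolderName(lookup.resolvedName) !== normalizeHolderName(expectedHolderName)) {
    return { proceed: false, status: 409, error: "ACCOUNT_MISMATCH", reason: "name-mismatch" };
  }
  return { proceed: true };
}

const missingAccount = lookupGate({ exists: false, resolvedName: "" }, "Abebe Kebede");
const wrongHolder = lookupGate({ exists: true, resolvedName: "Someone Else" }, "Abebe Kebede");
const matchingHolder = lookupGate({ exists: true, resolvedName: "ABEBE  KEBEDE" }, "Abebe Kebede");
```

Only the proceed: true branch may reach the ledger or the gateway; both failure branches return before any money-moving call.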

Effort: 3 days (adapter + use-case wiring + test).

Blast radius: a misrouted disbursement that the bank accepts is unrecoverable without partner cooperation. Reputational + financial loss. Pilot-blocking.


GL-05 — Build EWA repayment use case + payroll-event consumer (or admin endpoint) — CLOSED for admin path (Phase B1, 2026-05-29)

Status: LIVE for the admin endpoint path. RecordEwaRepaymentUseCase ships; POST /api/ewa/requests/:id/record-repayment (admin/owner). Ledger journal correct (DR PayrollClearing / CR ReceivableFromEmployee for principal+fee). Outbox event ewa.repaid.v1 emitted. 5 unit tests cover happy/replay/non-disbursed/not-found/cross-tenant. Payroll-event consumer path still PLANNED — depends on the payroll domain landing (Phase B continuation). Admin path is sufficient for pilot scale.

Description: EWA can be disbursed but has no path to REPAID. The status transition exists in the enum; no caller invokes it.

Acceptance: RecordEwaRepaymentUseCase exists in packages/ewa/backend/application/. It accepts (ewaRequestId, repaidAmountSantim, source); posts ledger entries DR payroll-clearing / CR receivable-from-employee; transitions EWA DISBURSED → REPAID; emits ewa.repaid.v1. Admin endpoint POST /api/ewa/requests/:id/record-repayment is wired (mirrors lending's admin endpoint). 5 unit tests cover the use case. If payroll consumer exists in pilot window, wire it. If not, admin endpoint is sufficient for pilot scale.
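The journal and state transition can be sketched as a pure function. This is not the shipped use case: the types are hypothetical, the sketch assumes full repayment only (principal plus fee), and it throws on a non-DISBURSED request where the real use case may handle replays differently.

```typescript
interface JournalLine { account: string; side: "DR" | "CR"; amountSantim: number }

// Hypothetical projection of the EWA aggregate; only the two states named above.
interface EwaRequest {
  id: string;
  status: "DISBURSED" | "REPAID";
  principalSantim: number;
  feeSantim: number;
}

interface RepaymentOutcome {
  ewa: EwaRequest;
  journal: JournalLine[];
  outboxEvent: "ewa.repaid.v1";
}

function recordEwaRepayment(ewa: EwaRequest, repaidAmountSantim: number): RepaymentOutcome {
  if (ewa.status !== "DISBURSED") {
    throw new Error(`EWA ${ewa.id} is ${ewa.status}, expected DISBURSED`);
  }
  const owedSantim = ewa.principalSantim + ewa.feeSantim;
  // Assumption: partial repayments are out of scope for this sketch.
  if (repaidAmountSantim !== owedSantim) {
    throw new Error(`expected ${owedSantim} santim, got ${repaidAmountSantim}`);
  }
  return {
    ewa: { ...ewa, status: "REPAID" },
    // Balanced entry: debits equal credits.
    journal: [
      { account: "payroll-clearing", side: "DR", amountSantim: owedSantim },
      { account: "receivable-from-employee", side: "CR", amountSantim: owedSantim },
    ],
    outboxEvent: "ewa.repaid.v1",
  };
}

const outcome = recordEwaRepayment(
  { id: "ewa-1", status: "DISBURSED", principalSantim: 100000, feeSantim: 2500 },
  102500,
);
```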

Effort: 1.5 days (admin path) or 3 days (with payroll consumer).

Blast radius: every EWA disbursed in pilot is an open receivable forever. Month-end reconciliation surfaces a multi-million-santim "unclaimed" balance. Pilot-blocking.


GL-06 — Build PayrollClearing → FI remittance for repaid loan installments — PARTIAL (Phase B2, 2026-05-29): ledger side LIVE, bank side PLANNED

Status (ledger side, LIVE): RemitInstallmentToFiUseCase ships; POST /api/loans/:id/installments/:idx/remit-to-fi (admin/owner). Ledger journal correct (DR payable-to-fi-partner[fi_id] / CR payroll-clearing for principal). Outbox event loan.installment_remitted_to_fi.v1 emitted. 5 unit tests cover happy / not-repaid / no-FI-mapping / cross-tenant / replay. Phantom-asset growth on PayrollClearing is structurally closed.

Status (bank side, PLANNED — Phase B continuation): the actual outbound bank transfer (Employer payroll account → FI's pool account) is NOT wired. Requires schema change (no current "employer payroll account number" field) + gateway adapter routing for tenant-owned source accounts. Not pilot-blocking at single-employer scale — operator-driven weekly bank ops + recon catches divergence. Becomes blocking at multi-employer scale.

Description: RecordRepaymentUseCase debits PayrollClearing and credits ReceivableFromBorrower. The matching outbound transfer (PayrollClearing → FI's bank account) is not implemented. PayrollClearing accumulates as a phantom asset.

Acceptance: new use case RemitInstallmentToFiUseCase; per loan.installment_repaid.v1, queue an outbound disbursement via the gateway from payroll-clearing account → FI's pool account; pre-commit ledger entry DR payable-to-fi-partner[fi_id] / CR payroll-clearing; settle on bank confirmation. Idempotent on (loanId, installmentIndex). Integration test against bank-sandbox covers happy + retry paths.
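Idempotency on (loanId, installmentIndex) can be illustrated with an in-memory queue. The class and field names are hypothetical, and the Map stands in for what would presumably be a database unique constraint in the real system.

```typescript
// Hypothetical remittance request derived from loan.installment_repaid.v1.
interface Remittance {
  loanId: string;
  installmentIndex: number;
  fiId: string;
  principalSantim: number;
}

class RemittanceQueue {
  // In-memory stand-in for a unique constraint on (loanId, installmentIndex).
  private readonly byKey = new Map<string, Remittance>();

  // Returns queued=false and the original remittance when the key was already seen.
  enqueue(r: Remittance): { queued: boolean; remittance: Remittance } {
    const key = `${r.loanId}:${r.installmentIndex}`;
    const existing = this.byKey.get(key);
    if (existing !== undefined) return { queued: false, remittance: existing };
    this.byKey.set(key, r);
    return { queued: true, remittance: r };
  }

  get size(): number {
    return this.byKey.size;
  }
}

const queue = new RemittanceQueue();
const firstAttempt = queue.enqueue({ loanId: "loan-7", installmentIndex: 2, fiId: "fi-1", principalSantim: 50000 });
// A retry of the same installment must not queue a second transfer.
const retryAttempt = queue.enqueue({ loanId: "loan-7", installmentIndex: 2, fiId: "fi-1", principalSantim: 50000 });
```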

Effort: 3 days.

Blast radius: PayrollClearing grows without bound; FI partner reports show their owed-amount didn't decrease; partnership-credibility incident. Pilot-blocking.


GL-07 — Idempotency-record TTL cleanup job

Description: IdempotencyRecord table has no DELETE policy. At pilot scale this is fine; at production scale (months 6–12) the table reaches hundreds of millions of rows and index-bloat degrades every money-moving POST.

Acceptance: daily cron deletes IdempotencyRecord rows where createdAt < NOW() - INTERVAL '14 days' AND result IS NOT NULL (i.e., completed). Documented retention window in docs/runbooks/idempotency-cleanup.md (new). Counter demozpay_idempotency_records_total exposed. Test against a seeded DB confirms the rows are removed.
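The deletion predicate can be checked in isolation before it is turned into SQL. The row shape and function name below are hypothetical; the 14-day window and the "completed only" condition come from the acceptance criterion.

```typescript
// Hypothetical projection of an IdempotencyRecord row.
interface IdempotencyRow { key: string; createdAt: Date; result: string | null }

const RETENTION_DAYS = 14;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Mirrors: createdAt < NOW() - INTERVAL '14 days' AND result IS NOT NULL
function isDeletable(row: IdempotencyRow, now: Date): boolean {
  const cutoffMs = now.getTime() - RETENTION_DAYS * MS_PER_DAY;
  return row.createdAt.getTime() < cutoffMs && row.result !== null;
}

const now = new Date("2026-06-30T00:00:00Z");
const seeded: IdempotencyRow[] = [
  { key: "old-completed", createdAt: new Date("2026-06-01T00:00:00Z"), result: "ok" },
  { key: "old-in-flight", createdAt: new Date("2026-06-01T00:00:00Z"), result: null },
  { key: "recent-completed", createdAt: new Date("2026-06-20T00:00:00Z"), result: "ok" },
];
// Only the old, completed row qualifies for deletion.
const deletableKeys = seeded.filter((row) => isDeletable(row, now)).map((row) => row.key);
```

Keeping in-flight rows regardless of age means a stuck request is never silently forgotten by the cleanup job.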

Effort: 0.5 day.

Blast radius: silent performance death. Not pilot-blocking; required before month 3.


§3. Operational visibility (5 items)

GL-08 — Alerting backbone (Alertmanager / Grafana OnCall / PagerDuty) — PARTIAL (Phase D, 2026-05-29): rules + SLOs LIVE as code, deployment PLANNED

Status (code, LIVE): infra/prometheus/alerts.yml covers every metric currently exposed (API + gateway + ledger + recon-runner) with severity tags + runbook links. infra/prometheus/prometheus.yml provides the scrape config template. docs/architecture/SLOS_AND_ALERTING.md codifies every SLO, error budget, and escalation path.

Status (deployment, PLANNED): Prometheus + Alertmanager + paging provider (PagerDuty / Opsgenie / Slack-OnCall) are operational deliverables. The cost is well-bounded — separate platform-team workstream, ~5-10 elapsed days. Code-side is reviewable now.

Description: counters exist; nothing fires on them.

Acceptance: at minimum one paging integration is deployed. The following alert rules fire to it:

  • demozpay_bank_webhook_requests_total{result="signature-rejected"} > 5% of total over 5min → page.
  • demozpay_dependency_up{dependency="integration-gateway"} == 0 for > 1min → page.
  • demozpay_dependency_up{dependency="ledger"} == 0 for > 1min → page.
  • demozpay_outbox_oldest_unpublished_age_seconds > 300 → page.
  • demozpay_settlement_poller_ticks_total{outcome="errored"} rate > 0.5/min over 5min → page.
  • New rule: any non-zero drift from Ledger.ReconcileWithBank → page immediately.

Each alert links to its runbook. On-call drill executed.
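Three of the thresholds above, expressed as a plain function over a metrics snapshot. This is not Prometheus rule syntax; the snapshot shape and the alert names are hypothetical and only illustrate the comparison logic a rule would encode.

```typescript
// Hypothetical snapshot of already-aggregated metric values.
interface MetricsSnapshot {
  webhookRequests5m: number;          // all webhook requests in the last 5 minutes
  webhookSignatureRejected5m: number; // of those, result="signature-rejected"
  outboxOldestUnpublishedAgeSeconds: number;
  reconcileDriftSantim: number;
}

function pagingAlerts(m: MetricsSnapshot): string[] {
  const alerts: string[] = [];
  // Multiplication instead of division avoids a divide-by-zero on quiet periods.
  if (m.webhookSignatureRejected5m > 0.05 * m.webhookRequests5m) alerts.push("webhook-signature-rejections");
  if (m.outboxOldestUnpublishedAgeSeconds > 300) alerts.push("outbox-stalled");
  // Any non-zero drift pages, in either direction.
  if (m.reconcileDriftSantim !== 0) alerts.push("ledger-bank-drift");
  return alerts;
}

const quietPeriod = pagingAlerts({
  webhookRequests5m: 1000,
  webhookSignatureRejected5m: 10,
  outboxOldestUnpublishedAgeSeconds: 12,
  reconcileDriftSantim: 0,
});
const badPeriod = pagingAlerts({
  webhookRequests5m: 1000,
  webhookSignatureRejected5m: 80,
  outboxOldestUnpublishedAgeSeconds: 301,
  reconcileDriftSantim: -500,
});
```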

Effort: 2 weeks (Grafana + Alertmanager + paging integration + rule authoring + drill).

Blast radius: incidents go unnoticed. Pilot-blocking.


GL-09 — Daily reconciliation cadence — PARTIAL (Phase D, 2026-05-29): runner LIVE, cron-wiring PLANNED

Status (runner, LIVE): services/integration-gateway/cmd/recon-runner/main.go ships as a standalone Go binary. Accepts --tenants + --partners flags, queries pg-gateway via GATEWAY_DATABASE_URL, invokes the existing Runner + Matcher, emits structured JSON summary on stdout (scanned / matched / flagged / errors / next-action-hint per (tenant, partner) pair, plus aggregate). Exits non-zero on runner errors; zero on flagged lines (they are signal, not runner failure). Compiles + builds clean. PostgresStore.FindByPartnerReference added so the matcher can do production lookups.

Status (cron-wiring, PLANNED): the operator schedules this binary via a Kubernetes CronJob, a docker-compose cron sidecar, GitHub Actions, or an equivalent scheduler. Wiring is an ops swap.

Status (pushgateway / last-success metric, PLANNED): the alert rule DemozpayReconciliationRunnerMissedDay references a counter the runner doesn't yet emit. Phase D continuation: switch from stdout-JSON to pushgateway-style emission.

Description: the reconciliation primitive (ReconcileWithBank) is LIVE. No scheduled invoker runs it.

Acceptance: scheduled job (cron in k8s, GitHub Actions, or hosted scheduler) runs daily at T+1 07:30 UTC per (tenant, partner, account): ingest statement → match → call ReconcileWithBank → publish summary to a Slack channel + drift counter. The summary message either says "green: all 5 accounts drift=0" or enumerates the offenders.
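A sketch of the summary message builder. The AccountRecon shape, the function name, and the wording of the non-green message are assumptions; only the green line follows the text above.

```typescript
// Hypothetical per-account reconciliation result.
interface AccountRecon { tenant: string; partner: string; account: string; driftSantim: number }

// Builds the daily message: one green line, or a header plus one line per offender.
function reconSummary(results: AccountRecon[]): string {
  const offenders = results.filter((r) => r.driftSantim !== 0);
  if (offenders.length === 0) {
    return `green: all ${results.length} accounts drift=0`;
  }
  const lines = offenders.map(
    (o) => `${o.tenant}/${o.partner}/${o.account}: drift=${o.driftSantim} santim`,
  );
  return [`red: ${offenders.length} of ${results.length} accounts drifted`, ...lines].join("\n");
}

const cleanDay = reconSummary([
  { tenant: "t1", partner: "dashen", account: "a1", driftSantim: 0 },
  { tenant: "t1", partner: "dashen", account: "a2", driftSantim: 0 },
]);
const driftDay = reconSummary([
  { tenant: "t1", partner: "dashen", account: "a1", driftSantim: 0 },
  { tenant: "t1", partner: "dashen", account: "a2", driftSantim: 1500 },
]);
```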

Effort: 1 week (job + ingestion-fetch + Slack hook + test).

Blast radius: drift accumulates silently. Pilot-blocking.


GL-10 — Statement-pull automation per partner

Description: statements are dropped into the ingestion pipeline manually. A skipped day is invisible.

Acceptance: per partner adapter, a FetchStatement(period) method exists and pulls the file from partner SFTP / partner API. Implemented for Dashen. Job runs daily. Counter demozpay_statement_ingest_age_seconds tracks "time since last successful ingest per partner".

Effort: 1 week (Dashen-specific; longer per partner).

Blast radius: silent recon blindness. Pilot-blocking.


GL-11 — Prometheus metrics on Go services — CLOSED (Phase D, 2026-05-29; gateway side closed in Phase C)

Status: LIVE on both Go services. Gateway (Phase C): lookup_success_total / lookup_failure_total{reason} / lookup_latency_seconds exposed at :50053/metrics. Ledger (Phase D): demozpay_ledger_rpc_requests_total{rpc,outcome} + demozpay_ledger_rpc_latency_seconds{rpc} histogram + demozpay_ledger_entries_posted_total + demozpay_ledger_transaction_status_transitions_total{from,to} + demozpay_ledger_reconcile_drift_santim{partner} gauge. Surfaced via a gRPC interceptor + HTTP listener on :50054/metrics + /healthz. Interceptor unit-tested (4 outcome classes + shortMethod + outcomeFromError). Cardinality-disciplined: no tenant_id labels.

Description: services/ledger/ and services/integration-gateway/ have NO Prometheus instrumentation. Ledger could fail without anyone noticing.

Acceptance: Go services expose /metrics on the same port pattern as the API. Counters: RPC count + duration histogram per RPC + DB-pool saturation + error count per RPC. Wired into scrape config.

Effort: 1 week.

Blast radius: ledger or gateway silently degrades and no chart shows it. Pilot-blocking.


GL-12 — Per-partner adapter health surfacing

Description: GetAdapterStatus always returns HEALTHY (services/integration-gateway/internal/server/lookup_and_health.go:48).

Acceptance: adapter tracks rolling-window error rate + p95 latency; GetAdapterStatus returns DEGRADED if error-rate > 5% over 5min OR p95 > 10s. API consults adapter health before submitting a non-urgent disbursement.
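The DEGRADED rule can be sketched over a list of recent call samples. The sample shape and function name are hypothetical, p95 uses the nearest-rank method, and an empty window is treated as HEALTHY; all three are assumptions not fixed by the criterion.

```typescript
// Hypothetical record of one partner call.
interface CallSample { atMs: number; ok: boolean; latencyMs: number }

const WINDOW_MS = 5 * 60 * 1000;
const MAX_ERROR_RATE = 0.05;
const MAX_P95_MS = 10000;

function adapterStatus(samples: CallSample[], nowMs: number): "HEALTHY" | "DEGRADED" {
  const recent = samples.filter((s) => s.atMs <= nowMs && nowMs - s.atMs <= WINDOW_MS);
  // Assumption: no traffic in the window means no evidence of degradation.
  if (recent.length === 0) return "HEALTHY";

  const errorRate = recent.filter((s) => !s.ok).length / recent.length;
  const sortedLatencies = recent.map((s) => s.latencyMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
  const p95 = sortedLatencies[Math.ceil(0.95 * sortedLatencies.length) - 1] ?? 0;

  return errorRate > MAX_ERROR_RATE || p95 > MAX_P95_MS ? "DEGRADED" : "HEALTHY";
}

// Twenty calls, one per second, all fast and successful.
const healthyCalls: CallSample[] = Array.from({ length: 20 }, (_, i) => ({
  atMs: i * 1000, ok: true, latencyMs: 100,
}));
// Same traffic, but the last two calls failed (10% error rate).
const failingCalls = healthyCalls.map((s, i) => (i >= 18 ? { ...s, ok: false } : s));
// Same traffic, but the last two calls took 12 seconds.
const slowCalls = healthyCalls.map((s, i) => (i >= 18 ? { ...s, latencyMs: 12000 } : s));

const healthyStatus = adapterStatus(healthyCalls, 20000);
const failingStatus = adapterStatus(failingCalls, 20000);
const slowStatus = adapterStatus(slowCalls, 20000);
```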

Effort: 2 days.

Blast radius: a degraded partner pulls down user-facing flows without surfacing the cause. Operational pain.


§4. Compliance + identity (3 items)

GL-13 — KYC primitive

Description: no packages/kyc/ exists. Production cannot ship in Ethiopia without identity verification — Fayda national ID lookup, document capture, liveness check.

Acceptance: at minimum, a verification endpoint that captures (nationalId, photo, document) and stores under a tenant-scoped table; on creation, an outbox event kyc.verification_requested.v1 is emitted for a (planned) backend reviewer or third-party provider integration. Pilot tier-1 allows manual reviewer approval; tier-2 integrates Fayda when API access lands.

Effort: 2 weeks (data model + basic UI + manual workflow).

Blast radius: regulator finding on first conversation. Pilot-blocking.


GL-14 — Sanctions screening on every disbursement

Description: no sanctions screening. A disbursement to a sanctioned individual is a partnership-ending event.

Acceptance: pre-InitiateDisbursement step calls a sanctions-check primitive (an enum-list match against OFAC + UN + Ethiopia sanctions lists is sufficient for pilot; a real provider integration is a Q3 item). On hit, disbursement halted, audit row created, alert raised, manual-review workflow triggered.
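A sketch of the pilot-tier list match. The list entries are placeholders and not real sanctions data; the function names, result shape, and action labels are hypothetical. Matching is exact after case and whitespace normalization, which is an assumption and weaker than what a real screening provider would do.

```typescript
// Placeholder entries only. A real deployment would load the published lists.
const PLACEHOLDER_SANCTIONS_LIST: readonly string[] = [
  "Example Listed Person",
  "Sample Blocked Entity",
];

type ScreeningResult =
  | { hit: false }
  | { hit: true; matchedEntry: string; actions: string[] };

function normalizeForScreening(fullName: string): string {
  return fullName.trim().toLowerCase().replace(/\s+/g, " ");
}

// Runs before InitiateDisbursement; a hit halts the flow and lists the follow-up steps.
function screenBeneficiary(
  fullName: string,
  list: readonly string[] = PLACEHOLDER_SANCTIONS_LIST,
): ScreeningResult {
  const needle = normalizeForScreening(fullName);
  const matchedEntry = list.find((entry) => normalizeForScreening(entry) === needle);
  if (matchedEntry === undefined) return { hit: false };
  return {
    hit: true,
    matchedEntry,
    actions: ["halt-disbursement", "write-audit-row", "raise-alert", "open-manual-review"],
  };
}

const cleanBeneficiary = screenBeneficiary("Abebe Kebede");
const listedBeneficiary = screenBeneficiary("  example   LISTED person ");
```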

Effort: 1 week (enum-list pilot tier).

Blast radius: regulator + partner-bank finding. Pilot-blocking.


GL-15 — ADR-014: orchestrator, not custodian

Status: docs/adr/ADR-014-orchestrator-not-custodian.md shipped. Cites: ledger taxonomy (no cash account — GAP-04 enforced), schema-level enforcement (ADR-006 forbids stored balances; legacy Wallet columns @deprecated + ESLint-blocked), money-flow invariant (every disbursement traces partner-bank → partner-bank), truth ranking (bank wins over ledger), and the multi-layer enforcement mechanism. Lists the four custody alternatives that were rejected with reasoning. Indexed in docs/adr/README.md.

Pending: Legal / Compliance counter-sign before external use (regulator + partner-bank reviews). Engineering side accepted.

Description: the architectural commitment is implicit across ADR-006 + the gap action plan. It is not written down as a single citable statement.

Acceptance: docs/adr/ADR-014-orchestrator-not-custodian.md exists. Cites which Prisma columns are evidence of orchestrator-only (no cash account, payable-to-business-bank not balance, etc.). Cites NBE licensing categories we operate under and which we don't. Signed by engineering lead + legal/compliance lead.

Effort: 1 day to draft + 1 week for legal review.

Blast radius: without this, the regulator conversation lacks a single authoritative reference. Pilot-blocking on the regulator timeline.


§5. Deploy + safety (4 items)

GL-16 — Production-grade secrets management

Description: every secret is an env var.

Acceptance: secrets live in Vault / AWS SM / GCP SM / Azure KV / Doppler — pick one. Application reads via init-container or sidecar; never writes to .env in any environment. Rotation procedure documented in docs/runbooks/secrets-rotation.md.

Effort: 1 week + ongoing per-secret-source operational cost.

Blast radius: the next quarterly bank-key rotation cannot happen cleanly. Pilot-blocking.


GL-17 — TLS at the edge

Description: every component is HTTP today.

Acceptance: ingress terminates TLS 1.3 with a real cert (Let's Encrypt or a managed cert). HSTS header. Internal traffic between services either also TLS or runs over a service mesh / private network. Smoke test confirms HTTP requests are 301-redirected and HSTS is announced.

Effort: 1 week.

Blast radius: partner-bank security review fails on first question. Pilot-blocking.


GL-18 — Database backups + restore drill

Description: no backup, no restore procedure.

Acceptance: PITR-capable backup (cloud-managed PG or wal-g); off-site retention ≥ 14 days for PG-API + PG-ledger + PG-gateway. Restore drill executed: drop a test DB, restore from backup, verify a recent transaction exists. Recovery time objective + recovery point objective documented (RTO ≤ 4h, RPO ≤ 5min recommended for ledger).

Effort: 1.5 weeks.

Blast radius: first disk failure or accidental TRUNCATE is total. Pilot-blocking.


GL-19 — CI: secret-leak scanning + dependency-vuln scanning

Description: no SAST, no secret-scanning, no dep-vuln in CI.

Acceptance: PR pipeline includes gitleaks (or trufflehog) blocking merge on secret-leak; Snyk / OSV / Dependabot blocking merge on critical-severity vulns.

Effort: 2 days.

Blast radius: an accidentally-committed secret in a Friday PR is in the commit history Saturday morning. Pilot-blocking on basic hygiene grounds.


§6. Operational human-process (3 items)

GL-20 — On-call rota + incident-commander training

Description: runbooks exist; the human side of incident response doesn't.

Acceptance: named on-call rota (≥ 3 engineers, weekly handoff). Each named engineer has done one tabletop incident drill against one of the four go-live runbooks. Incident-commander role + scribe role documented. Postmortem template written.

Effort: 1 week elapsed (mostly people, not code).

Blast radius: first real incident is chaos. Pilot-blocking.


GL-21 — Routine reconciliation triage runbook

Description: the 4 go-live runbooks cover incidents. They do not cover the routine daily-recon triage process. The first day the on-call engineer reads "drift = 0", they will not know that it is the green light.

Acceptance: docs/runbooks/reconciliation-daily-process.md exists. Walks through: where to find the daily summary; what each panel means; when to escalate; how to close-out the day; the audit trail for "I triaged today's recon at 09:14 UTC and it was clean."

Effort: 1 day.

Blast radius: routine triage is improvised. Pilot-blocking.


GL-22 — Sign-off matrix

Description: the action plan calls for engineering / security / finance / ops sign-off before go-live. The sign-off form doesn't exist.

Acceptance: a one-page sign-off form lists each of GL-01..21 (and any added between now and pilot). Each item has a verifier name + date + evidence link. Stored as docs/architecture/GO_LIVE_SIGNOFF.md and filled in at the gate. Counter-signed by all leads.

Effort: 0.5 day.

Blast radius: going live without a documented sign-off creates ambiguity about who said yes. Pilot-blocking on process grounds.


§7. The honest count

22 blockers. All of them are achievable in 60–90 days with a focused 3-engineer team. None individually is a multi-month project. Several depend on each other (GL-08 alerting depends on GL-11 Go metrics; GL-13 KYC depends on GL-15 ADR-014 being written).

Pilot does NOT mean "all 22 closed." Pilot means the pilot-blocking items closed. The list below partitions them:

Pilot-blockers (must close BEFORE first real money to first real employer):

GL-01, GL-02, GL-03, GL-04, GL-05, GL-06, GL-08, GL-09, GL-10, GL-11, GL-13, GL-14, GL-15, GL-16, GL-17, GL-18, GL-20, GL-21, GL-22.

That's 19 items.

Pre-month-3 (close within first 90 days post-pilot):

GL-07 (idempotency TTL), GL-12 (adapter health), GL-19 (secret-leak scanning).

That's 3 items.
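The partition can be checked mechanically. The snippet below is only an arithmetic check on the two lists above: together they should cover GL-01 through GL-22 exactly once.

```typescript
// GL numbers copied from the two lists above.
const pilotBlockers = [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 20, 21, 22];
const preMonth3 = [7, 12, 19];

const combined = [...pilotBlockers, ...preMonth3].sort((a, b) => a - b);

// A complete partition is exactly 1..22 with no gaps and no duplicates.
const isCompletePartition =
  combined.length === 22 && combined.every((n, i) => n === i + 1);

const counts = { pilot: pilotBlockers.length, preMonth3: preMonth3.length, total: combined.length };
```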

§8. What this list does NOT cover

  • Frontend integration (admin + employer + employee web apps still mock). A pilot can run on API-only with a single integrated frontend (recommend employer-web first). The remaining frontends are next-quarter work.
  • BNPL / Savings / Equb domains. Out of scope for first pilot. Roadmap conversation, not a blocker.
  • Advanced fraud (behavioural analytics, device fingerprinting). Post-pilot. Pilot tier uses static velocity ceilings (TODO add as a sub-item under GL-14).
  • Payroll domain. Critical for scaling beyond pilot. The pilot can run on admin-triggered repayment (GL-05 + GL-06's admin paths). At 20+ employers payroll becomes a hard requirement.
  • Mobile apps. Post-pilot. Web is sufficient for pilot.

§9. Decision rule

Hold this list up to anyone who says "let's ship the pilot."

  • If they want to skip GL-01..03: they don't understand the AuthZ model.
  • If they want to skip GL-04..06: they don't understand the money model.
  • If they want to skip GL-08..11: they don't understand operational reality.
  • If they want to skip GL-13..15: they don't understand regulatory reality.
  • If they want to skip GL-16..19: they don't understand deployment reality.
  • If they want to skip GL-20..22: they don't understand incident reality.

All 19 pilot-blockers are reasonable engineering investments. None is exotic. The cost of opening a rail before any one of them is closed exceeds the cost of closing all 19.

§10. Cross-references