Target Architecture Alignment Plan
Purpose: the step-by-step execution plan to move DemozPay from its current code to the Target Platform Architecture blueprint. The blueprint says where we're going and why; this document says how, in what order, and when it's done.
How to use: work top-to-bottom. Each workstream has checkboxes, acceptance criteria, and a phase exit gate that must be green before starting the next phase. Tick boxes as you land PRs; link the PR next to each box.
Related: ARCHITECTURE_ALIGNMENT_PLAN.md · ARCHITECTURE_ALIGNMENT_WORKLOG.md · PAYOUT_ROUTING_PLAN.md
Maturity legend (same as blueprint): 🟢 Today · 🔵 MVP · 🟡 Post-MVP · ⚪ Long-term
Guiding rules for every step
- Never break money to chase architecture. Track A (launch blockers) lands before any extraction work.
- The contract is permanent; the deployment is not. Define the proto/event schema before moving code.
- No step is "done" until deployed + wired + tested. Half-built ≠ done (this is the disease we're curing).
- One PR = one reversible step. Money-moving PRs satisfy ADR-005…ADR-013 and get founder review.
- Prove RLS under a non-superuser role for anything touching tenant data (local superuser hides every RLS hole).
At-a-glance phase map
| Phase | Goal | Duration | Exit gate |
|---|---|---|---|
| Phase 0 | Launch MVP: money flows work end-to-end; contract + schema spine in place | ~8 wks | §Phase 0 exit gate |
| Phase 1 | First customers: extract Payroll→Go, Notifications, Recon; lightweight orchestrator | ~3–6 mo | §Phase 1 exit gate |
| Phase 2 | Growth: small K8s, per-service CI, extract Screening/Fraud | as load warrants | §Phase 2 exit gate |
| Phase 3 | Regional scale: DR, 2nd/3rd bank adapter, Treasury | regional demand | §Phase 3 exit gate |
| Phase 4 | Millions of users: shard Ledger, multi-region, CQRS | scale demand | §Phase 4 exit gate |
PHASE 0 — Launch MVP (now → ~8 weeks)
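Five parallel tracks. Track A is the critical path: it can ship independently and must land first. Track B is the architecture spine. Track C is cheap cleanup that can happen anytime. Track D aligns the apps/api code structure with the bounded contexts (pair it with B1). Track E closes the test debt on features that are already built.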
Track A — Launch blockers (correctness; highest priority)
These are verified bugs that prevent money from moving end-to-end. None require the new architecture — they fix what exists.
A1 — Deploy the Go money tier 🔵
Current: services/ledger and services/integration-gateway have Dockerfiles but are not in docker-compose.yml; apps/api is configured with LEDGER_GRPC_ADDR: ledger:50051 and INTEGRATION_GATEWAY_GRPC_ADDR: integration-gateway:50052 (compose lines ~119-120) — dead hosts.
Target: both services run in compose with their own databases; the API's gRPC dials resolve.
- Add `ledger` service to `docker-compose.yml` (image build from `services/ledger/Dockerfile`, `LEDGER_DATABASE_URL`, port `50051`).
- Add `integration-gateway` service (`GATEWAY_DATABASE_URL`, port `50052`).
- Create their databases (separate DB or schema — see B1) + run their raw-SQL migrations on boot.
- Add a boot-time health/dial check in `apps/api` that fails loudly if the ledger/gateway gRPC is unreachable (no more silent dead-host).
- Update `README.md` Step-3 migrations + Step-4 health check to include both services.

Acceptance: `pnpm docker:up` brings up ledger + gateway; `apps/api` logs a successful gRPC handshake to both; a manual ledger `PostTransaction` round-trips.
Effort: ~1–2 days · Risk: Low
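
The boot-time dial check in A1 could look roughly like the sketch below. It is an illustration under stated assumptions, not the `apps/api` implementation: `Dialer`, `DependencyTarget`, and `assertDependenciesReachable` are hypothetical names, and the dialer is injected so the sketch does not depend on any particular gRPC client library.

```typescript
// Hypothetical: a dialer resolves when the endpoint answers and rejects otherwise.
type Dialer = (addr: string) => Promise<void>;

interface DependencyTarget {
  name: string; // e.g. "ledger"
  addr: string; // e.g. the value of LEDGER_GRPC_ADDR
}

// Tries each target up to `attempts` times and throws one error naming every
// unreachable dependency, so the process fails at boot instead of at first use.
async function assertDependenciesReachable(
  targets: DependencyTarget[],
  dial: Dialer,
  attempts: number,
): Promise<void> {
  const failures: string[] = [];
  for (const target of targets) {
    let reachable = false;
    let lastError = "unknown error";
    for (let i = 0; i < attempts && !reachable; i++) {
      try {
        await dial(target.addr);
        reachable = true;
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
      }
    }
    if (!reachable) {
      failures.push(`${target.name} (${target.addr}): ${lastError}`);
    }
  }
  if (failures.length > 0) {
    throw new Error(`gRPC dependencies unreachable: ${failures.join("; ")}`);
  }
}
```

Calling this once during application bootstrap with the ledger and gateway addresses, and letting the rejection abort startup, would turn the silent dead-host into a loud one.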
A2 — Enable the payroll→repayment consumer 🔵
Current: PayrollConsumersModule is commented out at apps/api/src/app/app.module.ts:172 with a TODO(boot-fix) — RecordEwaRepaymentUseCase / RecordRepaymentUseCase aren't visible to its DI scope. So payroll.deductions_taken.v1 events pile up and no EWA/loan repayment is ever recorded.
Target: the consumer is enabled and records repayments.
- Restructure DI: move `EwaModule.register()` / `LendingModule.register()` into the `@Global` `*ApiModule.imports` and re-export, then import the `@Global` ApiModules from `PayrollConsumersModule` (the fix the TODO itself describes).
- Uncomment `PayrollConsumersModule` in `app.module.ts`.
- Add an integration test: payroll deduction event → EWA repayment recorded.

Acceptance: approving a payroll run with an active EWA advance records a repayment in the ledger; test green.
Effort: ~2–3 days · Risk: Medium (DI surgery)
A3 — Fix the Bank Gateway webhook RLS bug 🔵
DIAGNOSED (2026-06-18), not yet applied.
Root cause confirmed: `ListTenantsWithNonTerminal` (`services/integration-gateway/internal/store/postgres_store.go:199`) calls `s.pool.Query` directly — no `app.tenant_id` — while every other method uses `pg.WithTenantTx`. Under FORCE RLS (`migrations/0001_init.up.sql:200-206`, `USING tenant_id = current_setting('app.tenant_id', true)`) it returns 0 rows → empty tenant list → `webhook handler.go` "no matching disbursement" → HTTP 200 with no state change. Hidden in dev because the `gateway` docker role is a Postgres superuser (bypasses RLS) — only manifests under a NOSUPERUSER prod role, so verification requires A4's harness.
Nonce replay: `adapters/dashen/signing.go:55-72` validates a 5-min timestamp window + HMAC but never records/checks the nonce → replayable for 5 min.
Recommended fix: (1) a `SECURITY DEFINER` SQL function `gateway_resolve_disbursement(partner, partner_ref)` owned by a BYPASSRLS role (or a second pool on a dedicated BYPASSRLS role) for the trusted cross-tenant webhook lookup; (2) a `webhook_nonce(partner, nonce, seen_at)` UNIQUE table (or Redis SETNX, TTL = window) checked in `verifyIncoming`.
Blocked here: no Go toolchain in this environment (can't compile/test); do in a Go env, sequenced after A4.
Current: the gateway webhook handler does a cross-tenant read with no `app.tenant_id` set; under `FORCE RLS` it returns 0 rows, falls through to "no matching disbursement," and returns 200 without advancing state → async settlement silently never completes. No nonce/replay check (only a 5-min HMAC window).
Target: settlement webhooks correctly match + advance the disbursement; replays are rejected.
- Set tenant context (or use a scoped BYPASSRLS role) before the disbursement lookup in the gateway webhook path.
- Persist + check the `X-Demoz-Nonce` to reject replays.
- Add a test: a valid settlement webhook flips the disbursement to SETTLED; a replayed webhook is rejected.

Acceptance: bank-sandbox settlement webhook advances state to SETTLED/FAILED; replay returns 409/ignored; test green.
Effort: ~2–3 days · Risk: Medium (money path)
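
The nonce check in A3 can be illustrated with a small in-memory guard. This is a sketch only: the real fix lives in the Go gateway and would use the `webhook_nonce` UNIQUE table or Redis SETNX; `NonceGuard` and its method names are hypothetical, and the map below merely stands in for that store.

```typescript
// In-memory stand-in for the webhook_nonce(partner, nonce, seen_at) table
// (or Redis SETNX with TTL = window). Class and method names are hypothetical.
class NonceGuard {
  // key -> epoch ms after which the signed timestamp is outside the window anyway
  private readonly expiresAt = new Map<string, number>();
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns true only for the first sighting of (partner, nonce) inside the window.
  accept(partner: string, nonce: string, timestampMs: number, nowMs: number): boolean {
    if (Math.abs(nowMs - timestampMs) > this.windowMs) {
      return false; // stale or future-dated timestamp
    }
    this.evictExpired(nowMs);
    const key = JSON.stringify([partner, nonce]);
    if (this.expiresAt.has(key)) {
      return false; // replay inside the window
    }
    // Keep the nonce until its timestamp can no longer pass the window check.
    this.expiresAt.set(key, timestampMs + this.windowMs);
    return true;
  }

  private evictExpired(nowMs: number): void {
    this.expiresAt.forEach((expiry, key) => {
      if (nowMs > expiry) {
        this.expiresAt.delete(key);
      }
    });
  }
}
```

The retention period is tied to the signed timestamp rather than to the arrival time, so a nonce is never forgotten while a replay of the same signed request could still pass the timestamp window.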
A4 — Prove RLS under a non-superuser role in CI 🔵
Current: local docker myuser is SUPERUSER + BYPASSRLS, so every FORCE RLS policy is silently bypassed; RLS correctness is unverified.
Target: a CI job runs the cross-tenant denial tests under a NOSUPERUSER role.
- Add a CI Postgres role `app_rls_test` (NOSUPERUSER, no `BYPASSRLS`).
- Write/relocate cross-tenant denial tests (read + write) for each financial table behind RLS.
- Wire into the CI pipeline; fail the build on any cross-tenant leak.

Acceptance: CI proves tenant A cannot read/write tenant B's financial rows under the non-superuser role.
Effort: ~3–4 days · Risk: Medium (may surface real RLS holes — that's the point)
A5 — Fix the SessionMiddleware 401 gap + add a guard 🔵
Current: apps/api/src/app/app.module.ts:219-273 is a hand-maintained .forRoutes(...) class list; ≥9 authenticated controllers are missing (PayrollPlatformAdmin, AutoLockPolicy, CourtOrderRemit, PayrollAudit, PayrollPdf, PayrollSettlement, BankWebhookReplay, and the entire /api/me/* self-service surface — Me + MeEqub) → they 401 on every authenticated request. CLAUDE.md admits only 2 of these; the real gap is larger and includes all of employee-web's self-service (profile, loans, private Equb).
Target: auth middleware applies uniformly; drift is impossible.
- Replace the manual class list with a path-glob / global middleware that skips only `@Public()` + health/metrics.
- Add a CI guard (or test) that fails if any non-`@Public()` controller is unreachable through the middleware chain.
- Smoke-test the previously-broken routes (`/api/me/*`, payroll PDF/audit/settlement).

Acceptance: all authenticated controllers resolve `req.user`; CI guard catches a deliberately-unregistered controller.
Effort: ~1 day · Risk: Low
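
The CI guard from A5 reduces to a set check: every controller that is not `@Public()` must be covered by the middleware. The sketch below shows that check in isolation. `ControllerInfo`, `findUnprotectedControllers`, and `assertAllControllersProtected` are hypothetical names; in the real app the controller list would presumably be collected from the NestJS module metadata rather than written by hand.

```typescript
// Hypothetical controller metadata; the real list would come from reflection.
interface ControllerInfo {
  name: string;
  isPublic: boolean; // true when the controller carries @Public()
}

// Returns the names of non-public controllers that the middleware does not cover.
function findUnprotectedControllers(
  controllers: ControllerInfo[],
  isCovered: (controller: ControllerInfo) => boolean,
): string[] {
  return controllers
    .filter((c) => !c.isPublic && !isCovered(c))
    .map((c) => c.name);
}

// Throws (and so fails the CI job) when any authenticated controller is unreachable.
function assertAllControllersProtected(
  controllers: ControllerInfo[],
  isCovered: (controller: ControllerInfo) => boolean,
): void {
  const missing = findUnprotectedControllers(controllers, isCovered);
  if (missing.length > 0) {
    throw new Error(`controllers outside the auth middleware: ${missing.join(", ")}`);
  }
}
```

With a global middleware the coverage predicate is simply "not public", so the guard passes by construction; with a hand-maintained list it catches exactly the drift described above.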
Track B — Contract & schema spine (the architecture foundation)
This is what makes every later extraction a swap, not a rewrite. Do it pre-launch while it's cheap.
B1 — Carve the shared Prisma schema into per-context schemas 🔵 ⭐ spine
Current: one apps/api/prisma/schema.prisma (~3,074 lines, ~71 models) with cross-domain foreign keys (payroll → employee → org).
Target: per-context Postgres schemas in one instance (Identity, Tenancy, Workforce, Payroll, KYC) — no cross-context FK, no cross-context join; cross-context references by ID.
- Map every model to an owning context (Identity / Tenancy / Workforce / Payroll / KYC / Money). Produce an ownership table (mirror §9 of the blueprint).
- Identify + list every cross-context FK that must become an ID reference.
- Migrate models into per-context Postgres schemas (`iam`, `tenancy`, `workforce`, `payroll`, `kyc`); drop cross-context FKs; replace with by-ID references + app-level validation.
- Update repositories/adapters to resolve cross-context refs via the owning context (port call), not a join.
- Keep RLS `FORCE` per context; re-run A4 tests.

Acceptance: no FK crosses a context boundary (verified by a schema lint); all existing flows pass; RLS tests green per context.
Effort: ~2–3 weeks · Risk: High (the biggest Phase-0 item — do it carefully, one context at a time, expand-contract migrations).
Dependency: do before any Phase-1 "separate database" work.
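
The schema lint named in the B1 acceptance criterion can be sketched as a pure function over the ownership table and the list of foreign keys. Everything here is illustrative: the `Context` names mirror the schemas listed above, but `ForeignKey`, `findCrossContextFks`, and the model names used with it are assumptions, and a real lint would read the models and relations from the Prisma schema.

```typescript
type Context = "iam" | "tenancy" | "workforce" | "payroll" | "kyc";

interface ForeignKey {
  fromModel: string; // model that holds the FK column
  toModel: string;   // model the FK points at
}

// Returns every FK whose two ends are owned by different contexts.
// Throws on a model that has no owner, so the ownership table stays complete.
function findCrossContextFks(
  owners: Map<string, Context>,
  fks: ForeignKey[],
): ForeignKey[] {
  return fks.filter((fk) => {
    const from = owners.get(fk.fromModel);
    const to = owners.get(fk.toModel);
    if (from === undefined || to === undefined) {
      throw new Error(`unmapped model in FK ${fk.fromModel} -> ${fk.toModel}`);
    }
    return from !== to;
  });
}
```

Running this in CI and failing on a non-empty result would make "no FK crosses a context boundary" a checked property rather than a convention.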
B2 — Wire Kafka with real consumers + schema registry 🔵
Current: the outbox → Kafka publisher exists but has zero Kafka consumers (the real cross-context reactions poll Postgres instead); OUTBOX_PUBLISHER_ENABLED defaults to false; events are stringly-typed (payload as any).
Target: transactional outbox → Redpanda → real idempotent consumers; protobuf event schemas in a registry; CI backward-compat gate.
- Stand up a schema registry (Redpanda registry) in compose.
- Define protobuf schemas for the MVP events: `demoz.payroll.run.approved.v1`, `demoz.ledger.entry.posted.v1`, `demoz.disbursement.settled.v1`, `demoz.kyc.approved.v1`.
- Convert the outbox publisher to publish schema-validated events; turn `OUTBOX_PUBLISHER_ENABLED` on by default in dev.
- Move the real cross-context reactions onto Kafka consumers: payroll→repayment (ties to A2), →notifications, →recon-input.
- Make every consumer idempotent (dedup on event ID) + add a per-consumer DLQ topic + alert.
- Delete or quarantine the ~110 phantom event types that have no consumer (keep the audit-only ones explicitly labeled).

Acceptance: a payroll approval produces a registry-validated event consumed by the repayment + notification consumers; replaying the event is a no-op (idempotent); a poisoned message lands in DLQ.
Effort: ~1–1.5 weeks · Risk: Medium
Dependency: A2 (repayment use cases) for the payroll consumer.
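
The consumer behaviour in the B2 acceptance criterion (replay is a no-op, a poisoned message lands in the DLQ) can be sketched without Kafka at all. The class below is an assumption-laden illustration: `EventEnvelope`, `IdempotentConsumer`, and the in-memory `Set` and array are hypothetical stand-ins for a dedup table and a DLQ topic. In a real consumer the dedup record would need to be written in the same database transaction as the side effect.

```typescript
interface EventEnvelope<T> {
  id: string;   // unique event ID used for dedup
  type: string; // e.g. "demoz.payroll.run.approved.v1"
  payload: T;
}

type Outcome = "processed" | "duplicate" | "dead-lettered";

class IdempotentConsumer<T> {
  // Stand-ins: a Set for the processed-events table, an array for the DLQ topic.
  private readonly processedIds = new Set<string>();
  readonly deadLetters: EventEnvelope<T>[] = [];
  private readonly handler: (payload: T) => void;
  private readonly maxAttempts: number;

  constructor(handler: (payload: T) => void, maxAttempts: number) {
    this.handler = handler;
    this.maxAttempts = maxAttempts;
  }

  handle(event: EventEnvelope<T>): Outcome {
    if (this.processedIds.has(event.id)) {
      return "duplicate"; // replayed event: no side effect
    }
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        this.handler(event.payload);
        this.processedIds.add(event.id);
        return "processed";
      } catch {
        // Swallow and retry until the attempts are exhausted.
      }
    }
    // Not marked as processed, so a later redrive from the DLQ can succeed.
    this.deadLetters.push(event);
    return "dead-lettered";
  }
}
```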
B3 — Promote packages/contracts to the integration law 🔵
Current: protos exist for ledger/gateway but the boundary is real only there; no breaking-change gate; payroll has no extraction seam.
Target: contract-first for every MVP boundary; buf breaking-change CI gate; payroll behind a gRPC-shaped contract so the Phase-1 Go swap is invisible.
- Add a `buf` lint + breaking-change check to CI against `packages/contracts/grpc`.
- Define a `PayrollEnginePort` proto (calculate/approve/disburse run) — even though payroll stays NestJS, the contract exists now.
- Add consumer-driven contract tests for ledger + gateway clients.
- Document the proto → `buf generate` → implement workflow in `packages/contracts/README.md`.

Acceptance: a breaking proto change fails CI; payroll is callable through its port contract; contract tests green.
Effort: ~1 week · Risk: Low
B4 — Establish the API Gateway / edge layer 🔵
Current: the NestJS apps/api is the API; edge concerns (auth, idempotency-key minting, rate-limit, API versioning, tenant routing) are scattered across middleware/guards, and the SessionMiddleware list is hand-maintained (see A5).
Target: a single, explicit edge layer owning auth + idempotency + rate-limit + versioning. (A separate gateway deployable / per-audience BFFs are Post-MVP — at MVP this is a clean layer inside the existing app, per blueprint §3.)
- Consolidate edge concerns into one clearly-named module/layer (`app/edge` or a gateway module): global auth (ties to A5), idempotency-key minting (ADR-007), rate-limit, API version prefix (`/api/v1`).
- Document the edge contract so a real gateway/BFF can be lifted out later without touching domain code.
- (Defer) separate BFF deployables per audience → Phase 1/2.

Acceptance: all edge responsibilities live in one place; adding a route doesn't require editing a hand-maintained middleware list.
Effort: ~2–3 days · Risk: Low
Track D — apps/api code structure & bounded-context alignment
Why this exists (the gap a senior caught): apps/api/src has 50 flat folders / 46 modules that don't reflect the architecture. They actually belong to ~6 bounded contexts + infra. The DB carve-out (B1) and the code carve-out (this track) are the same boundary and should be done together — otherwise the schema says "contexts" while the folders still say "50 features."
Current → target folder mapping:
| Bounded context | Today (flat folders) | Target |
|---|---|---|
| Infra / cross-cutting (~18) | app, config, email, sms, health, observability, prisma, idempotency, outbox, dead-letter, resilience, scheduling, shared-infra, uploads, notification-consumers, grpc-auth, assets, types | apps/api/src/_infra/* (or keep flat; clearly separated from domains) |
| Identity / Tenancy (~12) | auth, tenancy, tenant, organization, organization-provisioning, members, roles, platform-staff, merchant, financial-institution, me, identity-verification | apps/api/src/identity/* |
| Workforce (~11) | employee, employee-absence, employee-allowance, employee-deduction, absence-type, allowance, deduction, department, position, employment-type, payment-frequency | apps/api/src/workforce/* |
| Payroll (2) | payroll, payroll-consumers | apps/api/src/payroll/* |
| Money (2) | banking, integration | apps/api/src/money/* |
| Compliance (2) | kyc, sanctions (+ identity-verification) | apps/api/src/compliance/* |
| Products (3) | ewa, lending, equb | apps/api/src/products/* |
"Did it follow the domain-package pattern?" — verified module classification. This is the map of which modules need the most work. Bucket 1 is done right; the alignment effort is promoting Buckets 2 and 3 up to it.
| Bucket | Meaning | Modules | Action |
|---|---|---|---|
| 1 — Proper package + thin API wrapper ✅ | packages/<domain>/backend exists; API module just wires it; Prisma only in infra adapters | payroll, ewa, lending, equb, kyc, sanctions, banking | Keep as the reference shape |
| 2 — Half-carved ⚠️ | tenancy package exists but is only partly adopted: organization / merchant / financial-institution / roles wrap it yet still carry their own Prisma (the ~1,000-LOC triplication); members / platform-staff / organization-provisioning ignore the package entirely | tenancy + organization, merchant, financial-institution, roles, members, platform-staff, organization-provisioning | D3 — finish the carve-out; move logic into the tenancy package; one polymorphic controller |
| 3 — Ad-hoc, no package ❌ | logic + Prisma live directly in apps/api/src; no domain package exists | Workforce: employee + the 10 catalog/child modules (department, position, employment-type, payment-frequency, allowance, deduction, absence-type, employee-allowance/deduction/absence); identity-verification (belongs in compliance/kyc); auth (wiring OK in api, but no domain package — leans on shared-auth + tenancy) | D1/D2 (collapse catalog) + create a Workforce module/package boundary; D5 folds identity-verification into compliance |
| Infra (correctly ad-hoc) | no package needed | email, sms, uploads, outbox, dead-letter, scheduling, idempotency, resilience, integration, me, health, observability | Leave as-is |
The irony to internalize: payroll is a clean package, but `employee` — which payroll depends on — is entirely ad-hoc, and `tenancy` is half-and-half (logic duplicated in both the package and three fat controllers). The money/compliance half respected the architecture; the trust/identity half didn't. B1 (schema) and D1/D3 (code) must move Buckets 2 + 3 to match Bucket 1 on the same context boundaries.
D1 — Reorganize apps/api/src into bounded-context folders 🔵
Pair with B1 — same context boundaries for code and schema.
- Group the 50 folders under context parents (table above); update Nx tags + tsconfig paths + import paths.
- Add an ESLint/Nx boundary rule: no cross-context import except via the context's public entry (mirrors ADR-011 inside `apps/api`).

Acceptance: `apps/api/src` shows ~6 context folders + infra, not 50 flat ones; cross-context imports fail lint.
Effort: ~3–5 days (mostly moves + import fixups) · Risk: Low–Medium (churn; do as one mechanical PR).
D2 — Collapse the 10 catalog CRUD modules into one factory 🔵
Current: absence-type, allowance, deduction, department, position, employment-type, payment-frequency (+ employee-allowance/deduction/absence) are ~identical list/getOne/create/update/archive + audit + outbox modules (~77–94-line controllers each).
- Build a generic `TenantCatalog<T>` module factory (model name + permission tuple + DTO as config).
- Migrate the 10 modules onto it; delete the boilerplate (~2,000 LOC → ~300).

Acceptance: one factory drives all catalog resources; behavior + tests unchanged.
Effort: ~3–5 days · Risk: Medium
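
To make the D2 factory idea concrete, here is a minimal sketch of a catalog factory driven by config. It is not the NestJS module factory itself: `CatalogConfig`, `CatalogRow`, and `createTenantCatalog` are hypothetical names, storage is an in-memory array instead of Prisma, and the audit + outbox writes that the real modules perform are left out.

```typescript
interface CatalogConfig<TDto> {
  model: string;                 // e.g. "department"
  permissions: [string, string]; // [read, write] permission keys (hypothetical shape)
  validate: (dto: TDto) => void; // throws on invalid input
}

interface CatalogRow<TDto> {
  id: string;
  tenantId: string;
  archived: boolean;
  data: TDto;
}

interface TenantCatalog<TDto> {
  list(tenantId: string): CatalogRow<TDto>[];
  create(tenantId: string, dto: TDto): CatalogRow<TDto>;
  archive(tenantId: string, id: string): boolean;
}

// One factory, many catalog resources: only the config differs per resource.
function createTenantCatalog<TDto>(config: CatalogConfig<TDto>): TenantCatalog<TDto> {
  const rows: CatalogRow<TDto>[] = [];
  return {
    list(tenantId) {
      return rows.filter((r) => r.tenantId === tenantId && !r.archived);
    },
    create(tenantId, dto) {
      config.validate(dto);
      const row: CatalogRow<TDto> = {
        id: `${config.model}-${rows.length + 1}`,
        tenantId,
        archived: false,
        data: dto,
      };
      rows.push(row);
      return row;
    },
    archive(tenantId, id) {
      for (const row of rows) {
        // The tenant check keeps one tenant from archiving another tenant's row.
        if (row.tenantId === tenantId && row.id === id && !row.archived) {
          row.archived = true;
          return true;
        }
      }
      return false;
    },
  };
}
```

Each of the ten catalog resources would then be one `createTenantCatalog({...})` call with its own model name, permission pair, and DTO validator.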
D3 — One polymorphic Organization controller; finish the tenancy carve-out 🔵
Current: organization, merchant, financial-institution re-implement the same :id/admins/*, :id/team-members/*, :id/documents/* route trees over the same @demoz-pay/tenancy use-cases (~1,000 LOC triplication, differing only by a document-catalog constant); organization/merchant/platform-staff/financial-institution/members carry domain logic + raw Prisma in fat controllers/services instead of the tenancy package.
- Extract one polymorphic `OrganizationKindController` keyed on `kind` (business/merchant/FI); collapse the triplication.
- Move the remaining org/merchant/FI/platform-staff/members business logic into the `tenancy` package (it's the real owner).
- Remove direct `PrismaService` / `Prisma` injection from controllers.

Acceptance: one controller serves all org kinds; no domain logic or raw Prisma left in those controllers.
Effort: ~1–2 weeks · Risk: Medium–High (largest apps/api refactor) · Pairs with B1 Identity/Tenancy carve-out.
D4 — Dedupe shared controller utilities 🔵
Current: interface AuthenticatedRequest ×22, function mustGetActor ×14, parseSantimMaybeNull ×4 — copy-pasted.
- Move `AuthenticatedRequest` + `mustGetActor` into one shared auth util (`shared-infra` or `@demoz-pay/shared-auth`).
- Move `parseSantimMaybeNull` into `@demoz-pay/shared-money`.
- Replace all copies with the import.

Acceptance: one declaration of each; grep counts drop to 1.
Effort: ~1 day · Risk: Low
D5 — Co-locate / relate overlapping modules 🔵
Investigated 2026-06-19 — NOT a mechanical merge. `identity-verification` (`@Controller('identity')`: `/identity/start`, `/identity/sessions/:id` — a pluggable `IdentityVerifier` mechanism: Fayda-mock + manual-upload adapters producing a `VerifiedIdentity`) and `kyc` (`packages/kyc`: the submission lifecycle submit→review→approve/reject, sanctions-gated) have zero cross-references — they are complementary, not duplicative, and currently disconnected (a latent gap: a KYC submission should be able to consume an identity-verification session). Folding one into the other would lose distinct functionality.
- Design decision first (not a cleanup): should the `kyc` submission flow consume the `IdentityVerifier` mechanism as its verification backend? If yes → make `identity-verification` the verifier adapter behind a kyc port (a feature, needs a runnable env to verify). If no → leave as sibling surfaces but co-locate both under the Compliance/Identity context folder (D1) without merging code.
- Merge `tenant/` (the `TenantContextMiddleware`) into the Identity context next to `tenancy/` — one home for the tenancy concept. (This half is a safe move.)

Acceptance: the kyc↔identity-verification relationship is decided + documented; `tenant/` / `tenancy/` co-located.
Effort: ~1 day (co-locate) / feature-sized (if wiring kyc→verifier) · Risk: Low (co-locate) / Medium (wiring)
D6 — REST consistency pass 🔵
Current: employee + organization use @Put(':id'); the other 14 update routes use @Patch; archive (POST) vs DELETE is split.
- Standardize partial-update on `PATCH` everywhere; standardize soft-remove on `POST :id/archive` (ADR-009 spirit).

Acceptance: verb usage is uniform across controllers.
Effort: ~0.5 day · Risk: Low
Track C — Documentation, ADRs & dead-code cleanup (cheap, anytime)
C1 — Write the new ADRs 🔵
Target: decision records behind the blueprint (referenced in §17).
- ADR-018 Progressive extraction behind stable contracts
- ADR-019 Database-per-service (per-context schemas at MVP)
- ADR-020 Contract-first integration (buf + schema registry)
- ADR-021 Lightweight saga orchestration; Temporal deferred
- ADR-022 Kafka naming / ownership / versioning / DLQ / replay
- ADR-023 API Gateway + BFF; REST at edge only
- ADR-024 Custody model: partner banks hold funds, Ledger records, no internal wallet
- ADR-025 Bank Gateway adapter contract; bank-sandbox is a dev adapter
- Fix `docs/adr/README.md` index (currently stops at ADR-014) + the broken ADR-017 link.

Effort: ~1–2 days · Risk: None
C2 — Fix lying / stale docs 🔵
- `docs/architecture/MONEY_FLOWS.md` Flow 6, `BANK_ORCHESTRATION.md` Path D, `90_DAY_EXECUTION_PLAN.md:289` — remove "payroll does not exist" claims.
- `DOMAIN_KNOWLEDGE_BASE.md §7` — correct the Equb "real escrow" overstatement to "simulated pool, no custody" (regulator-facing risk).
- Correct the ADR-009 / CLAUDE.md "ESLint guard" wording (enforcement is DB triggers, not ESLint).
- Refresh `docs/audits/*` counts (85 migrations / 71 models / 17→25 ADRs / ~27 controllers).
- Fix stale build plans: `EMPLOYEE_MGMT_BUILD_PLAN.md` claims M7/M8/M9 "not started" but they're shipped; `docs/security/AUTH_SYSTEM_REVIEW.md` flags MFA/rate-limit/auth-event-log as missing but they're built. Update both to match code.

Effort: ~1 day · Risk: None
C3 — Delete dead code/dirs ⚪→do-now
- `rm -rf libs/` (empty, contradicts ADR-002).
- Delete `apps/docs-web` (0-LOC Docusaurus stub) or fill it.
- Delete `services/notifications` (28-LOC stub; notifications are in-process TS until Phase 1).
- Delete `scripts/test-pr*.mjs` (19 throwaway scripts).
- Delete the two dead ports (`lending/equb-behavior-signal.port.ts`, `kyc DocumentStoragePort`) and payroll `replay()` (or fix+test it).

Effort: ~0.5 day · Risk: Low
C4 — Consolidate packages/shared (13 → 8) 🔵
Current: 13 shared packages; median ~107 LOC, 5 are 16–67 LOC, each shipping package.json + 3 tsconfigs + own node_modules. One is dead.
- Delete `shared/validation` (403 LOC, 0 consumers — dead husk from the old `libs/schema`; remove its `tsconfig.base.json` path alias too).
- Merge 5 interface-only seams into `@demoz-pay/shared-kernel`: `database` (16) + `audit` (38) + `events` (64) + `idempotency` (133) + `logging` (107). They're contract + test-double only, share the same consumer set (api/ewa/lending), and all cluster around the ADR-008 transaction. (~358 LOC → one coherent kernel.)
- Update the ~150 import sites + `tsconfig.base.json` aliases.
- Keep untouched (real, multi-context): `money`, `auth`, `ui`, `compliance`, `config`, `frontend-auth`, `tenant-context`. Note: `shared/auth` is the RBAC brain (66 consumers) — the merge does not touch it, so auth/employee work is unaffected.

Acceptance: `packages/shared` has 8 packages; `validation` gone; all imports resolve; build green.
Effort: ~2–3 days · Risk: Low–Medium (mechanical import churn)
Track E — Verify the "done" features (test debt — launch quality)
Why this exists: auth and employee management are built (largely as planned, in places ahead of it) but barely tested. For a fintech, "it works" without automated proof of tenant isolation + permission enforcement is a launch risk, not a done feature. (Verified: the PermissionGuard that gates payroll approve/disburse has zero tests; the employee domain has zero spec files.)
E1 — Test the auth / RBAC enforcement core 🔵
Current: only OrgRoleGuard + 2 policies are tested. PermissionGuard (the money-route enforcer), RolesService kind-validation, provisioning, PlatformPermissionGuard, AdminMfaGuard, and the whole multi-org happy path have no tests. (Spans apps/api/src/identity/auth + packages/shared/auth + packages/tenancy.)
- Unit-test `PermissionGuard` + `PlatformPermissionGuard` (every allow/deny branch).
- Unit-test `RolesService.validatePermissionsForKind` + can't-grant-unowned + reserved-name.
- Unit-test `OrganizationProvisioningService` (one audit row per grant; BA-user compensation; ADR-008 tx).
- One integration test for the multi-org flow: provision → magic link → set password → kind routing → invite → accept → MemberRole copy.

Acceptance: every RBAC enforcement path has a passing test; the money-route guard is covered.
Effort: ~3–5 days · Risk: Low (may surface real auth bugs — the point)
E2 — Test the employee domain (esp. tenant isolation) 🔵
Current: zero spec files for employee/department/position/catalog/child resources; bulkImport untested; tenant-isolation untested (and local superuser can't catch RLS regressions by hand).
- Unit-test `employee.service` create/update/archive + `validateTenantFks` + override-requires-reason → 400.
- Test `bulkImport` (per-row FK validation; reject vs silently-skip behavior).
- Cross-tenant isolation test for Employee + Department under the A4 non-superuser role (ties to A4).

Acceptance: employee CRUD + bulkImport + tenant isolation are covered and green under NOSUPERUSER.
Effort: ~3–4 days · Risk: Low
E3 — Close employee-management loose ends 🔵
- Wire the bulk CSV import frontend (`apps/employer-web/.../add/bulk/page.tsx` is a UI-only stub — no POST to `/api/employees/bulk-import`).
- Make `bulkImport` consistent with `create` (handle compliance overrides; stop silently swallowing bad rows via `skipDuplicates`).
- Finish the `salaryType → paymentFrequency` rename (remnants in `types/employee.ts:33`, `EmployeeDetailContent.tsx`, list pages, mock file) — the plan's own done-criterion #10.
- Add cross-tenant FK ownership checks to `employee-absence/allowance/deduction` (parent has `validateTenantFks`; children don't).

Acceptance: bulk import works end-to-end; no `salaryType` remnants; child resources match the parent's tenant defense.
Effort: ~2–3 days · Risk: Low–Medium
E4 — Close the auth plan deviation (optional, low priority) 🟡
- Either ship the planned unified `POST /api/organizations` (any kind) or explicitly retire the goal; retire the duplicate FI `provision-admin`/`admins` onboarding path so there's one way to onboard an org.

Effort: ~2–3 days · Risk: Low
✅ Phase 0 exit gate (all must be true to launch)
- Ledger + Bank Gateway deployed and reachable; boot fails if not (A1).
- Payroll run → ledger postings → bank disbursement → settlement webhook → repayment + notification, end-to-end, in a test (A2+A3+B2).
- RLS proven under a non-superuser role in CI (A4).
- No authenticated controller 401s; CI guard in place (A5).
- No FK crosses a context boundary; per-context schemas live (B1).
- Every MVP event is registry-validated and has a real consumer (B2).
- `buf` breaking-change gate green; payroll behind its port contract (B3).
- Edge concerns live in one layer; no hand-maintained middleware list (B4+A5).
- `apps/api/src` reorganized into ~6 context folders + infra; cross-context import lint in place (D1).
- Catalog CRUD collapsed to a factory; org/merchant/FI triplication gone (D2+D3).
- Shared controller utils deduped; overlapping modules co-located with the kyc↔identity-verification relationship decided; REST verbs uniform (D4+D5+D6).
- `packages/shared` consolidated 13→8; dead `validation` deleted (C4).
- Auth RBAC enforcement core tested — `PermissionGuard` + provisioning + multi-org flow (E1).
- Employee domain tested — CRUD + bulkImport + cross-tenant isolation under NOSUPERUSER (E2).
- Employee loose ends closed — bulk-import frontend wired, `salaryType` rename finished (E3).
- ADR-018…025 written; lying + stale docs fixed (C1+C2).
Pragmatic note: D1–D6 are strongly recommended pre-launch (cheapest now, while there's no live money and the schema is being carved anyway) but are structure/maintainability, not correctness. If launch pressure forces a cut, the order to drop is D6 → D5 → D4 → D2 → D1 → D3 — but never ship without Track A.
PHASE 1 — First customers (~3–6 months)
Extract the highest-pressure boundaries. Each extraction = define/confirm contract → split DB → move code → deploy → delete the in-process path.
P1.1 — Extract Payroll → Go 🟡 (the polyglot proof)
- Confirm the `PayrollEnginePort` proto (from B3) is stable + contract-tested.
- Reimplement the payroll calculation engine in Go behind the same proto.
- Run TS and Go side-by-side on shadow traffic; compare outputs on real runs.
- Cut over; retire the NestJS payroll engine. Consumers unchanged (the proof).

Acceptance: payroll runs served by the Go service with identical results; zero consumer changes.
Risk: Medium (correctness parity — use shadow comparison).
P1.2 — Extract Notifications to a real service 🟡
- Define notification event/contract; build the Go service consuming `*.notify` topics.
- Implement SMS (+ email) providers; own `notif_db`.
- Cut in-process TS notifications over to the service; delete the old path.
P1.3 — Extract Reconciliation 🟡
- Wire the gateway's statement-ingestion path (currently has no caller) to a real source.
- Build the standalone recon job: daily Ledger ↔ bank-statement match; break alerts.

Acceptance: a seeded discrepancy is detected and alerted.
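
The core of the daily Ledger ↔ bank-statement match can be sketched as a keyed comparison. This is an illustration only: `MoneyRecord`, `ReconBreak`, and `reconcile` are hypothetical names, matching on a single shared reference is an assumption, and amounts are integer santim so the comparison is exact.

```typescript
interface MoneyRecord {
  ref: string;          // shared reference between ledger entry and statement line (assumed)
  amountSantim: number; // integer minor units
}

type BreakKind = "missing_in_bank" | "missing_in_ledger" | "amount_mismatch";

interface ReconBreak {
  kind: BreakKind;
  ref: string;
}

// Compares one day's ledger entries with the bank statement and lists every break.
function reconcile(ledger: MoneyRecord[], statement: MoneyRecord[]): ReconBreak[] {
  const breaks: ReconBreak[] = [];
  const bankByRef = new Map<string, number>();
  for (const line of statement) {
    bankByRef.set(line.ref, line.amountSantim);
  }
  for (const entry of ledger) {
    const bankAmount = bankByRef.get(entry.ref);
    if (bankAmount === undefined) {
      breaks.push({ kind: "missing_in_bank", ref: entry.ref });
    } else if (bankAmount !== entry.amountSantim) {
      breaks.push({ kind: "amount_mismatch", ref: entry.ref });
    }
    bankByRef.delete(entry.ref);
  }
  // Whatever is left on the bank side has no ledger counterpart.
  bankByRef.forEach((_amount, ref) => {
    breaks.push({ kind: "missing_in_ledger", ref });
  });
  return breaks;
}
```

A seeded discrepancy, as in the acceptance criterion, is then just a fixture where one statement amount or reference differs from the ledger.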
P1.4 — Add the lightweight Payments-Orchestrator 🟡
- DB-backed saga state machine + outbox + idempotency (no Temporal yet).
- Move the payroll→ledger→disbursement saga + compensation behind it.
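
A lightweight saga of this kind can be sketched as a step list with persisted progress and reverse-order compensation. The names below (`SagaStep`, `SagaState`, `runSaga`) are hypothetical, and the `persist` callback stands in for the database write that would make the state machine DB-backed. Idempotency keys and the outbox are deliberately left out of the sketch.

```typescript
interface SagaStep {
  name: string;
  run: () => void;        // forward action, e.g. post ledger entries
  compensate: () => void; // undo action, e.g. post a reversing entry
}

interface SagaState {
  status: "PENDING" | "COMPLETED" | "COMPENSATED";
  completed: string[]; // names of steps that already succeeded
}

// Runs the remaining steps in order, persisting after each one. On failure it
// compensates the completed steps in reverse order and records the outcome.
function runSaga(
  steps: SagaStep[],
  initial: SagaState,
  persist: (state: SagaState) => void,
): SagaState {
  let state = initial;
  for (const step of steps) {
    if (state.completed.indexOf(step.name) !== -1) {
      continue; // finished before a restart; do not run it twice
    }
    try {
      step.run();
    } catch {
      const done = state.completed;
      steps.slice().reverse().forEach((s) => {
        if (done.indexOf(s.name) !== -1) {
          s.compensate();
        }
      });
      state = { status: "COMPENSATED", completed: [] };
      persist(state);
      return state;
    }
    state = { status: "PENDING", completed: state.completed.concat(step.name) };
    persist(state);
  }
  state = { status: "COMPLETED", completed: state.completed };
  persist(state);
  return state;
}
```

Because progress is persisted after every step, a restarted process can call `runSaga` again with the stored state and resume where it stopped.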
P1.5 — Separate databases for extracted services 🟡
- Promote per-context schemas (B1) to separate databases for Payroll-Go, Ledger, Gateway, Notifications.
P1.6 — Launch flag-gated products as validated 🟡
- Turn on EWA / Lending / Equb per product readiness (code exists; respect the ADR-014 Equb custody gate).
P1.7 — Observability baseline 🟡
- OpenTelemetry tracing across gRPC + Kafka headers; Prometheus + Grafana; SLO alerts (disbursement success, DLQ depth, recon breaks).
✅ Phase 1 exit gate
- Payroll served by Go behind a stable contract, no consumer changes.
- Notifications + Reconciliation are independent services.
- Saga orchestrator owns money flows with compensation + idempotency.
- Extracted services have separate DBs.
- End-to-end traces exist for a payroll run.
PHASE 2 — Growth (as load warrants)
- Move to a small managed Kubernetes; namespaces by domain (`edge`, `identity`, `money`, `compliance`, `products`, `platform`); NetworkPolicy locks the money namespace.
- HPA on CPU + Kafka consumer lag; blue/green for Ledger; per-service CI (Nx affected graph).
- Migrate secrets to Vault (dynamic DB creds); field-level PII encryption.
- Extract Screening/Fraud as load warrants.
- Introduce Temporal only if saga complexity/scale justifies (replace the lightweight orchestrator).
✅ Phase 2 exit gate
- All MVP+Phase-1 services run on K8s with independent deploys + CI.
- Money services have blue/green + rollback proven.
- Secrets in Vault; no env-file secrets.
PHASE 3 — Regional scale (regional demand)
- DR: tested restores, WAL archiving, Kafka RF ≥ 3, documented RPO/RTO, per-failure runbooks.
- Active-passive region; ledger rebuild-from-log drill.
- Add a 2nd/3rd real bank adapter (Dashen / CBE / Telebirr / EthSwitch) behind the Bank Gateway port — bank-sandbox stays the dev adapter.
- Introduce Treasury to manage liquidity across multiple partner banks (its first real justification).
- New products on stable rails: Savings, Merchant/QR, Bill Payments.
- Service mesh (Linkerd) only if mTLS-everywhere / traffic-shifting demands it; mTLS service identity supersedes HMAC.
✅ Phase 3 exit gate
- ≥ 2 real partner banks live behind the same Gateway port.
- DR restore + region failover tested.
- Treasury reconciles float across partners.
PHASE 4 — Millions of users (scale demand)
- Shard Ledger by tenant; partition entries by (tenant, month); read replicas for balances.
- Tiered Kafka storage; CQRS read models for all analytics (never query write DBs).
- Multi-region active-active with tenant-pinned data residency (NBE).
- PCI-scoped isolated cluster if Cards ship.
- Wallet considered as a product only if stored-value ever becomes the operating model (still not a platform dependency).
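One possible reading of "shard Ledger by tenant; partition entries by (tenant, month)" is sketched below. The hashing scheme and the partition naming are assumptions for illustration, not a decided design; the helper names `shard_for_tenant` and `partition_name` are hypothetical.

```python
import hashlib
from datetime import date


def shard_for_tenant(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition_name(tenant_id: str, entry_date: date) -> str:
    """Label the (tenant, month) partition an entry would land in."""
    return f"entries_{tenant_id}_{entry_date.strftime('%Y_%m')}"


print(shard_for_tenant("acme", 8), partition_name("acme", date(2025, 3, 14)))
```

Plain modulo hashing moves most tenants when `shard_count` changes, so a tenant-to-shard lookup table or consistent hashing may be the safer choice if resharding is expected.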
✅ Phase 4 exit gate
- Ledger sustains millions of entries/day with sharding + replicas.
- Multi-region active-active with data residency.
Dependency graph (what blocks what)
A1 (deploy Go tier) ─┐
A2 (payroll consumer)┼─► B2 (Kafka consumers) ─► Phase-0 e2e gate
A3 (webhook RLS) ────┘
A4 (RLS in CI) ──────► B1 (schema carve-out) ─┬─► D1 (code reorg, same boundaries) ─► P1.5 (separate DBs)
└─► D3 (tenancy carve-out / polymorphic org)
A5 (auth gap) ───────► B4 (edge layer) ──► E1 (auth tests cover the now-uniform surface)
A4 (RLS in CI) ──────────────────────────► E2 (employee tenant-isolation test runs under NOSUPERUSER)
B3 (contracts/buf) ──► P1.1 (Payroll→Go)
D2 (catalog factory) ─► (independent)
D4/D5/D6 ────────────► (independent, mechanical)
E1/E2/E3 (test debt) ─► (independent; E2 needs A4's role; E3 mechanical)
C1/C2/C3/C4 ─────────► (independent, anytime; C4 = shared-pkg consolidation)
Critical path to launch: A1 → A2/A3 → B2 → e2e gate, in parallel with B1 (schema carve-out) and A4 (RLS proof). B3 must finish before Phase 1's Payroll→Go. Do D1 with B1 (same context boundaries — one reorg, not two) and D3 with the B1 Identity/Tenancy carve-out.
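The blocking edges in the graph above can also be kept as data and checked mechanically. The sketch below is one way to do that with the standard-library `graphlib` module (Python 3.9+); the `BLOCKED_BY` mapping is transcribed by hand from the diagram, and `landing_order` and `ready_now` are hypothetical helper names.

```python
from graphlib import TopologicalSorter

# Each key lists the workstreams that must land before it.
BLOCKED_BY = {
    "B2": {"A1", "A2", "A3"},
    "Phase-0 e2e gate": {"B2"},
    "B1": {"A4"},
    "D1": {"B1"},
    "D3": {"B1"},
    "P1.5": {"D1"},
    "B4": {"A5"},
    "E1": {"B4"},
    "E2": {"A4"},
    "P1.1": {"B3"},
}


def landing_order(blocked_by):
    """Return one valid order in which every blocker precedes what it blocks."""
    return list(TopologicalSorter(blocked_by).static_order())


def ready_now(blocked_by, done):
    """List workstreams that are not done and whose blockers have all landed."""
    nodes = set(blocked_by) | {dep for deps in blocked_by.values() for dep in deps}
    return sorted(
        node for node in nodes
        if node not in done and blocked_by.get(node, set()) <= done
    )


print(ready_now(BLOCKED_BY, done={"A1", "A4"}))
```

`TopologicalSorter` raises `CycleError` if an edit to the mapping introduces a cycle, which makes it a cheap sanity check whenever the plan's dependencies change.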
Progress tracker
| Workstream | Status | PR(s) | Owner |
|---|---|---|---|
| A1 Deploy Go tier | ◐ ledger+gateway+their DBs+migrate one-shots in root docker-compose.yml; api depends_on + bank-sandbox webhook wired; docker compose config VALID; README updated. Gateway gRPC reachability now surfaces via /readyz + demozpay_dependency_up (new IntegrationGatewayGrpcHealthIndicator, advisory like ledger). REMAINS: runtime boot verification (pnpm docker:up → migrate → curl healthz) | ||
| A2 Payroll consumer | ☐ | ||
| A3 Webhook RLS | ☐ | ||
| A4 RLS in CI | ☐ | ||
| A5 Auth gap | ◐ registered 8 missing controllers + /api/me/* in forRoutes (tsc/lint clean); REMAINS: CI-coverage guard, glob refactor, CourtOrderRemit (gated on A2), runtime smoke-test | ||
| B1 Schema carve-out | ☐ | ||
| B2 Kafka consumers | ☐ | ||
| B3 Contracts/buf | ☐ | ||
| B4 Edge layer | ☐ | ||
| D1 api code reorg by context | ◐ WORKFORCE done: 11 flat folders (employee + 3 child + 7 catalog) → apps/api/src/workforce/ (git mv, history preserved; imports fixed; app+spec tsc green; no new lint). src root 50→40. REMAINS: identity, payroll, money, compliance, products context folders + cross-context import lint | ||
| D2 Catalog CRUD factory | ☐ | ||
| D3 Polymorphic org / tenancy carve-out | ☐ | ||
| D4 Dedupe controller utils | ✅ AuthenticatedRequest+mustGetActor→shared-infra; parseSantimMaybeNull→shared-money. tsc+lint clean | ||
| D5 Merge identity-verification + tenant | ☐ | ||
| D6 REST consistency | ☐ | ||
| E1 Auth RBAC tests | ☐ | ||
| E2 Employee domain tests | ☐ | ||
| E3 Employee loose ends | ☐ | ||
| E4 Auth plan deviation (opt) | ☐ | ||
| C1 ADRs | ✅ ADR-018..025 written; index fixed (now lists 015-025); ADR-017 broken ADR-008 link fixed; all cross-refs resolve | ||
| C2 Doc fixes | ✅ payroll lies (MONEY_FLOWS×3, BANK_ORCHESTRATION); Equb custody overstatement (DOMAIN_KNOWLEDGE_BASE — regulator-facing); ESLint→scripts/ci/adr-guards.sh (CLAUDE.md + ADR-009); 90_DAY stale banner; audits/* counts (ROADMAP+CURRENT_STATE 18→41 ctrls/73→71 models/41→85 migs, DOC_AUDIT 17→25 ADRs); EMPLOYEE_MGMT M7/M8/M9 shipped; AUTH_SYSTEM_REVIEW staleness banner | ||
| C3 Dead-code delete | ◐ DONE: libs/ + apps/docs/ (empty); 2 dead ports removed (lending EqubBehaviorSignal — 3 files + bindings; kyc DocumentStoragePort — port+token+comments); payroll replay() deleted + false "verified" docstring corrected; tsc+lint clean. DEFERRED (decisions): docs-web (~10 refs), services/notifications (Phase-1 target); test-pr*.mjs → convert-then-delete in E | ||
| C4 shared 13→8 consolidation | ☐ | ||
(Status values: ☐ not started · ◐ in progress · ✅ done · ⊘ blocked)
Source of truth for the target state: ../architecture/TARGET_PLATFORM_ARCHITECTURE.md. Update this plan as steps land; keep the blueprint stable.