
Target Architecture Alignment Plan

Purpose: the step-by-step execution plan to move DemozPay from its current code to the Target Platform Architecture blueprint. The blueprint says where we're going and why; this document says how, in what order, and when it's done.

How to use: work top-to-bottom. Each workstream has checkboxes, acceptance criteria, and a phase exit gate that must be green before starting the next phase. Tick boxes as you land PRs; link the PR next to each box.

Related: ARCHITECTURE_ALIGNMENT_PLAN.md · ARCHITECTURE_ALIGNMENT_WORKLOG.md · PAYOUT_ROUTING_PLAN.md

Maturity legend (same as blueprint): 🟢 Today · 🔵 MVP · 🟡 Post-MVP · ⚪ Long-term


Guiding rules for every step

  1. Never break money to chase architecture. Track A (launch blockers) lands before any extraction work.
  2. The contract is permanent; the deployment is not. Define the proto/event schema before moving code.
  3. No step is "done" until deployed + wired + tested. Half-built ≠ done (this is the disease we're curing).
  4. One PR = one reversible step. Money-moving PRs satisfy ADR-005…ADR-013 and get founder review.
  5. Prove RLS under a non-superuser role for anything touching tenant data (local superuser hides every RLS hole).

At-a-glance phase map

| Phase | Goal | Duration | Exit gate |
| --- | --- | --- | --- |
| Phase 0 | Launch MVP: money flows work end-to-end; contract + schema spine in place | ~8 wks | §Phase 0 exit gate |
| Phase 1 | First customers: extract Payroll→Go, Notifications, Recon; lightweight orchestrator | ~3–6 mo | §Phase 1 exit gate |
| Phase 2 | Growth: small K8s, per-service CI, extract Screening/Fraud | as load warrants | §Phase 2 exit gate |
| Phase 3 | Regional scale: DR, 2nd/3rd bank adapter, Treasury | regional demand | §Phase 3 exit gate |
| Phase 4 | Millions of users: shard Ledger, multi-region, CQRS | scale demand | §Phase 4 exit gate |

PHASE 0 — Launch MVP (now → ~8 weeks)

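Five parallel tracks. Track A is the critical path: it can ship independently and must land first. Track B is the architecture spine. Track C is cheap cleanup that can happen anytime. Track D aligns the apps/api code structure with the bounded contexts (pair it with B1). Track E closes the test debt on features that are already built.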

Track A — Launch blockers (correctness; highest priority)

These are verified bugs that prevent money from moving end-to-end. None require the new architecture — they fix what exists.

A1 — Deploy the Go money tier 🔵

Current: services/ledger and services/integration-gateway have Dockerfiles but are not in docker-compose.yml; apps/api is configured with LEDGER_GRPC_ADDR: ledger:50051 and INTEGRATION_GATEWAY_GRPC_ADDR: integration-gateway:50052 (compose lines ~119-120) — dead hosts. Target: both services run in compose with their own databases; the API's gRPC dials resolve.

  • Add ledger service to docker-compose.yml (image build from services/ledger/Dockerfile, LEDGER_DATABASE_URL, port 50051).
  • Add integration-gateway service (GATEWAY_DATABASE_URL, port 50052).
  • Create their databases (separate DB or schema — see B1) + run their raw-SQL migrations on boot.
  • Add a boot-time health/dial check in apps/api that fails loudly if the ledger/gateway gRPC is unreachable (no more silent dead-host).
  • Update README.md Step-3 migrations + Step-4 health check to include both services.

Acceptance: pnpm docker:up brings up ledger + gateway; apps/api logs a successful gRPC handshake to both; a manual ledger PostTransaction round-trips.

Effort: ~1–2 days · Risk: Low

A2 — Enable the payroll→repayment consumer 🔵

Current: PayrollConsumersModule is commented out at apps/api/src/app/app.module.ts:172 with a TODO(boot-fix): RecordEwaRepaymentUseCase / RecordRepaymentUseCase aren't visible to its DI scope. As a result, payroll.deductions_taken.v1 events pile up and no EWA/loan repayment is ever recorded. Target: the consumer is enabled and records repayments.

  • Restructure DI: move EwaModule.register() / LendingModule.register() into the @Global *ApiModule.imports and re-export, then import the @Global ApiModules from PayrollConsumersModule (the fix the TODO itself describes).
  • Uncomment PayrollConsumersModule in app.module.ts.
  • Add an integration test: payroll deduction event → EWA repayment recorded.

Acceptance: approving a payroll run with an active EWA advance records a repayment in the ledger; test green.

Effort: ~2–3 days · Risk: Medium (DI surgery)

A3 — Fix the Bank Gateway webhook RLS bug 🔵

Current: the gateway webhook handler does a cross-tenant read with no app.tenant_id set; under FORCE RLS it returns 0 rows, falls through to "no matching disbursement," and returns 200 without advancing state, so async settlement silently never completes. There is no nonce/replay check (only a 5-min HMAC window). Target: settlement webhooks correctly match + advance the disbursement; replays are rejected.

Status: DIAGNOSED (2026-06-18), not yet applied.

Root cause (confirmed): ListTenantsWithNonTerminal (services/integration-gateway/internal/store/postgres_store.go:199) calls s.pool.Query directly, without setting app.tenant_id, while every other method uses pg.WithTenantTx. Under FORCE RLS (migrations/0001_init.up.sql:200-206, USING tenant_id = current_setting('app.tenant_id', true)) it returns 0 rows → empty tenant list → webhook handler.go "no matching disbursement" → HTTP 200 with no state change. The bug is hidden in dev because the gateway docker role is a Postgres superuser (bypasses RLS); it only manifests under a NOSUPERUSER prod role, so verification requires A4's harness.

Nonce replay: adapters/dashen/signing.go:55-72 validates a 5-min timestamp window + HMAC but never records or checks the nonce, so a captured request is replayable for 5 min.

Recommended fix: (1) a SECURITY DEFINER SQL function gateway_resolve_disbursement(partner, partner_ref) owned by a BYPASSRLS role (or a second pool on a dedicated BYPASSRLS role) for the trusted cross-tenant webhook lookup; (2) a webhook_nonce(partner, nonce, seen_at) UNIQUE table (or Redis SETNX, TTL = window) checked in verifyIncoming.

Blocked here: no Go toolchain in this environment (can't compile/test); do in a Go env, sequenced after A4.

  • Set tenant context (or use a scoped BYPASSRLS role) before the disbursement lookup in the gateway webhook path.
  • Persist + check the X-Demoz-Nonce to reject replays.
  • Add a test: a valid settlement webhook flips the disbursement to SETTLED; a replayed webhook is rejected.

Acceptance: bank-sandbox settlement webhook advances state to SETTLED/FAILED; replay returns 409/ignored; test green.

Effort: ~2–3 days · Risk: Medium (money path)
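
The replay half of this fix can be sketched as a small guard that runs after the HMAC has verified. This is a TypeScript model for illustration, not the gateway's Go code: NonceGuard, its method names, and the in-memory Map are hypothetical stand-ins for the webhook_nonce table (or Redis SETNX) described above, and the 5-minute window is taken from the existing timestamp check.

```typescript
// Hypothetical sketch: in-memory stand-in for webhook_nonce / Redis SETNX.
// Assumes the caller has already verified the HMAC signature.
type ReplayVerdict = "accepted" | "stale" | "replayed";

class NonceGuard {
  // key = JSON of [partner, nonce]; value = time the nonce was first seen (ms)
  private readonly seen = new Map<string, number>();

  constructor(private readonly windowMs: number) {}

  check(partner: string, nonce: string, sentAtMs: number, nowMs: number): ReplayVerdict {
    // Timestamps outside the window are rejected in either direction.
    if (Math.abs(nowMs - sentAtMs) > this.windowMs) return "stale";

    // Keep nonces for twice the window: a timestamp up to one window ahead of
    // our clock stays valid for up to two windows after it was first seen.
    this.seen.forEach((seenAt, key) => {
      if (nowMs - seenAt > 2 * this.windowMs) this.seen.delete(key);
    });

    const key = JSON.stringify([partner, nonce]);
    if (this.seen.has(key)) return "replayed";
    this.seen.set(key, nowMs);
    return "accepted";
  }
}

const guard = new NonceGuard(5 * 60 * 1000);
const firstSeen = 1700000000000;
console.log(guard.check("dashen", "n-1", firstSeen, firstSeen));        // accepted
console.log(guard.check("dashen", "n-1", firstSeen, firstSeen + 1000)); // replayed
```

In a durable store the "check then record" pair has to be a single atomic operation (a UNIQUE insert or SETNX), otherwise two concurrent deliveries of the same nonce could both pass.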

A4 — Prove RLS under a non-superuser role in CI 🔵

Current: local docker myuser is SUPERUSER + BYPASSRLS, so every FORCE RLS policy is silently bypassed; RLS correctness is unverified. Target: a CI job runs the cross-tenant denial tests under a NOSUPERUSER role.

  • Add a CI Postgres role app_rls_test (NOSUPERUSER, no BYPASSRLS).
  • Write/relocate cross-tenant denial tests (read + write) for each financial table behind RLS.
  • Wire into the CI pipeline; fail the build on any cross-tenant leak.

Acceptance: CI proves tenant A cannot read/write tenant B's financial rows under the non-superuser role.

Effort: ~3–4 days · Risk: Medium (may surface real RLS holes; that's the point)
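
To make the reasoning concrete, here is a small in-memory model of how the policy behaves for different roles. It is a hedged illustration of the Postgres semantics only (superusers and BYPASSRLS roles skip row security; an unset app.tenant_id yields NULL, and tenant_id = NULL is never true). The type and function names are hypothetical, and the model does not replace the CI job against a real database.

```typescript
// Hypothetical model of:  USING tenant_id = current_setting('app.tenant_id', true)
interface LedgerRow { tenantId: string; amountSantim: number }

interface DbSession {
  superuser: boolean;         // e.g. the local docker role
  bypassRls: boolean;         // role has the BYPASSRLS attribute
  appTenantId: string | null; // null = app.tenant_id was never set
}

function visibleRows(rows: LedgerRow[], session: DbSession): LedgerRow[] {
  // Superusers and BYPASSRLS roles skip row security entirely.
  if (session.superuser || session.bypassRls) return rows;
  // An unset setting yields NULL, and tenant_id = NULL never matches a row.
  const tenant = session.appTenantId;
  if (tenant === null) return [];
  return rows.filter((r) => r.tenantId === tenant);
}

const sampleRows: LedgerRow[] = [
  { tenantId: "tenant-a", amountSantim: 10000 },
  { tenantId: "tenant-b", amountSantim: 25000 },
];

// Local dev: the superuser sees both tenants, so a leak goes unnoticed.
const devSession: DbSession = { superuser: true, bypassRls: true, appTenantId: "tenant-a" };
// CI role: only tenant A's row is visible, which is what the denial test asserts.
const ciSession: DbSession = { superuser: false, bypassRls: false, appTenantId: "tenant-a" };

console.log(visibleRows(sampleRows, devSession).length); // 2
console.log(visibleRows(sampleRows, ciSession).length);  // 1
```

The third case, a non-superuser session with no tenant set, returns no rows at all, which is the A3 webhook symptom.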

A5 — Fix the SessionMiddleware 401 gap + add a guard 🔵

Current: apps/api/src/app/app.module.ts:219-273 is a hand-maintained .forRoutes(...) class list; ≥9 authenticated controllers are missing (PayrollPlatformAdmin, AutoLockPolicy, CourtOrderRemit, PayrollAudit, PayrollPdf, PayrollSettlement, BankWebhookReplay, and the entire /api/me/* self-service surface — Me + MeEqub) → they 401 on every authenticated request. CLAUDE.md admits only 2 of these; the real gap is larger and includes all of employee-web's self-service (profile, loans, private Equb). Target: auth middleware applies uniformly; drift is impossible.

  • Replace the manual class list with a path-glob / global middleware that skips only @Public() + health/metrics.
  • Add a CI guard (or test) that fails if any non-@Public() controller is unreachable through the middleware chain.
  • Smoke-test the previously-broken routes (/api/me/*, payroll PDF/audit/settlement).

Acceptance: all authenticated controllers resolve req.user; CI guard catches a deliberately-unregistered controller.

Effort: ~1 day · Risk: Low
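
The core of the CI guard is a set difference, sketched below under assumptions: ControllerMeta and findUnregistered are hypothetical names, and how the two input lists are collected (for example by reflecting over module metadata) is left out because it depends on the app's wiring.

```typescript
// Hypothetical sketch of the guard's core check.
interface ControllerMeta {
  name: string;      // controller name, e.g. "PayrollPdf"
  isPublic: boolean; // true when the controller is marked @Public()
}

// Returns every non-public controller that the session middleware does not cover.
function findUnregistered(all: ControllerMeta[], registered: string[]): string[] {
  return all
    .filter((c) => !c.isPublic && registered.indexOf(c.name) === -1)
    .map((c) => c.name)
    .sort();
}

const controllers: ControllerMeta[] = [
  { name: "Payroll", isPublic: false },
  { name: "PayrollPdf", isPublic: false },
  { name: "Me", isPublic: false },
  { name: "Health", isPublic: true },
];

const missing = findUnregistered(controllers, ["Payroll"]);
console.log(missing); // [ 'Me', 'PayrollPdf' ]
// A CI test would fail the build whenever `missing` is non-empty.
```

With a global middleware that skips only public routes, the registered list disappears and the guard reduces to checking that nothing non-public is on the skip list.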

Track B — Contract & schema spine (the architecture foundation)

This is what makes every later extraction a swap, not a rewrite. Do it pre-launch while it's cheap.

B1 — Carve the shared Prisma schema into per-context schemas 🔵 ⭐ spine

Current: one apps/api/prisma/schema.prisma (~3,074 lines, ~71 models) with cross-domain foreign keys (payroll → employee → org). Target: per-context Postgres schemas in one instance (Identity, Tenancy, Workforce, Payroll, KYC) — no cross-context FK, no cross-context join; cross-context references by ID.

  • Map every model to an owning context (Identity / Tenancy / Workforce / Payroll / KYC / Money). Produce an ownership table (mirror §9 of the blueprint).
  • Identify + list every cross-context FK that must become an ID reference.
  • Migrate models into per-context Postgres schemas (iam, tenancy, workforce, payroll, kyc); drop cross-context FKs; replace with by-ID references + app-level validation.
  • Update repositories/adapters to resolve cross-context refs via the owning context (port call), not a join.
  • Keep RLS FORCE per context; re-run A4 tests.

Acceptance: no FK crosses a context boundary (verified by a schema lint); all existing flows pass; RLS tests green per context.

Effort: ~2–3 weeks · Risk: High (the biggest Phase-0 item; do it carefully, one context at a time, with expand-contract migrations).

Dependency: do before any Phase-1 "separate database" work.
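
The schema lint in the acceptance criterion could be as small as the check below. It is a sketch under assumptions: the ForeignKey shape, the function names, and the example table names are hypothetical, and the list of constraints would in practice be read from the database catalog or the Prisma schema.

```typescript
// Hypothetical sketch of a "no cross-context FK" lint.
interface ForeignKey {
  constraint: string;
  fromTable: string; // schema-qualified, e.g. "payroll.payroll_run"
  toTable: string;   // schema-qualified, e.g. "workforce.employee"
}

// The Postgres schema is the part before the first dot; unqualified names are
// treated as "public" in this sketch.
function schemaOf(qualifiedTable: string): string {
  const dot = qualifiedTable.indexOf(".");
  return dot === -1 ? "public" : qualifiedTable.slice(0, dot);
}

// Returns every FK whose two ends live in different context schemas.
function crossContextFks(fks: ForeignKey[]): ForeignKey[] {
  return fks.filter((fk) => schemaOf(fk.fromTable) !== schemaOf(fk.toTable));
}

const violations = crossContextFks([
  { constraint: "run_employee_fk", fromTable: "payroll.payroll_run", toTable: "workforce.employee" },
  { constraint: "line_run_fk", fromTable: "payroll.payroll_line", toTable: "payroll.payroll_run" },
]);

console.log(violations.map((v) => v.constraint)); // [ 'run_employee_fk' ]
```

Only the first constraint is reported, because it links the payroll schema to the workforce schema; that reference would become a plain ID column validated in application code.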

B2 — Wire Kafka with real consumers + schema registry 🔵

Current: outbox → Kafka publisher exists but zero consumers (real consumers poll Postgres); OUTBOX_PUBLISHER_ENABLED defaults false; events are stringly-typed (payload as any). Target: transactional outbox → Redpanda → real idempotent consumers; protobuf event schemas in a registry; CI backward-compat gate.

  • Stand up a schema registry (Redpanda registry) in compose.
  • Define protobuf schemas for the MVP events: demoz.payroll.run.approved.v1, demoz.ledger.entry.posted.v1, demoz.disbursement.settled.v1, demoz.kyc.approved.v1.
  • Convert the outbox publisher to publish schema-validated events; turn OUTBOX_PUBLISHER_ENABLED on by default in dev.
  • Move the real cross-context reactions onto Kafka consumers: payroll→repayment (ties to A2), →notifications, →recon-input.
  • Make every consumer idempotent (dedup on event ID) + add a per-consumer DLQ topic + alert.
  • Delete or quarantine the ~110 phantom event types that have no consumer (keep the audit-only ones explicitly labeled).

Acceptance: a payroll approval produces a registry-validated event consumed by the repayment + notification consumers; replaying the event is a no-op (idempotent); a poisoned message lands in DLQ.

Effort: ~1–1.5 weeks · Risk: Medium

Dependency: A2 (repayment use cases) for the payroll consumer.
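
The idempotency and DLQ requirements can be illustrated with a transport-free sketch. Everything here is an assumption made for illustration: IdempotentConsumer, EventEnvelope, the retry count, and the in-memory Set and array that stand in for a dedup table and a DLQ topic. No Kafka client is involved.

```typescript
// Hypothetical sketch: dedup on event ID plus a dead-letter list.
interface EventEnvelope<T> { id: string; type: string; payload: T }

type Outcome = "processed" | "duplicate" | "dead-lettered";

class IdempotentConsumer<T> {
  private readonly processedIds = new Set<string>();
  readonly deadLetters: { event: EventEnvelope<T>; error: string }[] = [];

  constructor(
    private readonly handler: (event: EventEnvelope<T>) => void,
    private readonly maxAttempts: number,
  ) {}

  consume(event: EventEnvelope<T>): Outcome {
    // Replays of an already-handled event are a no-op.
    if (this.processedIds.has(event.id)) return "duplicate";

    let lastError = "unknown error";
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      try {
        this.handler(event);
        this.processedIds.add(event.id);
        return "processed";
      } catch (err) {
        lastError = err instanceof Error ? err.message : String(err);
      }
    }
    // Retries exhausted: park the message instead of blocking the partition.
    this.deadLetters.push({ event, error: lastError });
    return "dead-lettered";
  }
}

const repayments: string[] = [];
const consumer = new IdempotentConsumer<{ runId: string }>((e) => {
  repayments.push(e.payload.runId);
}, 3);

const approved = { id: "evt-1", type: "demoz.payroll.run.approved.v1", payload: { runId: "run-1" } };
console.log(consumer.consume(approved)); // processed
console.log(consumer.consume(approved)); // duplicate
```

In a real consumer the dedup record and the handler's side effect should commit in the same database transaction, so a crash between the two cannot produce a double repayment.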

B3 — Promote packages/contracts to the integration law 🔵

Current: protos exist for ledger/gateway but the boundary is real only there; no breaking-change gate; payroll has no extraction seam. Target: contract-first for every MVP boundary; buf breaking-change CI gate; payroll behind a gRPC-shaped contract so the Phase-1 Go swap is invisible.

  • Add a buf lint + breaking-change check to CI against packages/contracts/grpc.
  • Define a PayrollEnginePort proto (calculate/approve/disburse run) — even though payroll stays NestJS, the contract exists now.
  • Add consumer-driven contract tests for ledger + gateway clients.
  • Document the proto→buf generate→implement workflow in packages/contracts/README.md.

Acceptance: a breaking proto change fails CI; payroll is callable through its port contract; contract tests green.

Effort: ~1 week · Risk: Low

B4 — Establish the API Gateway / edge layer 🔵

Current: the NestJS apps/api is the API; edge concerns (auth, idempotency-key minting, rate-limit, API versioning, tenant routing) are scattered across middleware/guards, and the SessionMiddleware list is hand-maintained (see A5). Target: a single, explicit edge layer owning auth + idempotency + rate-limit + versioning. (A separate gateway deployable / per-audience BFFs are Post-MVP — at MVP this is a clean layer inside the existing app, per blueprint §3.)

  • Consolidate edge concerns into one clearly-named module/layer (app/edge or a gateway module): global auth (ties to A5), idempotency-key minting (ADR-007), rate-limit, API version prefix (/api/v1).
  • Document the edge contract so a real gateway/BFF can be lifted out later without touching domain code.
  • (Defer) separate BFF deployables per audience → Phase 1/2.

Acceptance: all edge responsibilities live in one place; adding a route doesn't require editing a hand-maintained middleware list.

Effort: ~2–3 days · Risk: Low

Track D — apps/api code structure & bounded-context alignment

Why this exists (the gap a senior caught): apps/api/src has 50 flat folders / 46 modules that don't reflect the architecture. They actually belong to ~6 bounded contexts + infra. The DB carve-out (B1) and the code carve-out (this track) are the same boundary and should be done together — otherwise the schema says "contexts" while the folders still say "50 features."

Current → target folder mapping:

| Bounded context | Today (flat folders) | Target |
| --- | --- | --- |
| Infra / cross-cutting (~18) | app, config, email, sms, health, observability, prisma, idempotency, outbox, dead-letter, resilience, scheduling, shared-infra, uploads, notification-consumers, grpc-auth, assets, types | apps/api/src/_infra/* (or keep flat; clearly separated from domains) |
| Identity / Tenancy (~12) | auth, tenancy, tenant, organization, organization-provisioning, members, roles, platform-staff, merchant, financial-institution, me, identity-verification | apps/api/src/identity/* |
| Workforce (~11) | employee, employee-absence, employee-allowance, employee-deduction, absence-type, allowance, deduction, department, position, employment-type, payment-frequency | apps/api/src/workforce/* |
| Payroll (2) | payroll, payroll-consumers | apps/api/src/payroll/* |
| Money (2) | banking, integration | apps/api/src/money/* |
| Compliance (2) | kyc, sanctions (+ identity-verification) | apps/api/src/compliance/* |
| Products (3) | ewa, lending, equb | apps/api/src/products/* |

"Did it follow the domain-package pattern?" — verified module classification. This is the map of which modules need the most work. Bucket 1 is done right; the alignment effort is promoting Buckets 2 and 3 up to it.

| Bucket | Meaning | Modules | Action |
| --- | --- | --- | --- |
| 1: Proper package + thin API wrapper | packages/<domain>/backend exists; API module just wires it; Prisma only in infra adapters | payroll, ewa, lending, equb, kyc, sanctions, banking | Keep as the reference shape |
| 2: Half-carved ⚠️ | tenancy package exists but is only partly adopted: organization / merchant / financial-institution / roles wrap it yet still carry their own Prisma (the ~1,000-LOC triplication); members / platform-staff / organization-provisioning ignore the package entirely | tenancy + organization, merchant, financial-institution, roles, members, platform-staff, organization-provisioning | D3: finish the carve-out; move logic into the tenancy package; one polymorphic controller |
| 3: Ad-hoc, no package | logic + Prisma live directly in apps/api/src; no domain package exists | Workforce: employee + the 10 catalog/child modules (department, position, employment-type, payment-frequency, allowance, deduction, absence-type, employee-allowance/deduction/absence); identity-verification (belongs in compliance/kyc); auth (wiring OK in api, but no domain package; leans on shared-auth + tenancy) | D1/D2 (collapse catalog) + create a Workforce module/package boundary; D5 folds identity-verification into compliance |
| Infra (correctly ad-hoc) | no package needed | email, sms, uploads, outbox, dead-letter, scheduling, idempotency, resilience, integration, me, health, observability | Leave as-is |

The irony to internalize: payroll is a clean package, but employee — which payroll depends on — is entirely ad-hoc, and tenancy is half-and-half (logic duplicated in both the package and three fat controllers). The money/compliance half respected the architecture; the trust/identity half didn't. B1 (schema) and D1/D3 (code) must move Buckets 2 + 3 to match Bucket 1 on the same context boundaries.

D1 — Reorganize apps/api/src into bounded-context folders 🔵

Pair with B1 — same context boundaries for code and schema.

  • Group the 50 folders under context parents (table above); update Nx tags + tsconfig paths + import paths.
  • Add an ESLint/Nx boundary rule: no cross-context import except via the context's public entry (mirrors ADR-011 inside apps/api).

Acceptance: apps/api/src shows ~6 context folders + infra, not 50 flat ones; cross-context imports fail lint.

Effort: ~3–5 days (mostly moves + import fixups) · Risk: Low–Medium (churn; do as one mechanical PR).

D2 — Collapse the 10 catalog CRUD modules into one factory 🔵

Current: absence-type, allowance, deduction, department, position, employment-type, payment-frequency (+ employee-allowance/deduction/absence) are ~identical list/getOne/create/update/archive + audit + outbox modules (~77–94-line controllers each).

  • Build a generic TenantCatalog<T> module factory (model name + permission tuple + DTO as config).
  • Migrate the 10 modules onto it; delete the boilerplate (~2,000 LOC → ~300).

Acceptance: one factory drives all catalog resources; behavior + tests unchanged.

Effort: ~3–5 days · Risk: Medium
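
The factory idea can be sketched without NestJS or Prisma. This is an illustrative model only: createTenantCatalog, the Actor shape, and the permission strings are hypothetical, storage is an in-memory array, and the audit + outbox writes that the real modules perform are omitted.

```typescript
// Hypothetical sketch: one factory, configured per catalog resource.
interface Actor { tenantId: string; permissions: string[] }

interface CatalogConfig {
  model: string; // e.g. "department"
  permissions: { read: string; write: string };
}

interface CatalogRecord<T> { id: string; tenantId: string; archived: boolean; data: T }

function createTenantCatalog<T>(config: CatalogConfig) {
  const records: CatalogRecord<T>[] = [];
  let nextId = 1;

  function assertCan(actor: Actor, permission: string): void {
    if (actor.permissions.indexOf(permission) === -1) {
      throw new Error("forbidden: " + permission);
    }
  }

  return {
    create(actor: Actor, data: T): CatalogRecord<T> {
      assertCan(actor, config.permissions.write);
      const record = { id: config.model + "-" + nextId++, tenantId: actor.tenantId, archived: false, data };
      records.push(record);
      return record;
    },
    // Lists only the caller's own tenant and hides archived rows.
    list(actor: Actor): CatalogRecord<T>[] {
      assertCan(actor, config.permissions.read);
      return records.filter((r) => r.tenantId === actor.tenantId && !r.archived);
    },
    archive(actor: Actor, id: string): void {
      assertCan(actor, config.permissions.write);
      const found = records.filter((r) => r.id === id && r.tenantId === actor.tenantId)[0];
      if (!found) throw new Error("not found: " + id);
      found.archived = true;
    },
  };
}

// Two resources, one implementation; only the config and DTO type differ.
const departments = createTenantCatalog<{ name: string }>({
  model: "department",
  permissions: { read: "department.read", write: "department.write" },
});
const positions = createTenantCatalog<{ title: string }>({
  model: "position",
  permissions: { read: "position.read", write: "position.write" },
});
```

Resources that need extra behavior (for example the employee-scoped child modules) would presumably pass hooks through the config rather than fork the factory.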

D3 — One polymorphic Organization controller; finish the tenancy carve-out 🔵

Current: organization, merchant, financial-institution re-implement the same :id/admins/*, :id/team-members/*, :id/documents/* route trees over the same @demoz-pay/tenancy use-cases (~1,000 LOC triplication, differing only by a document-catalog constant); organization/merchant/platform-staff/financial-institution/members carry domain logic + raw Prisma in fat controllers/services instead of the tenancy package.

  • Extract one polymorphic OrganizationKindController keyed on kind (business/merchant/FI); collapse the triplication.
  • Move the remaining org/merchant/FI/platform-staff/members business logic into the tenancy package (it's the real owner).
  • Remove direct PrismaService/Prisma injection from controllers.

Acceptance: one controller serves all org kinds; no domain logic or raw Prisma left in those controllers.

Effort: ~1–2 weeks · Risk: Medium–High (largest apps/api refactor) · Pairs with B1 Identity/Tenancy carve-out.

D4 — Dedupe shared controller utilities 🔵

Current: interface AuthenticatedRequest ×22, function mustGetActor ×14, parseSantimMaybeNull ×4 — copy-pasted.

  • Move AuthenticatedRequest + mustGetActor into one shared auth util (shared-infra or @demoz-pay/shared-auth).
  • Move parseSantimMaybeNull into @demoz-pay/shared-money.
  • Replace all copies with the import.

Acceptance: one declaration of each; grep counts drop to 1.

Effort: ~1 day · Risk: Low

D5 — Co-locate / relate overlapping modules 🔵

Investigated 2026-06-19 — NOT a mechanical merge. identity-verification (@Controller('identity'): /identity/start, /identity/sessions/:id — a pluggable IdentityVerifier mechanism: Fayda-mock + manual-upload adapters producing a VerifiedIdentity) and kyc (packages/kyc: the submission lifecycle submit→review→approve/reject, sanctions-gated) have zero cross-references — they are complementary, not duplicative, and currently disconnected (a latent gap: a KYC submission should be able to consume an identity-verification session). Folding one into the other would lose distinct functionality.

  • Design decision first (not a cleanup): should the kyc submission flow consume the IdentityVerifier mechanism as its verification backend? If yes → make identity-verification the verifier adapter behind a kyc port (a feature, needs a runnable env to verify). If no → leave as sibling surfaces but co-locate both under the Compliance/Identity context folder (D1) without merging code.
  • Merge tenant/ (the TenantContextMiddleware) into the Identity context next to tenancy/, so the tenancy concept has one home. (This half is a safe move.)

Acceptance: the kyc↔identity-verification relationship is decided + documented; tenant/ and tenancy/ are co-located.

Effort: ~1 day (co-locate) / feature-sized (if wiring kyc→verifier) · Risk: Low (co-locate) / Medium (wiring)

D6 — REST consistency pass 🔵

Current: employee + organization use @Put(':id'); the other 14 update routes use @Patch; archive (POST) vs DELETE is split.

  • Standardize partial-update on PATCH everywhere; standardize soft-remove on POST :id/archive (ADR-009 spirit).

Acceptance: verb usage is uniform across controllers.

Effort: ~0.5 day · Risk: Low

Track C — Documentation, ADRs & dead-code cleanup (cheap, anytime)

C1 — Write the new ADRs 🔵

Target: decision records behind the blueprint (referenced in §17).

  • ADR-018 Progressive extraction behind stable contracts
  • ADR-019 Database-per-service (per-context schemas at MVP)
  • ADR-020 Contract-first integration (buf + schema registry)
  • ADR-021 Lightweight saga orchestration; Temporal deferred
  • ADR-022 Kafka naming / ownership / versioning / DLQ / replay
  • ADR-023 API Gateway + BFF; REST at edge only
  • ADR-024 Custody model: partner banks hold funds, Ledger records, no internal wallet
  • ADR-025 Bank Gateway adapter contract; bank-sandbox is a dev adapter
  • Fix docs/adr/README.md index (currently stops at ADR-014) + the broken ADR-017 link.

Effort: ~1–2 days · Risk: None

C2 — Fix lying / stale docs 🔵

  • docs/architecture/MONEY_FLOWS.md Flow 6, BANK_ORCHESTRATION.md Path D, 90_DAY_EXECUTION_PLAN.md:289 — remove "payroll does not exist" claims.
  • DOMAIN_KNOWLEDGE_BASE.md §7 — correct the Equb "real escrow" overstatement to "simulated pool, no custody" (regulator-facing risk).
  • Correct the ADR-009 / CLAUDE.md "ESLint guard" wording (enforcement is DB triggers, not ESLint).
  • Refresh docs/audits/* counts (85 migrations / 71 models / 17→25 ADRs / ~27 controllers).
  • Fix stale build plans: EMPLOYEE_MGMT_BUILD_PLAN.md claims M7/M8/M9 "not started" but they're shipped; docs/security/AUTH_SYSTEM_REVIEW.md flags MFA/rate-limit/auth-event-log as missing but they're built. Update both to match code.

Effort: ~1 day · Risk: None

C3 — Delete dead code/dirs ⚪→do-now

  • rm -rf libs/ (empty, contradicts ADR-002).
  • Delete apps/docs-web (0-LOC Docusaurus stub) or fill it.
  • Delete services/notifications (28-LOC stub; notifications are in-process TS until Phase 1).
  • Delete scripts/test-pr*.mjs (19 throwaway scripts).
  • Delete the two dead ports (lending/equb-behavior-signal.port.ts, kyc DocumentStoragePort) and payroll replay() (or fix+test it).

Effort: ~0.5 day · Risk: Low

C4 — Consolidate packages/shared (13 → 8) 🔵

Current: 13 shared packages; median ~107 LOC, 5 are 16–67 LOC, each shipping package.json + 3 tsconfigs + own node_modules. One is dead.

  • Delete shared/validation (403 LOC, 0 consumers — dead husk from the old libs/schema; remove its tsconfig.base.json path alias too).
  • Merge 5 interface-only seams into @demoz-pay/shared-kernel: database (16) + audit (38) + events (64) + idempotency (133) + logging (107). They're contract + test-double only, share the same consumer set (api/ewa/lending), and all cluster around the ADR-008 transaction. (~358 LOC → one coherent kernel.)
  • Update the ~150 import sites + tsconfig.base.json aliases.
  • Keep untouched (real, multi-context): money, auth, ui, compliance, config, frontend-auth, tenant-context. Note: shared/auth is the RBAC brain (66 consumers); the merge does not touch it, so auth/employee work is unaffected.

Acceptance: packages/shared has 8 packages; validation gone; all imports resolve; build green.

Effort: ~2–3 days · Risk: Low–Medium (mechanical import churn)

Track E — Verify the "done" features (test debt — launch quality)

Why this exists: auth and employee management are built (largely as planned, in places ahead of it) but barely tested. For a fintech, "it works" without automated proof of tenant isolation + permission enforcement is a launch risk, not a done feature. (Verified: the PermissionGuard that gates payroll approve/disburse has zero tests; the employee domain has zero spec files.)

E1 — Test the auth / RBAC enforcement core 🔵

Current: only OrgRoleGuard + 2 policies are tested. PermissionGuard (the money-route enforcer), RolesService kind-validation, provisioning, PlatformPermissionGuard, AdminMfaGuard, and the whole multi-org happy path have no tests. (Spans apps/api/src/identity/auth + packages/shared/auth + packages/tenancy.)

  • Unit-test PermissionGuard + PlatformPermissionGuard (every allow/deny branch).
  • Unit-test RolesService.validatePermissionsForKind + can't-grant-unowned + reserved-name.
  • Unit-test OrganizationProvisioningService (one audit row per grant; BA-user compensation; ADR-008 tx).
  • One integration test for the multi-org flow: provision → magic link → set password → kind routing → invite → accept → MemberRole copy.

Acceptance: every RBAC enforcement path has a passing test; the money-route guard is covered.

Effort: ~3–5 days · Risk: Low (may surface real auth bugs, which is the point)

E2 — Test the employee domain (esp. tenant isolation) 🔵

Current: zero spec files for employee/department/position/catalog/child resources; bulkImport untested; tenant-isolation untested (and local superuser can't catch RLS regressions by hand).

  • Unit-test employee.service create/update/archive + validateTenantFks + override-requires-reason → 400.
  • Test bulkImport (per-row FK validation; reject vs silently-skip behavior).
  • Cross-tenant isolation test for Employee + Department under the A4 non-superuser role (ties to A4).

Acceptance: employee CRUD + bulkImport + tenant isolation are covered and green under NOSUPERUSER.

Effort: ~3–4 days · Risk: Low

E3 — Close employee-management loose ends 🔵

  • Wire the bulk CSV import frontend (apps/employer-web/.../add/bulk/page.tsx is a UI-only stub — no POST to /api/employees/bulk-import).
  • Make bulkImport consistent with create (handle compliance overrides; stop silently swallowing bad rows via skipDuplicates).
  • Finish the salaryType → paymentFrequency rename (remnants in types/employee.ts:33, EmployeeDetailContent.tsx, list pages, mock file) — the plan's own done-criterion #10.
  • Add cross-tenant FK ownership checks to employee-absence/allowance/deduction (parent has validateTenantFks; children don't).

Acceptance: bulk import works end-to-end; no salaryType remnants; child resources match the parent's tenant defense.

Effort: ~2–3 days · Risk: Low–Medium

E4 — Close the auth plan deviation (optional, low priority) 🟡

  • Either ship the planned unified POST /api/organizations (any kind) or explicitly retire the goal; retire the duplicate FI provision-admin/admins onboarding path so there's one way to onboard an org.

Effort: ~2–3 days · Risk: Low

✅ Phase 0 exit gate (all must be true to launch)

  • Ledger + Bank Gateway deployed and reachable; boot fails if not (A1).
  • Payroll run → ledger postings → bank disbursement → settlement webhook → repayment + notification, end-to-end, in a test (A2+A3+B2).
  • RLS proven under a non-superuser role in CI (A4).
  • No authenticated controller 401s; CI guard in place (A5).
  • No FK crosses a context boundary; per-context schemas live (B1).
  • Every MVP event is registry-validated and has a real consumer (B2).
  • buf breaking-change gate green; payroll behind its port contract (B3).
  • Edge concerns live in one layer; no hand-maintained middleware list (B4+A5).
  • apps/api/src reorganized into ~6 context folders + infra; cross-context import lint in place (D1).
  • Catalog CRUD collapsed to a factory; org/merchant/FI triplication gone (D2+D3).
  • Shared controller utils deduped; kyc↔identity-verification relationship decided and overlapping modules co-located; REST verbs uniform (D4+D5+D6).
  • packages/shared consolidated 13→8; dead validation deleted (C4).
  • Auth RBAC enforcement core tested: PermissionGuard + provisioning + multi-org flow (E1).
  • Employee domain tested — CRUD + bulkImport + cross-tenant isolation under NOSUPERUSER (E2).
  • Employee loose ends closed — bulk-import frontend wired, salaryType rename finished (E3).
  • ADR-018…025 written; lying + stale docs fixed (C1+C2).

Pragmatic note: D1–D6 are strongly recommended pre-launch (cheapest now, while there's no live money and the schema is being carved anyway) but are structure/maintainability, not correctness. If launch pressure forces a cut, the order to drop is D6 → D5 → D4 → D2 → D1 → D3 — but never ship without Track A.


PHASE 1 — First customers (~3–6 months)

Extract the highest-pressure boundaries. Each extraction = define/confirm contract → split DB → move code → deploy → delete the in-process path.

P1.1 — Extract Payroll → Go 🟡 (the polyglot proof)

  • Confirm the PayrollEnginePort proto (from B3) is stable + contract-tested.
  • Reimplement the payroll calculation engine in Go behind the same proto.
  • Run TS and Go side-by-side on shadow traffic; compare outputs on real runs.
  • Cut over; retire the NestJS payroll engine. Consumers unchanged (the proof).

Acceptance: payroll runs served by the Go service with identical results; zero consumer changes.

Risk: Medium (correctness parity; use shadow comparison).

P1.2 — Extract Notifications to a real service 🟡

  • Define notification event/contract; build the Go service consuming *.notify topics.
  • Implement SMS (+ email) providers; own notif_db.
  • Cut in-process TS notifications over to the service; delete the old path.

P1.3 — Extract Reconciliation 🟡

  • Wire the gateway's statement-ingestion path (currently has no caller) to a real source.
  • Build the standalone recon job: daily Ledger ↔ bank-statement match; break alerts.

Acceptance: a seeded discrepancy is detected and alerted.
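
The matching step of the recon job can be sketched as a pure function. The shapes below are assumptions for illustration (MoneyLine, ReconBreak, and the three break kinds are hypothetical names); amounts are integer santim, and each reference is assumed to appear at most once per side.

```typescript
// Hypothetical sketch of the Ledger ↔ bank-statement match.
interface MoneyLine { reference: string; amountSantim: number }

type BreakKind = "missing_in_bank" | "missing_in_ledger" | "amount_mismatch";
interface ReconBreak { kind: BreakKind; reference: string }

function reconcile(ledger: MoneyLine[], statement: MoneyLine[]): ReconBreak[] {
  const bankByRef = new Map<string, number>();
  statement.forEach((line) => { bankByRef.set(line.reference, line.amountSantim); });

  const breaks: ReconBreak[] = [];
  const ledgerRefs = new Set<string>();

  // Pass 1: every ledger entry must exist at the bank with the same amount.
  ledger.forEach((entry) => {
    ledgerRefs.add(entry.reference);
    const bankAmount = bankByRef.get(entry.reference);
    if (bankAmount === undefined) {
      breaks.push({ kind: "missing_in_bank", reference: entry.reference });
    } else if (bankAmount !== entry.amountSantim) {
      breaks.push({ kind: "amount_mismatch", reference: entry.reference });
    }
  });

  // Pass 2: anything the bank reports that the ledger never recorded.
  statement.forEach((line) => {
    if (!ledgerRefs.has(line.reference)) {
      breaks.push({ kind: "missing_in_ledger", reference: line.reference });
    }
  });

  return breaks;
}

// Seeded discrepancy: d-2 differs by 1 santim, d-3 never reached the bank.
const breaks = reconcile(
  [
    { reference: "d-1", amountSantim: 50000 },
    { reference: "d-2", amountSantim: 75000 },
    { reference: "d-3", amountSantim: 12000 },
  ],
  [
    { reference: "d-1", amountSantim: 50000 },
    { reference: "d-2", amountSantim: 74999 },
  ],
);
console.log(breaks.length); // 2
```

An empty result means the day is clean; any non-empty result is what the break alert would fire on.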

P1.4 — Add the lightweight Payments-Orchestrator 🟡

  • DB-backed saga state machine + outbox + idempotency (no Temporal yet).
  • Move the payroll→ledger→disbursement saga + compensation behind it.
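
As a rough picture of what "lightweight" means here, the sketch below runs steps in order, records each completed step, and compensates in reverse on failure. All names are hypothetical, the SagaRecord object stands in for a persisted row, and the outbox and idempotency-key handling from the first bullet are left out.

```typescript
// Hypothetical sketch of a DB-backed saga, with the DB row modelled in memory.
type SagaStatus = "RUNNING" | "COMPLETED" | "COMPENSATED";

interface SagaStep { name: string; run: () => void; compensate: () => void }

interface SagaRecord {
  id: string;
  status: SagaStatus;
  completedSteps: string[]; // persisted after each step in a real store
  failure: string | null;
}

function runSaga(record: SagaRecord, steps: SagaStep[]): SagaRecord {
  for (const step of steps) {
    // Resuming after a crash: skip steps already recorded as done.
    if (record.completedSteps.indexOf(step.name) !== -1) continue;
    try {
      step.run();
      record.completedSteps.push(step.name);
    } catch (err) {
      record.failure = step.name + ": " + (err instanceof Error ? err.message : String(err));
      // Undo completed steps, newest first.
      steps.slice().reverse().forEach((done) => {
        if (record.completedSteps.indexOf(done.name) !== -1) done.compensate();
      });
      record.completedSteps = [];
      record.status = "COMPENSATED";
      return record;
    }
  }
  record.status = "COMPLETED";
  return record;
}

const trace: string[] = [];
const result = runSaga(
  { id: "saga-1", status: "RUNNING", completedSteps: [], failure: null },
  [
    { name: "post-ledger", run: () => { trace.push("ledger posted"); }, compensate: () => { trace.push("ledger reversed"); } },
    { name: "disburse", run: () => { throw new Error("bank unavailable"); }, compensate: () => { trace.push("disbursement cancelled"); } },
  ],
);
console.log(result.status, trace); // COMPENSATED [ 'ledger posted', 'ledger reversed' ]
```

The failed step itself is not compensated because it never completed; whether a step that failed midway needs its own cleanup is a design decision this sketch leaves open.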

P1.5 — Separate databases for extracted services 🟡

  • Promote per-context schemas (B1) to separate databases for Payroll-Go, Ledger, Gateway, Notifications.

P1.6 — Launch flag-gated products as validated 🟡

  • Turn on EWA / Lending / Equb per product readiness (code exists; respect the ADR-014 Equb custody gate).

P1.7 — Observability baseline 🟡

  • OpenTelemetry tracing across gRPC + Kafka headers; Prometheus + Grafana; SLO alerts (disbursement success, DLQ depth, recon breaks).

✅ Phase 1 exit gate

  • Payroll served by Go behind a stable contract, no consumer changes.
  • Notifications + Reconciliation are independent services.
  • Saga orchestrator owns money flows with compensation + idempotency.
  • Extracted services have separate DBs.
  • End-to-end traces exist for a payroll run.

PHASE 2 — Growth (as load warrants)

  • Move to a small managed Kubernetes; namespaces by domain (edge, identity, money, compliance, products, platform); NetworkPolicy locks the money namespace.
  • HPA on CPU + Kafka consumer lag; blue/green for Ledger; per-service CI (Nx affected graph).
  • Migrate secrets to Vault (dynamic DB creds); field-level PII encryption.
  • Extract Screening/Fraud as load warrants.
  • Introduce Temporal only if saga complexity/scale justifies (replace the lightweight orchestrator).

✅ Phase 2 exit gate

  • All MVP+Phase-1 services run on K8s with independent deploys + CI.
  • Money services have blue/green + rollback proven.
  • Secrets in Vault; no env-file secrets.

PHASE 3 — Regional scale (regional demand)

  • DR: tested restores, WAL archiving, Kafka RF ≥ 3, documented RPO/RTO, per-failure runbooks.
  • Active-passive region; ledger rebuild-from-log drill.
  • Add a 2nd/3rd real bank adapter (Dashen / CBE / Telebirr / EthSwitch) behind the Bank Gateway port — bank-sandbox stays the dev adapter.
  • Introduce Treasury to manage liquidity across multiple partner banks (its first real justification).
  • New products on stable rails: Savings, Merchant/QR, Bill Payments.
  • Service mesh (Linkerd) only if mTLS-everywhere / traffic-shifting demands it; mTLS service identity supersedes HMAC.
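The "behind the Bank Gateway port" idea can be sketched as a small port interface with one adapter per bank and a registry that resolves them by code. Everything below is a hypothetical shape for illustration: the interface names, the `bankRef` format, and the in-memory idempotency map are assumptions, and the real contract would be whatever the Gateway's proto definitions specify.

```typescript
// Hypothetical port shape; not the actual Bank Gateway contract.
interface DisbursementRequest {
  idempotencyKey: string;
  accountNumber: string;
  amountSantim: number; // integer minor units
}

interface DisbursementReceipt {
  bankCode: string;
  bankRef: string;
}

interface BankGatewayPort {
  readonly bankCode: string;
  disburse(req: DisbursementRequest): Promise<DisbursementReceipt>;
}

// Dev adapter: no real bank call, but it honours idempotency keys.
class SandboxBankAdapter implements BankGatewayPort {
  readonly bankCode = "bank-sandbox";
  private readonly receipts = new Map<string, DisbursementReceipt>();

  async disburse(req: DisbursementRequest): Promise<DisbursementReceipt> {
    if (!Number.isInteger(req.amountSantim) || req.amountSantim <= 0) {
      throw new Error("amountSantim must be a positive integer");
    }
    const existing = this.receipts.get(req.idempotencyKey);
    if (existing) return existing; // same key, same receipt, no second payout
    const receipt: DisbursementReceipt = {
      bankCode: this.bankCode,
      bankRef: `SBX-${this.receipts.size + 1}`,
    };
    this.receipts.set(req.idempotencyKey, receipt);
    return receipt;
  }
}

// Callers depend on the port; adding a bank means registering one more adapter.
class BankGatewayRegistry {
  private readonly adapters = new Map<string, BankGatewayPort>();

  register(adapter: BankGatewayPort): void {
    if (this.adapters.has(adapter.bankCode)) {
      throw new Error(`adapter already registered: ${adapter.bankCode}`);
    }
    this.adapters.set(adapter.bankCode, adapter);
  }

  resolve(bankCode: string): BankGatewayPort {
    const adapter = this.adapters.get(bankCode);
    if (!adapter) throw new Error(`no adapter for bank: ${bankCode}`);
    return adapter;
  }
}
```

Under this shape, a second or third real bank would be another class implementing `BankGatewayPort`, while the sandbox adapter stays registered for development.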

✅ Phase 3 exit gate

  • ≥ 2 real partner banks live behind the same Gateway port.
  • DR restore + region failover tested.
  • Treasury reconciles float across partners.

PHASE 4 — Millions of users (scale demand)

  • Shard Ledger by tenant; partition entries by (tenant, month); read replicas for balances.
  • Tiered Kafka storage; CQRS read models for all analytics (never query write DBs).
  • Multi-region active-active with tenant-pinned data residency (NBE).
  • PCI-scoped isolated cluster if Cards ship.
  • Consider Wallet as a product only if stored-value ever becomes the operating model (it is still not a platform dependency).
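One way to picture "shard by tenant, partition by (tenant, month)" is a pair of pure routing functions: one maps a tenant to a shard, the other derives a monthly partition key for an entry. This is a sketch under assumptions: the FNV-1a hash, the modulo placement, and the `tenant:YYYY-MM` key format are illustrative choices, not decisions recorded in this plan. Modulo placement reshuffles tenants when the shard count changes, so a lookup directory or consistent hashing may be preferable in practice.

```typescript
// 32-bit FNV-1a over UTF-16 code units (equivalent to bytes for ASCII tenant IDs).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Hypothetical placement rule: every entry for a tenant lands on the same shard.
function shardFor(tenantId: string, shardCount: number): number {
  if (!Number.isInteger(shardCount) || shardCount <= 0) {
    throw new Error("shardCount must be a positive integer");
  }
  return fnv1a32(tenantId) % shardCount;
}

// Hypothetical partition key: one partition per tenant per UTC calendar month.
function monthlyPartition(tenantId: string, postedAt: Date): string {
  const year = postedAt.getUTCFullYear();
  const month = String(postedAt.getUTCMonth() + 1).padStart(2, "0");
  return `${tenantId}:${year}-${month}`;
}
```

Because both functions depend only on the tenant and the posting date, any service can compute where an entry lives without a lookup.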

✅ Phase 4 exit gate

  • Ledger sustains millions of entries/day with sharding + replicas.
  • Multi-region active-active with data residency.

Dependency graph (what blocks what)

A1 (deploy Go tier) ─┐
A2 (payroll consumer)┼─► B2 (Kafka consumers) ─► Phase-0 e2e gate
A3 (webhook RLS) ────┘
A4 (RLS in CI) ──────► B1 (schema carve-out) ─┬─► D1 (code reorg, same boundaries) ─► P1.5 (separate DBs)
                                              └─► D3 (tenancy carve-out / polymorphic org)
A5 (auth gap) ───────► B4 (edge layer) ──► E1 (auth tests cover the now-uniform surface)
A4 (RLS in CI) ──────────────────────────► E2 (employee tenant-isolation test runs under NOSUPERUSER)
B3 (contracts/buf) ──► P1.1 (Payroll→Go)
D2 (catalog factory) ─► (independent)
D4/D5/D6 ────────────► (independent, mechanical)
E1/E2/E3 (test debt) ─► (independent; E2 needs A4's role; E3 mechanical)
C1/C2/C3/C4 ─────────► (independent, anytime; C4 = shared-pkg consolidation)

Critical path to launch: A1 → A2/A3 → B2 → e2e gate, in parallel with B1 (schema carve-out) and A4 (RLS proof). B3 must finish before Phase 1's Payroll→Go. Do D1 with B1 (same context boundaries — one reorg, not two) and D3 with the B1 Identity/Tenancy carve-out.
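The graph above can also be kept as data so that "what can start now?" is computed rather than read off the diagram. The sketch below transcribes only the arrows drawn in the graph (independent workstreams are omitted), and `BLOCKS` and `readyToStart` are hypothetical names introduced for illustration.

```typescript
// Edges transcribed from the dependency graph: [blocker, blocked].
const BLOCKS: ReadonlyArray<readonly [string, string]> = [
  ["A1", "B2"], ["A2", "B2"], ["A3", "B2"], ["B2", "Phase-0 e2e gate"],
  ["A4", "B1"], ["B1", "D1"], ["D1", "P1.5"], ["B1", "D3"],
  ["A5", "B4"], ["B4", "E1"], ["A4", "E2"], ["B3", "P1.1"],
];

// Returns the workstreams that are not yet done and whose blockers are all done.
function readyToStart(done: ReadonlySet<string>): string[] {
  const nodes = new Set<string>();
  const blockers = new Map<string, string[]>();
  for (const [from, to] of BLOCKS) {
    nodes.add(from);
    nodes.add(to);
    blockers.set(to, [...(blockers.get(to) ?? []), from]);
  }
  return Array.from(nodes)
    .filter((n) => !done.has(n))
    .filter((n) => (blockers.get(n) ?? []).every((b) => done.has(b)))
    .sort();
}
```

With nothing done, this yields A1 through A5 plus B3; once A1, A2, and A3 are marked done, B2 becomes startable while B1 still waits on A4.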


Progress tracker

| Workstream | Status | PR(s) | Owner |
| --- | --- | --- | --- |
| A1 Deploy Go tier | ◐ ledger+gateway+their DBs+migrate one-shots in root docker-compose.yml; api depends_on + bank-sandbox webhook wired; docker compose config VALID; README updated. Gateway gRPC reachability now surfaces via /readyz + demozpay_dependency_up (new IntegrationGatewayGrpcHealthIndicator, advisory like ledger). REMAINS: runtime boot verification (pnpm docker:up → migrate → curl healthz) | | |
| A2 Payroll consumer | | | |
| A3 Webhook RLS | | | |
| A4 RLS in CI | | | |
| A5 Auth gap | ◐ registered 8 missing controllers + /api/me/* in forRoutes (tsc/lint clean); REMAINS: CI-coverage guard, glob refactor, CourtOrderRemit (gated on A2), runtime smoke-test | | |
| B1 Schema carve-out | | | |
| B2 Kafka consumers | | | |
| B3 Contracts/buf | | | |
| B4 Edge layer | | | |
| D1 api code reorg by context | ◐ WORKFORCE done: 11 flat folders (employee + 3 child + 7 catalog) → apps/api/src/workforce/ (git mv, history preserved; imports fixed; app+spec tsc green; no new lint). src root 50→40. REMAINS: identity, payroll, money, compliance, products context folders + cross-context import lint | | |
| D2 Catalog CRUD factory | | | |
| D3 Polymorphic org / tenancy carve-out | | | |
| D4 Dedupe controller utils | AuthenticatedRequest+mustGetActor→shared-infra; parseSantimMaybeNull→shared-money. tsc+lint clean | | |
| D5 Merge identity-verification + tenant | | | |
| D6 REST consistency | | | |
| E1 Auth RBAC tests | | | |
| E2 Employee domain tests | | | |
| E3 Employee loose ends | | | |
| E4 Auth plan deviation (opt) | | | |
| C1 ADRs | ✅ ADR-018..025 written; index fixed (now lists 015-025); ADR-017 broken ADR-008 link fixed; all cross-refs resolve | | |
| C2 Doc fixes | ✅ payroll lies (MONEY_FLOWS×3, BANK_ORCHESTRATION); Equb custody overstatement (DOMAIN_KNOWLEDGE_BASE — regulator-facing); ESLint→scripts/ci/adr-guards.sh (CLAUDE.md + ADR-009); 90_DAY stale banner; audits/* counts (ROADMAP+CURRENT_STATE 18→41 ctrls/73→71 models/41→85 migs, DOC_AUDIT 17→25 ADRs); EMPLOYEE_MGMT M7/M8/M9 shipped; AUTH_SYSTEM_REVIEW staleness banner | | |
| C3 Dead-code delete | ◐ DONE: libs/ + apps/docs/ (empty); 2 dead ports removed (lending EqubBehaviorSignal — 3 files + bindings; kyc DocumentStoragePort — port+token+comments); payroll replay() deleted + false "verified" docstring corrected; tsc+lint clean. DEFERRED (decisions): docs-web (~10 refs), services/notifications (Phase-1 target); test-pr*.mjs → convert-then-delete in E | | |
| C4 shared 13→8 consolidation | | | |

(Status values: ☐ not started · ◐ in progress · ✅ done · ⊘ blocked)


Source of truth for the target state: ../architecture/TARGET_PLATFORM_ARCHITECTURE.md. Update this plan as steps land; keep the blueprint stable.