DemozPay — Technical Architecture & Engineering Blueprint
⚠️ Partially superseded (2026-05-23). This document was written before the May 2026 restructure (docs/architecture/restructure-2026-05.md). Folder references to libs/api/*, libs/shared/*, apps/business, apps/client, apps/bnpl-partner, apps/fi, apps/admin, apps/docs, apps/ledger, apps/integration-gateway, and apps/notifications are stale. The principles (modular monolith, ledger isolation, money correctness, idempotency, outbox, tenant scoping, two-language ceiling) remain authoritative and are codified in docs/adr/ ADR-001 through ADR-011.
Authoritative locations now:
- Architecture decisions: docs/adr/
- Folder structure: PROJECT_STRUCTURE.md + docs/architecture/restructure-2026-05.md
- Per-domain conventions: packages/README.md
- Shared infra: packages/shared/README.md
This document is preserved for historical context: it is the blueprint that informed the restructure. Do not author new decisions here; write an ADR.
Original blueprint
Payroll-linked financial infrastructure platform for Ethiopia and the wider African market.
Audience: founding engineering team, CTO, platform architects, security lead, SRE lead.
Goal: a production-grade, fintech-grade architecture that is executable in the next 90 days and scalable to 1M+ users without throwing away the original system.
Status: v1.0 reference blueprint.
Table of Contents
- Executive Summary & Architectural Stance
- System Architecture — Modular Monolith → Microservices
- Service Catalog — Microservice Design
- Fintech Core — Ledger, Money Movement, Integrity
- Security Architecture
- Infrastructure & DevOps
- Database & Data Architecture
- Language & Framework Decisions
- API Design
- Financial Integrations (Banks, Wallets, MFIs)
- Product Engineering Strategy (0 → 1M users)
- Team Structure & Operating Model
- Implementation Roadmap
- Appendix — Decision Records & Anti-Patterns to Avoid
1. Executive Summary & Architectural Stance
DemozPay is not "an EWA app." It is a payroll-anchored financial identity and money-movement platform that emits trustworthy income data and converts it into financial access (EWA, salary loans, BNPL, Equb, savings, payroll financing).
Two architectural truths drive everything in this document:
- Payroll is the data moat. Every product downstream — credit decisioning, EWA limits, BNPL approval, payroll-backed liquidity — derives correctness from the payroll engine and the ledger. Both must be boringly correct before anything else gets fancy.
- Velocity > purity at startup phase, but correctness is non-negotiable for money. We deliberately accept "less elegant" architecture in non-financial surfaces (admin UIs, notifications, reporting) in exchange for shipping faster. We do not compromise on the ledger, idempotency, audit, and authn/authz.
Headline decisions
| Area | Decision | Rationale (one line) |
|---|---|---|
| Topology | Modular monolith + 3 satellite services from day 1 (Ledger, Integration Gateway, Notification Worker) | Operational simplicity now, clean boundaries for later carve-out. |
| Primary language | TypeScript / NestJS for product services; Go for ledger, integration adapters, settlement, high-throughput workers | Hiring in Ethiopia + ecosystem maturity + correctness for money services. |
| Datastore | PostgreSQL as the single source of truth for money; Redis for state cache + rate limits; Kafka (or Redpanda) for events; ClickHouse for analytics | PG is the only DB we trust for ledger-grade ACID. |
| Comms | REST + Webhooks externally; gRPC internally on the critical path; Kafka for async fan-out | Right tool per surface. |
| Deployment | Kubernetes (managed) + GitOps (ArgoCD) + Terraform | Standard, reversible, hireable. |
| Data residency | Primary data plane in-country (Ethiopia) for regulated data; analytics/observability copy in regional cloud (AWS Cape Town / Azure South Africa) with NBE-aligned controls | NBE data residency, latency to local banks/wallets, sovereignty. |
| Identity | OAuth2 / OIDC with short-lived JWTs + refresh; step-up MFA for high-value ops; mTLS between services | Zero-trust baseline. |
| Ledger | Double-entry, append-only, immutable journal with idempotency keys and saga-based orchestration | Non-negotiable for any financial system. |
What we are explicitly not doing in year one
- Not building our own payment switch.
- Not running our own KYC / national-ID matching — we integrate with Fayda (Ethiopian National ID) and partner KYC vendors.
- Not buying a core banking system — we build a thin ledger we fully own and integrate to banks/wallets for settlement.
- Not multi-region active-active. Active-passive DR is enough at startup scale.
- Not service-mesh-at-launch (Istio/Linkerd). mTLS via cert-manager + simple ingress is sufficient until > 10 services.
- Not event sourcing the whole system. Event sourcing only inside the ledger journal.
2. System Architecture
2.1 Why a modular monolith first (and where the exceptions are)
Microservices-from-day-one is a common failure mode for early-stage fintech startups, in Africa and globally. Reasons it fails at our stage:
- Distributed transactions are hard. Salary disbursement that crosses 4 services with no shared transaction will produce silent inconsistency by month 3.
- Operational overhead. 15 services × CI/CD × on-call × dashboards × dependency upgrades = a team of 6 engineers doing platform work, not product.
- Premature service boundaries. Domain boundaries take 6–9 months to stabilize. Drawing them too early means rewriting them.
So we do this:
- One modular monolith ("DemozPay Core") containing: Identity, Tenancy, Employee, Payroll, Wallet (logical), EWA, Loans, BNPL, Equb, Reporting orchestration, Admin.
- Three carved-out services from day 1:
  - Ledger Service: strict isolation. Owns the journal. Reachable only through its API. Different language (Go). Different database. Hardened ops.
  - Integration Gateway: all outbound calls to banks/wallets/MFIs go through one process. Centralizes circuit breakers, retries, secrets, idempotency. Per-partner adapter modules.
  - Notification Worker: SMS, email, push. Already async, already failure-tolerant; cheap to separate, valuable to scale independently.
This gives us the operational simplicity of a monolith with the correctness of an isolated ledger and the resilience of an isolated integration boundary.
                  ┌─────────────────────────────────────────────┐
                  │                   CLIENTS                   │
                  │   Employer Web • Employee Mobile • Admin    │
                  └─────────────────────────────────────────────┘
                                         │ HTTPS
                                         ▼
                  ┌─────────────────────────────────────────────┐
                  │      API GATEWAY (Kong / Envoy + WAF)       │
                  │ • TLS termination   • Authn (JWT)           │
                  │ • Rate limiting     • Idempotency-Key check │
                  │ • Request signing for partners              │
                  └─────────────────────────────────────────────┘
                                         │
          ┌──────────────────────────────┼──────────────────────────────┐
          ▼                              ▼                              ▼
┌───────────────────┐        ┌────────────────────┐        ┌───────────────────┐
│ DEMOZPAY CORE     │  gRPC  │ LEDGER SERVICE     │  gRPC  │ INTEGRATION GW    │
│ (modular          │◄──────►│ (Go, PostgreSQL    │◄──────►│ (Go, per-partner  │
│  monolith, NestJS │        │  isolated DB)      │        │  adapters)        │
│  + workers)       │        └────────────────────┘        └────────┬──────────┘
│                   │                                               │
│ • Identity        │                                               ▼
│ • Tenancy         │                                 ┌──────────────────────────┐
│ • Employee        │                                 │ Banks / Wallets / MFIs   │
│ • Payroll         │                                 │ CBE, Awash, Telebirr,    │
│ • EWA / Loan /    │                                 │ M-Pesa ET, etc.          │
│   BNPL / Equb     │                                 └──────────────────────────┘
│ • Risk            │
│ • Reporting       │      Kafka events
│ • Admin           │◄──────────────────────► Notification Worker
└───────────────────┘                         (SMS / Email / Push)
          │
          ▼
┌─────────────────────────────────────────────────────────┐
│ POSTGRES (primary + replica) • REDIS • KAFKA            │
│ ClickHouse (analytics) • S3 (documents, audit)          │
└─────────────────────────────────────────────────────────┘
2.2 Domain-Driven Design — bounded contexts
We model the business as 7 bounded contexts. These survive into the microservices era unchanged.
| Bounded Context | Owns | Talks to |
|---|---|---|
| Identity & Access | Users, sessions, tokens, RBAC, MFA, device trust | Everything via JWT |
| Tenancy | Employer (company) accounts, plans, configuration | Identity, Payroll, Billing |
| Workforce | Employees, departments, contracts, salaries, tax profiles | Payroll, Risk |
| Payroll | Pay cycles, calculations, tax rules, payslips, disbursement orchestration | Workforce, Ledger, Integration GW |
| Money | Wallets, transactions, ledger, settlement, reconciliation | All money-touching contexts |
| Lending | EWA, salary loans, BNPL, repayment schedules, collections | Workforce (income), Money, Risk |
| Savings & Community | Goals-based savings, Equb (group rotating savings) | Money |
Cross-context communication: published domain events on Kafka, never direct DB reads across contexts.
2.3 Event-driven where it pays for itself
We are event-driven by default for side effects, synchronous by default for money movement on the user's path. A user clicking "withdraw salary" needs an authoritative answer in 800ms — not "event published, check back later."
Use events for:
- Notification fan-out (SMS, email, push)
- Reporting / analytics ETL
- Cross-context updates (employee onboarded → KYC kickoff → wallet provision)
- Risk signals (large transfer → fraud check)
- Audit trail mirroring to immutable storage
Do not use events for:
- The actual debit/credit of a wallet (synchronous gRPC to ledger).
- Authentication.
- Anything where the caller cannot reason about "did it happen?"
2.4 Strangler-fig migration plan to microservices
We will carve services out only when triggered, not on a schedule. Triggers:
- A bounded context has 3+ engineers full-time for >3 months → split it out.
- A context has distinct scaling profile (e.g. notifications at 50x request rate of payroll) → split it out.
- A context has distinct compliance scope (e.g. card data) → split it out.
- A context becomes a deployment bottleneck (one team can't ship because of another) → split it out.
Order we expect to carve out (rough, not prescriptive):
Year 1:  [Core monolith] + [Ledger] + [Integration GW] + [Notification]
Year 2:  + KYC Service (own adapters, own data retention)
         + Risk/Fraud Service (own ML stack, Python)
         + Reporting Service (own read-model, ClickHouse-native)
Year 3:  + Payroll Engine (carved from Core, scaled separately)
         + Lending Service (EWA + Loans + BNPL together)
         + Settlement & Reconciliation Service
Year 4+: + Per-product services if/when product complexity demands it
3. Service Catalog
For each service: purpose · owner · key APIs · data ownership · sync vs async · language.
3.1 API Gateway
- Purpose: TLS termination, authn, rate limiting, request signing verification, idempotency-key enforcement, request/response logging, WAF.
- Tech: Kong OSS (or Envoy + custom filters). Cloudflare in front for DDoS + WAF where data residency allows.
- Not in scope: business logic, authorization beyond "is this token valid?" Authorization decisions live in the services.
- Critical settings: request body size cap (e.g. 1MB default, 8MB for document upload routes); per-route rate limits; mTLS for partner webhooks.
3.2 Identity & Auth Service
- Inside the monolith as a module, but with a strict public API surface so we can carve it out later.
- Purpose: registration, login (employer + employee), session, refresh tokens, MFA, password reset, device registration, RBAC, scopes.
- APIs:
  - POST /v1/auth/register
  - POST /v1/auth/login (returns short-lived JWT + refresh)
  - POST /v1/auth/refresh
  - POST /v1/auth/mfa/challenge, POST /v1/auth/mfa/verify
  - POST /v1/auth/step-up (for high-value ops)
  - GET /v1/auth/me
  - POST /v1/auth/logout
- JWT: 15-minute access tokens. Refresh tokens stored hashed (Argon2id) in DB with rotation on use. Token includes sub, tid (tenant ID), roles, scopes, device_id, iat, exp, jti.
- MFA: TOTP first; SMS OTP as fallback (acknowledging SIM-swap risk; flag SMS-OTP as weaker auth in risk engine).
- Step-up: any operation moving > X ETB or modifying payroll requires re-auth in the last 5 minutes (configurable per tenant).
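The step-up rule above can be sketched as a pure check. This is a minimal sketch under stated assumptions, not the service's actual code: the `lastStepUpAt` claim, the `StepUpPolicy` shape, and the operation kinds are hypothetical names introduced for illustration; only `sub`, `tid`, and `scopes` come from the token fields listed above.

```typescript
// Claims this sketch needs. `lastStepUpAt` is an assumed extra claim:
// epoch seconds of the user's last interactive re-authentication.
interface AccessClaims {
  sub: string;
  tid: string;
  scopes: string[];
  lastStepUpAt?: number;
}

// Hypothetical per-tenant policy.
interface StepUpPolicy {
  thresholdMinor: number; // the "X ETB" threshold, expressed in santim
  maxAgeSeconds: number;  // 300 corresponds to "in the last 5 minutes"
}

interface SensitiveOp {
  kind: "transfer" | "payroll_change";
  amountMinor: number;
}

// Returns true when the caller must re-authenticate before the operation.
function requiresStepUp(
  claims: AccessClaims,
  op: SensitiveOp,
  policy: StepUpPolicy,
  nowSeconds: number,
): boolean {
  const sensitive =
    op.kind === "payroll_change" || op.amountMinor > policy.thresholdMinor;
  if (!sensitive) return false;
  if (claims.lastStepUpAt === undefined) return true;
  return nowSeconds - claims.lastStepUpAt > policy.maxAgeSeconds;
}
```

Keeping the check free of I/O makes it easy to unit-test per tenant policy; where the re-auth timestamp actually lives (token claim or session store) is a design choice the text leaves open.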
3.3 Payroll Service (module, monolith year 1)
- Purpose: salary structures, pay cycles, tax & pension calculation, payslip generation, disbursement orchestration.
- Key concept: payroll calculation is deterministic and replayable. Given the same inputs (employees, salary structures, tax rules at date T), output must be byte-identical. This is required for audits.
- APIs:
  - POST /v1/payroll/runs (create a draft run)
  - POST /v1/payroll/runs/{id}/calculate (runs the engine, produces payslips; no money moves)
  - POST /v1/payroll/runs/{id}/approve (locks the run, no further edits)
  - POST /v1/payroll/runs/{id}/disburse (orchestrates ledger postings + bank/wallet payouts as a saga)
  - GET /v1/payroll/runs/{id}/report
- Tax engine: rules are versioned and date-effective. Never edit a tax rule in place; create a new version with an effective date. Ethiopian income tax brackets, pension (Public Servants 7%/11%, Private 7%/11%), Cost-Sharing as applicable.
- Async: disbursement is a long-running saga (see §4.4). API returns 202 with a run ID; UI polls a run status endpoint.
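The "versioned and date-effective" rule lookup can be illustrated with a small sketch. The `TaxRuleVersion` shape and the function name are assumptions made for this example, and it deliberately carries no real tax rates; it only shows how a replayable calculation would pick the version in effect on a given date.

```typescript
// Hypothetical shape of one immutable rule version.
interface TaxRuleVersion {
  ruleId: string;
  version: number;
  effectiveFrom: string; // ISO date (YYYY-MM-DD), inclusive
}

// Picks the version of `ruleId` in effect on `asOf`. ISO dates compare
// correctly as plain strings, so no Date parsing is needed.
function ruleInEffect(
  versions: TaxRuleVersion[],
  ruleId: string,
  asOf: string,
): TaxRuleVersion | undefined {
  let best: TaxRuleVersion | undefined;
  for (const v of versions) {
    if (v.ruleId !== ruleId || v.effectiveFrom > asOf) continue;
    if (best === undefined || v.effectiveFrom > best.effectiveFrom) best = v;
  }
  return best;
}
```

Because versions are never edited in place, re-running a payroll for date T always resolves to the same rule version, which is what makes the output reproducible for audits.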
3.4 Employee / Workforce Service (module)
- Purpose: employees, departments, hierarchies, contracts, salary history.
- Key concept: bitemporal data. Every change has a valid-from / valid-to date AND a recorded-at date. We can answer "what was this employee's salary on March 1 as we believed it on March 5?" — required for retroactive payroll and audit.
- APIs: CRUD on employees, departments, contracts; bulk CSV import; salary change history.
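The bitemporal question above ("what was the salary on March 1 as we believed it on March 5?") can be sketched as a lookup over two time axes. The `SalaryFact` shape is an assumption for illustration, not the actual schema.

```typescript
// Hypothetical bitemporal row: a valid-time interval plus a recorded-at date.
interface SalaryFact {
  employeeId: string;
  amountMinor: number;
  validFrom: string;      // ISO date the salary applies from (inclusive)
  validTo: string | null; // ISO date it stops applying (exclusive); null = open-ended
  recordedAt: string;     // ISO date on which we recorded this fact
}

// Salary valid on `validOn`, using only facts recorded on or before `knownAt`.
// If several known facts cover the date, the most recently recorded one wins.
function salaryAsOf(
  facts: SalaryFact[],
  employeeId: string,
  validOn: string,
  knownAt: string,
): number | undefined {
  let best: SalaryFact | undefined;
  for (const f of facts) {
    if (f.employeeId !== employeeId) continue;
    if (f.recordedAt > knownAt) continue; // not yet known at that time
    if (f.validFrom > validOn) continue;  // not yet valid
    if (f.validTo !== null && f.validTo <= validOn) continue; // already ended
    if (best === undefined || f.recordedAt > best.recordedAt) best = f;
  }
  return best === undefined ? undefined : best.amountMinor;
}
```

A retroactive raise recorded on March 10 changes the answer for "as known on March 10" but leaves "as known on March 5" untouched, which is the property retroactive payroll and audit rely on.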
3.5 Wallet Service (module, calls Ledger Service)
- Purpose: per-user / per-business logical wallet — balance views, transaction history, P2P transfers, top-ups, withdrawals.
- Important: wallet balance is not stored as a column. Balance is derived from the ledger (or cached materialized view). This is non-negotiable. See §4.2.
- APIs:
  - GET /v1/wallets/{id}/balance
  - GET /v1/wallets/{id}/transactions
  - POST /v1/wallets/{id}/transfers (requires Idempotency-Key header)
  - POST /v1/wallets/{id}/topup
  - POST /v1/wallets/{id}/withdraw
3.6 Ledger Service (DAY 1 SEPARATE SERVICE)
- Purpose: double-entry journal. Single source of truth for all money in the system.
- Language: Go. Reasons: predictable latency, no GC stalls of concern at our scale, easy to package as a small static binary, strong concurrency primitives for batch posting.
- Storage: dedicated PostgreSQL cluster, separate from monolith DB. Tables append-only. No DELETE statements anywhere in the codebase — enforced by linter and DB-level revoked permissions.
- APIs (gRPC):
  - PostTransaction(req): accepts idempotency_key, an array of entries that must sum to zero, optional metadata. Returns transaction ID.
  - GetBalance(account_id, as_of_timestamp?): point-in-time balance.
  - GetEntries(filters): read.
  - Reverse(transaction_id, reason): posts a reversing transaction, never deletes.
- Invariants enforced inside the service (not at the caller):
  - All entries in a transaction sum to zero per currency.
  - Account types have sign rules (asset/liability/income/expense/equity); violations are rejected.
  - Idempotency-key uniqueness window is 24 hours minimum.
- Why isolated from day 1: the blast radius of a bug in ledger code is catastrophic. Separation enforces a tiny API surface and dedicated review attention.
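The first invariant above (entries sum to zero per currency) is simple enough to sketch. The ledger itself is specified as a Go service; TypeScript is used here only for illustration, and the `JournalEntry` shape and function name are assumptions. Amounts use JavaScript numbers restricted to safe integers, which is narrower than a production ledger would want.

```typescript
type Direction = "debit" | "credit";

// Hypothetical entry shape: a positive amount in minor units plus a direction.
interface JournalEntry {
  accountId: string;
  currency: string;
  direction: Direction;
  amountMinor: number;
}

// Returns a list of violations; an empty list means the transaction is balanced.
function validateBalanced(entries: JournalEntry[]): string[] {
  const errors: string[] = [];
  if (entries.length < 2) errors.push("a transaction needs at least two entries");
  const net: { [currency: string]: number } = {};
  for (const e of entries) {
    if (!Number.isSafeInteger(e.amountMinor) || e.amountMinor <= 0) {
      errors.push(`invalid amount on ${e.accountId}`);
      continue;
    }
    // Debits count positive and credits negative, so a balanced set nets to 0.
    const signed = e.direction === "debit" ? e.amountMinor : -e.amountMinor;
    net[e.currency] = (net[e.currency] || 0) + signed;
  }
  Object.keys(net).forEach((currency) => {
    if (net[currency] !== 0) errors.push(`entries do not sum to zero for ${currency}`);
  });
  return errors;
}
```

Running this check inside the ledger rather than in callers means a buggy product module cannot post an unbalanced transaction.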
3.7 EWA Service (module, calls Ledger + Risk + Integration GW)
- Purpose: earned-wage access — let employee draw a portion of accrued-but-unpaid salary.
- Mechanics: accrual is computed daily as (monthly_salary × elapsed_workdays / total_workdays) − already_drawn − projected_deductions. Cap configurable (e.g. 50% of accrued).
- APIs:
  - GET /v1/ewa/accrual
  - POST /v1/ewa/draw (idempotent; runs risk check; posts ledger entry; triggers payout via Integration GW)
- Repayment: on payroll run, outstanding EWA is netted from the payslip. Whether netting happens before or after tax depends on the regulator's stance and is TBD with NBE/legal. Track outstanding in ledger; settle automatically on payday.
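The accrual formula from the Mechanics bullet can be worked through in integer arithmetic. This is a sketch under assumptions: the field names are hypothetical, the cap is expressed in basis points, and it assumes the cap applies to the net accrued figure the formula produces (the text does not say whether the cap applies before or after subtracting prior draws).

```typescript
interface EwaInputs {
  monthlySalaryMinor: number;      // santim
  elapsedWorkdays: number;
  totalWorkdays: number;
  alreadyDrawnMinor: number;
  projectedDeductionsMinor: number;
  capBps: number;                  // 5000 = 50%
}

// (monthly_salary * elapsed / total) - already_drawn - projected_deductions,
// then capped. All divisions round down so we never over-advance.
function ewaAvailableMinor(i: EwaInputs): number {
  if (i.totalWorkdays <= 0) throw new Error("totalWorkdays must be positive");
  const earned = Math.floor(
    (i.monthlySalaryMinor * i.elapsedWorkdays) / i.totalWorkdays,
  );
  const accrued = earned - i.alreadyDrawnMinor - i.projectedDeductionsMinor;
  if (accrued <= 0) return 0;
  return Math.floor((accrued * i.capBps) / 10000);
}
```

For example, a 30,000 ETB salary halfway through a 22-workday month, with 2,000 ETB already drawn and 1,000 ETB of projected deductions, accrues 12,000 ETB; at a 50% cap, 6,000 ETB is available.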
3.8 Loan Service (module → carved out year 2)
- Purpose: salary-backed term loans, BNPL installments.
- Components: loan product catalog, application, decisioning hook (delegates to Risk), disbursement (delegates to Ledger + Integration GW), repayment schedule, collections, NPL tracking.
- State machine: APPLIED → UNDERWRITING → APPROVED/REJECTED → DISBURSED → REPAYING → CLOSED / DEFAULTED / WRITTEN_OFF. State transitions are events.
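The loan state machine above can be encoded as a transition table, with each accepted transition producing an event. This sketch takes the arrows literally; the event shape and names are assumptions, and any extra edges (for example DEFAULTED to WRITTEN_OFF) would be a product decision not stated in the text.

```typescript
type LoanState =
  | "APPLIED" | "UNDERWRITING" | "APPROVED" | "REJECTED" | "DISBURSED"
  | "REPAYING" | "CLOSED" | "DEFAULTED" | "WRITTEN_OFF";

// Allowed next states, read directly from the state machine line above.
const LOAN_TRANSITIONS: Record<LoanState, LoanState[]> = {
  APPLIED: ["UNDERWRITING"],
  UNDERWRITING: ["APPROVED", "REJECTED"],
  APPROVED: ["DISBURSED"],
  REJECTED: [],
  DISBURSED: ["REPAYING"],
  REPAYING: ["CLOSED", "DEFAULTED", "WRITTEN_OFF"],
  CLOSED: [],
  DEFAULTED: [],
  WRITTEN_OFF: [],
};

// Hypothetical domain event emitted for every accepted transition.
interface LoanStateChanged {
  type: "loan.state_changed";
  loanId: string;
  from: LoanState;
  to: LoanState;
}

function transitionLoan(loanId: string, from: LoanState, to: LoanState): LoanStateChanged {
  if (LOAN_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal loan transition ${from} -> ${to}`);
  }
  return { type: "loan.state_changed", loanId, from, to };
}
```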
3.9 Equb / Savings Service (module)
- Purpose: group rotating-savings (Equb is the Ethiopian rotating-savings tradition — N members contribute monthly, each month one member receives the pot, rotated until everyone has been paid out).
- Mechanics: create group, invite members, lock contribution schedule, define rotation (lottery, fixed, bid-based), execute monthly cycle on a scheduled job. Each cycle is a deterministic ledger transaction.
- Why custom and not a generic savings product: Equb has social enforcement, group rules, missed-payment penalties that are very specific. This is a differentiating feature for Ethiopia and worth building well.
3.10 KYC Service (year-2 carve-out; module year 1)
- Purpose: identity verification flows, document capture, liveness check, sanctions/PEP screening, ongoing AML monitoring.
- External integrations: Fayda (Ethiopian National ID), partner KYC vendor (e.g. Smile ID, Veriff equivalent operating in Ethiopia), sanctions lists (UN, OFAC, internal blacklists).
- Data residency: KYC artifacts (photos, ID scans) MUST be stored in-country. Encrypted with per-tenant data keys. Retention: regulator-defined; default 7 years post account closure.
3.11 Integration Gateway (DAY 1 SEPARATE SERVICE)
- Purpose: the only process in the system that talks to external banks/wallets/MFIs.
- Why isolated: outbound integrations are the #1 source of unbounded latency, partner outages, and credential leakage. Centralizing them gives us one place for circuit breakers, retries, secrets, idempotency, observability.
- Structure:
Integration Gateway
├── adapters/
│   ├── cbe/          (Commercial Bank of Ethiopia)
│   ├── awash/
│   ├── dashen/
│   ├── telebirr/
│   ├── mpesa_et/
│   ├── cbe_birr/
│   └── ...
├── core/
│   ├── circuit_breaker.go
│   ├── retry.go
│   ├── idempotency.go
│   ├── signing.go
│   └── audit.go
└── grpc/
    └── server.go     (uniform internal API)
- Uniform internal API:
  - InitiatePayout(partner, account, amount, reference, idempotency_key)
  - QueryPayout(partner, reference), for reconciliation
  - RegisterAccount(partner, kyc_payload)
  - HandleWebhook(partner, payload, signature), verifies and translates to internal event
- Adapter contract: every adapter exposes the same interface. Per-adapter quirks (different status codes, different idempotency semantics, different webhook formats) are normalized inside the adapter.
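The adapter contract can be sketched as one interface plus a per-partner class that hides the quirks. The gateway is specified in Go; TypeScript is used here only for illustration. `ExampleBankAdapter`, its raw codes "00" and "09", and all type names are hypothetical and do not describe any real partner API. Calls are synchronous for brevity where real adapters would be asynchronous.

```typescript
type PayoutStatus = "PENDING" | "SUCCEEDED" | "FAILED";

interface PayoutRequest {
  account: string;
  amountMinor: number;
  currency: string;
  reference: string;
  idempotencyKey: string;
}

interface PayoutResult {
  reference: string;
  status: PayoutStatus;
  partnerCode: string; // raw partner code, kept for audit
}

// Every adapter exposes the same surface to the rest of the gateway.
interface PartnerAdapter {
  partner: string;
  initiatePayout(req: PayoutRequest): PayoutResult;
  queryPayout(reference: string): PayoutResult | undefined;
}

// Hypothetical partner that answers with raw string codes.
class ExampleBankAdapter implements PartnerAdapter {
  partner = "example_bank";
  private results = new Map<string, PayoutResult>(); // keyed by idempotency key
  private callPartner: (req: PayoutRequest) => string;

  constructor(callPartner: (req: PayoutRequest) => string) {
    this.callPartner = callPartner;
  }

  initiatePayout(req: PayoutRequest): PayoutResult {
    const previous = this.results.get(req.idempotencyKey);
    if (previous !== undefined) return previous; // never send the same payout twice
    const code = this.callPartner(req);
    // Partner-specific codes are normalized here, inside the adapter.
    const status: PayoutStatus =
      code === "00" ? "SUCCEEDED" : code === "09" ? "PENDING" : "FAILED";
    const result: PayoutResult = { reference: req.reference, status, partnerCode: code };
    this.results.set(req.idempotencyKey, result);
    return result;
  }

  queryPayout(reference: string): PayoutResult | undefined {
    let found: PayoutResult | undefined;
    this.results.forEach((r) => {
      if (r.reference === reference) found = r;
    });
    return found;
  }
}
```

Callers only ever see `PayoutStatus`, so adding a new partner means writing one adapter rather than touching payroll or EWA code.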
3.12 Notification Worker (DAY 1 SEPARATE SERVICE)
- Purpose: SMS, email, push notifications, in-app messages, WhatsApp Business.
- Why isolated: different scaling profile, different SLAs (best-effort, not money-correctness), different external dependencies (Twilio/local SMS aggregator, SendGrid, FCM, Meta).
- Pattern: Kafka consumer group → template render → provider dispatch → delivery tracking → DLQ on failure.
- Templates: versioned, A/B testable, localized (Amharic, Oromo, Tigrinya, English at minimum).
3.13 Risk / Fraud Service (module → carve-out year 2 in Python)
- Purpose: real-time risk scoring on transfers, EWA, loan applications, login; rules engine + ML models.
- APIs:
  - Evaluate(context, action) → ALLOW | CHALLENGE_MFA | REVIEW | DENY, with reasons
- Components: rules engine (configurable in UI by ops), feature store (Redis + offline store in ClickHouse), model serving (year 2: gradient-boosted on transaction features).
- Day-1 rules to ship: velocity limits, device-change checks, geo anomalies, new-payee delay, large-amount thresholds, dormant-account-reactivation flag.
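A rules-only version of Evaluate can be sketched as a list of rules where the most severe decision wins and every triggered rule contributes a reason. The context fields, rule names, and thresholds below are hypothetical placeholders, not the shipped rule set.

```typescript
type Decision = "ALLOW" | "CHALLENGE_MFA" | "REVIEW" | "DENY";

// Ordered from least to most severe.
const SEVERITY: Decision[] = ["ALLOW", "CHALLENGE_MFA", "REVIEW", "DENY"];

// Hypothetical context; a real one would carry many more signals.
interface RiskContext {
  transfersLastHour: number;
  knownDevice: boolean;
  amountMinor: number;
}

interface RiskRule {
  name: string;
  check(ctx: RiskContext): Decision;
}

interface Evaluation {
  decision: Decision;
  reasons: string[];
}

// Illustrative thresholds only.
const DAY_ONE_RULES: RiskRule[] = [
  { name: "velocity", check: (c) => (c.transfersLastHour > 10 ? "DENY" : "ALLOW") },
  { name: "new_device", check: (c) => (c.knownDevice ? "ALLOW" : "CHALLENGE_MFA") },
  { name: "large_amount", check: (c) => (c.amountMinor > 5000000 ? "REVIEW" : "ALLOW") },
];

function evaluate(ctx: RiskContext, rules: RiskRule[]): Evaluation {
  let decision: Decision = "ALLOW";
  const reasons: string[] = [];
  for (const rule of rules) {
    const d = rule.check(ctx);
    if (d === "ALLOW") continue;
    reasons.push(rule.name);
    if (SEVERITY.indexOf(d) > SEVERITY.indexOf(decision)) decision = d;
  }
  return { decision, reasons };
}
```

Returning all triggered reasons, not just the winning one, gives the ops fraud queue the context it needs when a case lands in REVIEW.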
3.14 Reporting Service (module → carve-out)
- Purpose: all reads that don't need to be transactional — payroll reports, employer dashboards, financial statements, regulator reports.
- Backed by: read replica of PG + ClickHouse for heavier aggregations.
- Crucially: reports do NOT query the ledger directly during business hours at scale. The ledger is for writes and authoritative balance lookups. Reports use materialized views refreshed every N minutes or CDC-driven projections.
3.15 Audit Service (module, mirrors to immutable storage)
- Purpose: structured audit log of every state-changing action — who did what to which entity, with full before/after diff.
- Append-only. Two-tier storage:
  - Hot: PG audit_log table, ~90 days.
  - Cold: object storage (S3-compatible, in-country) with object-lock / WORM. Daily export. 7+ year retention.
- Every service emits audit events via a shared library. Audit emission is part of the same DB transaction as the state change (outbox pattern — see §4.5).
3.16 Settlement & Reconciliation Service (year 2)
- Purpose: daily settlement runs against each partner — pulling partner statements, matching against our ledger, producing a reconciliation report, flagging breaks for ops.
- Why critical: if our internal ledger and the bank's records diverge for a week, we have a regulatory and financial problem. A break detected within 24h is recoverable.
3.17 Admin / Backoffice Service (module)
- Purpose: internal tools — customer support views, manual ledger adjustments (with dual control), tenant config, fraud queue, KYC review queue, content & template management.
- Security: stricter authn (mandatory MFA + step-up + IP allowlist). Every action audited. Dual control / maker-checker for any money-moving manual operation.
4. Fintech Core
This is the section where we go slow and explicit. Get this wrong and nothing else matters.
4.1 Why double-entry, and why immutable
Every financial movement is recorded as a transaction containing 2 or more entries that sum to zero per currency. Each entry posts to an account which has a type (asset, liability, income, expense, equity) governing its normal sign.
Example: employee Aster draws 500 ETB via EWA.
Transaction T-9381 | ref: ewa_draw_84
Idempotency: ewa:84
──────────────────────────────────────────────────────────────
Account                                    |  Debit  | Credit
──────────────────────────────────────────────────────────────
liability:employee:aster:wallet            |         | 500.00
asset:ewa_advance_receivable:employer:42   | 500.00  |
──────────────────────────────────────────────────────────────
                                             500.00    500.00 ✓
Why immutable:
- Audits — regulators must be able to reproduce balances at any point in history.
- Forensics — if money is wrong, we must show exactly how it became wrong.
- Trust — never let support staff or engineers "edit a transaction."
If a transaction was wrong, you don't edit. You post a reversing transaction referencing the original. The journal grows; it never shrinks.
4.2 Balance is derived, not stored (with a cached projection)
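- Authoritative balance = SUM(credits) − SUM(debits) over all entries for the account, optionally WHERE posted_at <= as_of. This is the sign convention for credit-normal accounts such as wallet liabilities; debit-normal accounts such as assets use the opposite sign.
- Hot-path balance = a materialized projection account_balances(account_id, balance, updated_at, last_entry_id), updated inside the same DB transaction as the journal write. In PostgreSQL this means an ordinary table maintained by the ledger, since a native materialized view is only recomputed by an explicit refresh.
- The hot-path projection is for performance only. A nightly job reconciles it against the journal and alerts on drift.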
This is a hill we die on. Never store a balance column on a wallet table and increment it from application code. That is the most common way fintech startups silently corrupt balances.
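To make "derived, not stored" concrete, here is a minimal sketch of deriving a balance from journal entries and checking a cached projection for drift. The `PostedEntry` shape and both function names are assumptions; the real ledger would do this in SQL against the journal table.

```typescript
// Hypothetical journal entry; postedAt is epoch seconds.
interface PostedEntry {
  accountId: string;
  direction: "debit" | "credit";
  amountMinor: number;
  postedAt: number;
}

// SUM(credits) - SUM(debits), optionally as of a point in time.
// This is the credit-normal convention used for wallet accounts.
function deriveBalance(entries: PostedEntry[], accountId: string, asOf?: number): number {
  let balance = 0;
  for (const e of entries) {
    if (e.accountId !== accountId) continue;
    if (asOf !== undefined && e.postedAt > asOf) continue;
    balance += e.direction === "credit" ? e.amountMinor : -e.amountMinor;
  }
  return balance;
}

// Nightly check: every cached balance must equal the journal-derived one.
function findDrift(entries: PostedEntry[], projection: { [accountId: string]: number }): string[] {
  return Object.keys(projection).filter(
    (accountId) => projection[accountId] !== deriveBalance(entries, accountId),
  );
}
```

The projection can always be rebuilt from the journal, while the reverse is not true, which is why the journal stays authoritative.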
4.3 Idempotency — the single most important non-financial-but-financial concept
Every money-moving API endpoint MUST require an Idempotency-Key header. Server stores (idempotency_key, request_hash, response, expires_at) for at least 24h. Behavior:
- Same key + same request body → return the previously-stored response (do NOT re-execute).
- Same key + different request body → reject with 409 (caller bug or replay attack).
- Different key → execute.
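The three behaviors above can be sketched with an in-memory store. This is illustrative only: the class and type names are assumptions, the request hash is supplied by the caller, and a production version would rely on a unique constraint in the database rather than a Map so that concurrent duplicates are also caught.

```typescript
interface StoredResponse {
  requestHash: string;
  response: string;
  expiresAt: number; // epoch seconds
}

type IdempotencyOutcome =
  | { kind: "executed" | "replayed"; response: string }
  | { kind: "conflict"; status: 409 };

const IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60; // "at least 24h"

class IdempotencyStore {
  private records = new Map<string, StoredResponse>();

  handle(
    key: string,
    requestHash: string,
    nowSeconds: number,
    execute: () => string,
  ): IdempotencyOutcome {
    const existing = this.records.get(key);
    if (existing !== undefined && existing.expiresAt > nowSeconds) {
      if (existing.requestHash !== requestHash) {
        return { kind: "conflict", status: 409 }; // same key, different body
      }
      return { kind: "replayed", response: existing.response }; // do NOT re-execute
    }
    const response = execute();
    this.records.set(key, {
      requestHash,
      response,
      expiresAt: nowSeconds + IDEMPOTENCY_TTL_SECONDS,
    });
    return { kind: "executed", response };
  }
}
```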
Idempotency lives at three layers:
- API Gateway — fast rejection of duplicate keys with cached response.
- Service layer — wraps business logic.
- Ledger: PostTransaction has its own idempotency. Even if all other layers fail, the ledger refuses to double-post.
4.4 Sagas for distributed money movement
A payroll disbursement touches: Payroll → Ledger → Integration GW → (Bank) → Ledger again on confirmation. There is no single DB transaction across these. We use the Saga pattern:
Step 1   Payroll: lock run, generate payout intents            (compensation: unlock run)
Step 2   Ledger: post intent transaction (pending state)       (compensation: post reversal)
Step 3   Integration GW: send payout instruction to bank       (compensation: cancel/recall)
Step 4   Await webhook / poll for confirmation
Step 5a  Success: ledger transition pending → posted
Step 5b  Failure: run compensations in reverse order
Implementation: an orchestrator (Payroll Service owns it) drives the saga. Each step is idempotent. State persisted in a saga_instances table. Restart-safe.
Why orchestration over choreography here: money movement requires centralized observability of failure. Choreography spreads error handling across services; ops cannot reason about "where did the disbursement get stuck?"
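A minimal orchestrator for the saga described above might look like the following. It is a sketch under assumptions: steps are synchronous, `SagaInstance` stands in for a row in the saga_instances table, and the names are invented for the example. It shows the two properties the text asks for: completed steps are skipped on restart, and compensations run in reverse order on failure.

```typescript
interface SagaStep {
  name: string;
  run(): void;        // must be idempotent
  compensate(): void; // undoes the effect of run()
}

// Stand-in for a persisted saga_instances row.
interface SagaInstance {
  completed: string[];
  status: "RUNNING" | "COMPLETED" | "COMPENSATED";
}

function runSaga(steps: SagaStep[], instance: SagaInstance): SagaInstance {
  for (const step of steps) {
    // Restart-safe: steps recorded as completed are not run again.
    if (instance.completed.indexOf(step.name) !== -1) continue;
    try {
      step.run();
      instance.completed.push(step.name);
    } catch (err) {
      // Compensate completed steps in reverse order.
      const done = steps.filter((s) => instance.completed.indexOf(s.name) !== -1);
      for (const s of done.reverse()) {
        s.compensate();
      }
      instance.completed = [];
      instance.status = "COMPENSATED";
      return instance;
    }
  }
  instance.status = "COMPLETED";
  return instance;
}
```

Because the orchestrator owns the list of completed steps, ops can answer "where did the disbursement get stuck?" by reading one record.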
4.5 The Outbox pattern (because Kafka is not in your DB transaction)
When a service updates a row AND emits an event, two failure modes exist:
- DB commits, event publish fails → downstream never knows.
- Event publishes, DB rolls back → downstream acts on a fact that didn't happen.
Both are catastrophic. Solution: write the event into an outbox table in the same DB transaction as the state change. A separate publisher process tails the outbox (via PG logical replication / Debezium or simple polling) and publishes to Kafka with at-least-once delivery. Consumers are idempotent.
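The outbox flow can be simulated in memory to show why neither failure mode can occur. Everything here is a stand-in: `InMemoryDb` mimics a database transaction by staging writes, `publishPending` plays the polling publisher, and `DedupingConsumer` plays an idempotent Kafka consumer. None of these names come from the actual codebase.

```typescript
interface OutboxRow {
  id: number;
  topic: string;
  payload: string;
  publishedAt: number | null;
}

interface Tx {
  set(key: string, value: string): void;
  emit(topic: string, payload: string): void;
}

class InMemoryDb {
  state = new Map<string, string>();
  outbox: OutboxRow[] = [];
  private nextId = 1;

  // State change and outbox rows commit together; if `work` throws,
  // neither is applied (mimics a rolled-back DB transaction).
  transact(work: (tx: Tx) => void): void {
    const stagedState: Array<[string, string]> = [];
    const stagedEvents: Array<[string, string]> = [];
    work({
      set: (key, value) => { stagedState.push([key, value]); },
      emit: (topic, payload) => { stagedEvents.push([topic, payload]); },
    });
    stagedState.forEach((kv) => { this.state.set(kv[0], kv[1]); });
    stagedEvents.forEach((ev) => {
      this.outbox.push({ id: this.nextId++, topic: ev[0], payload: ev[1], publishedAt: null });
    });
  }
}

// A row is marked published only after publish() returns, so a crash in
// between leads to a re-publish: at-least-once delivery.
function publishPending(db: InMemoryDb, publish: (row: OutboxRow) => void, now: number): number {
  let sent = 0;
  for (const row of db.outbox) {
    if (row.publishedAt !== null) continue;
    publish(row);
    row.publishedAt = now;
    sent += 1;
  }
  return sent;
}

// Consumers deduplicate on the outbox row id, so redelivery is harmless.
class DedupingConsumer {
  private seen = new Set<number>();
  handled: string[] = [];
  handle(row: OutboxRow): void {
    if (this.seen.has(row.id)) return;
    this.seen.add(row.id);
    this.handled.push(row.payload);
  }
}
```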
4.6 Reconciliation
Three reconciliations run on schedule:
- Internal: journal vs balance projection. Daily. Any drift = page on-call.
- Partner: bank statement vs our outbound payment log. Daily. Breaks go to ops queue with SLA.
- End-to-end: for each payroll run, sum(payslips) must equal sum(disbursements posted) must equal sum(partner-confirmed payouts). Per-run report.
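The end-to-end check is a three-way equality on totals per payroll run. A minimal sketch, with hypothetical type and field names:

```typescript
// Hypothetical inputs for one payroll run, all in minor units.
interface RunTotals {
  payslipsMinor: number[];
  disbursementsPostedMinor: number[];
  partnerConfirmedMinor: number[];
}

interface RunReconciliation {
  ok: boolean;
  breaks: string[];
}

function sumMinor(amounts: number[]): number {
  return amounts.reduce((total, amount) => total + amount, 0);
}

// sum(payslips) == sum(disbursements posted) == sum(partner-confirmed payouts)
function reconcileRun(t: RunTotals): RunReconciliation {
  const payslips = sumMinor(t.payslipsMinor);
  const posted = sumMinor(t.disbursementsPostedMinor);
  const confirmed = sumMinor(t.partnerConfirmedMinor);
  const breaks: string[] = [];
  if (payslips !== posted) {
    breaks.push(`payslips ${payslips} != posted disbursements ${posted}`);
  }
  if (posted !== confirmed) {
    breaks.push(`posted disbursements ${posted} != partner-confirmed ${confirmed}`);
  }
  return { ok: breaks.length === 0, breaks };
}
```

Comparing adjacent pairs rather than all three at once tells ops which hop the break occurred on.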
4.7 Transaction states and the rollback strategy
Ledger transactions have states: PENDING → POSTED (happy path), or PENDING → CANCELLED (compensated before posting), or POSTED → REVERSED (post-hoc fix via reversing transaction).
We never DELETE. We never UPDATE financial columns. We only INSERT.
4.8 Currency, precision, time
- All amounts stored as integers in minor units (cents/santim). Never floats.
- NUMERIC(20, 0) in PG: 20 digits is enough for any realistic ETB amount in santim.
- Multi-currency from day 1 in the schema (currency column), even if we only support ETB at launch.
- Timestamps: TIMESTAMPTZ only, always UTC at storage, render in user timezone in UI.
- FX: we do not provide FX at launch. When we do, FX is its own bounded context with its own rate sourcing and spread management.
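The "integers in minor units" rule can be illustrated with a parse/format pair for ETB (100 santim per birr). The function names are assumptions. Note one limitation of this sketch: a JavaScript number is only exact up to 2^53 − 1, which is narrower than NUMERIC(20, 0), so a real service would likely use bigint or a decimal library at the boundary.

```typescript
// Parses a decimal string such as "1234.50" into santim without ever
// going through floating-point arithmetic on the fractional part.
function parseMinor(text: string): number {
  const m = /^(\d+)(?:\.(\d{1,2}))?$/.exec(text);
  if (m === null) throw new Error(`invalid amount: ${text}`);
  const frac = ((m[2] || "") + "00").slice(0, 2); // "5" -> "50", "" -> "00"
  const minor = Number(m[1]) * 100 + Number(frac);
  if (!Number.isSafeInteger(minor)) throw new Error("amount out of range");
  return minor;
}

// Formats santim back into a two-decimal string for display.
function formatMinor(minor: number): string {
  if (!Number.isSafeInteger(minor) || minor < 0) throw new Error("invalid minor amount");
  const whole = Math.floor(minor / 100);
  const frac = minor % 100;
  return `${whole}.${frac < 10 ? "0" : ""}${frac}`;
}
```

With floats, 0.1 + 0.2 does not equal 0.3; with minor units, parseMinor("0.10") + parseMinor("0.20") is exactly parseMinor("0.30").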
5. Security Architecture
5.1 Threat model (abbreviated)
| Threat | Likelihood | Impact | Primary mitigations |
|---|---|---|---|
| Account takeover (phishing, SIM swap) | High | High | MFA (TOTP preferred), device binding, step-up, anomaly detection |
| Insider abuse (employee moves money) | Medium | Catastrophic | Dual control, audit, separation of duties, JIT access, no prod DB access |
| Partner credential leak | Medium | High | Secrets in Vault, short rotation, per-env credentials, mTLS to partners |
| Ledger corruption (bug or attack) | Low | Catastrophic | Append-only, separate service, isolated DB, dual reconciliation |
| API abuse / scraping | High | Medium | Rate limiting, WAF, bot detection |
| Webhook spoofing | Medium | High | HMAC signatures, IP allowlisting, replay protection (nonce + timestamp) |
| Data exfiltration | Medium | High | Field-level encryption for PII, egress monitoring, DLP, minimum-privilege DB access |
| Supply chain (dep compromise) | Medium | High | SBOM, dependency pinning, signed artifacts, image scanning |
5.2 Identity & Access (humans)
- OAuth2 + OIDC for all auth flows.
- JWT access tokens signed with RS256 (asymmetric — services verify with public key, only auth service holds private key). Rotate keys quarterly, support 2 active keys via JWKS.
- Refresh tokens stored as Argon2id hashes in DB, rotated on every use, family-tracked for replay detection (if an old refresh is reused, invalidate the whole family).
- Session policy:
- Mobile: refresh token valid 30 days, slide on use.
- Web (employer admin): 8h inactivity, 24h max.
- Backoffice: 1h inactivity, 8h max, mandatory MFA.
- MFA: TOTP via authenticator app (preferred). SMS OTP supported but flagged as weaker (SIM-swap-prone in Ethiopia). WebAuthn for backoffice users.
- Step-up auth for any of: changing payout account, large transfer (configurable threshold), changing payroll structure, admin user creation.
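Refresh-token rotation with family tracking, as described in the list above, can be sketched as follows. This is a minimal in-memory illustration: the class name is hypothetical, tokens are generated from a counter (NOT secure; a real implementation would use a CSPRNG), and the injected `hash` function stands in for the Argon2id hashing mentioned in the text.

```typescript
interface RefreshRecord {
  familyId: string;
  used: boolean;
  revoked: boolean;
}

class RefreshTokenStore {
  private records = new Map<string, RefreshRecord>(); // keyed by token hash
  private counter = 0;
  private hash: (token: string) => string;

  constructor(hash: (token: string) => string) {
    this.hash = hash;
  }

  issue(familyId: string): string {
    this.counter += 1;
    const token = `rt-${familyId}-${this.counter}`; // placeholder, not random
    this.records.set(this.hash(token), { familyId, used: false, revoked: false });
    return token;
  }

  // Returns a new token, or null if the presented token is unknown, revoked,
  // or already used. Reuse of an old token revokes the whole family.
  rotate(token: string): string | null {
    const record = this.records.get(this.hash(token));
    if (record === undefined || record.revoked) return null;
    if (record.used) {
      const family = record.familyId;
      this.records.forEach((r) => {
        if (r.familyId === family) r.revoked = true;
      });
      return null;
    }
    record.used = true;
    return this.issue(record.familyId);
  }
}
```

Revoking the whole family on reuse means a stolen token is useless once either the attacker or the legitimate client rotates a second time, at the cost of forcing the real user to log in again.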
5.3 Identity & Access (services — zero trust)
- mTLS between every service. Certificates issued via cert-manager + an internal CA. Short rotation (24–48h).
- No service trusts another service's claim about who the user is. Every internal call carries the original JWT in a forwarded header (X-Forwarded-Authorization). Each service validates it independently.
- Service-to-service authn via SPIFFE/SPIRE identities (year 2+; year 1 use simple mTLS cert subject matching).
- No long-lived secrets in environment variables. Vault-issued, short-lived credentials only.
5.4 RBAC
- Roles per tenant: Owner, HR Admin, Finance Admin, Manager, Employee. Plus platform-side: SupportL1, SupportL2, FraudOps, FinanceOps, Engineer (read-only), Engineer (DBA — break-glass only).
- Permissions are explicit and additive. Roles are named bundles of permissions; the permission check uses the underlying permission, never the role name. This means we can change role definitions without touching code.
- Scopes on tokens for delegated access (e.g. mobile app gets a token with wallet:read wallet:transfer but not payroll:write).
- Attribute-based checks layered on top: an HR Admin in Tenant A can never read Tenant B's employees. Enforced at the data access layer with mandatory tenant_id filtering, and at the DB level via Row-Level Security as defense in depth.
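The layering above (tenant check, token scope, then role-derived permission) can be sketched as one function. The role-to-permission mapping below is purely illustrative and not the actual role definitions; the point is that the check names a permission, never a role.

```typescript
// Hypothetical mapping; roles are just named bundles of permissions.
const ROLE_PERMISSIONS: { [role: string]: string[] } = {
  "HR Admin": ["employee:read", "employee:write"],
  "Finance Admin": ["payroll:write", "wallet:read"],
  "Employee": ["wallet:read", "wallet:transfer"],
};

interface Principal {
  tenantId: string;
  roles: string[];
  scopes: string[]; // scopes carried on the access token
}

function can(principal: Principal, permission: string, resourceTenantId: string): boolean {
  // 1. Tenant isolation: never cross tenants, whatever the role.
  if (principal.tenantId !== resourceTenantId) return false;
  // 2. Delegated access: the token itself must carry the scope.
  if (principal.scopes.indexOf(permission) === -1) return false;
  // 3. The user must hold the permission through at least one role.
  return principal.roles.some(
    (role) => (ROLE_PERMISSIONS[role] || []).indexOf(permission) !== -1,
  );
}
```

Changing what "HR Admin" means is then an edit to the mapping, with no call sites touched.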
5.5 Secrets management
- HashiCorp Vault (or cloud-native KMS-backed alternative) as the only source of secrets in production.
- Dynamic database credentials — Vault issues a per-pod, per-hour DB user with the right grants.
- No secret in Git. Ever. Pre-commit hooks + CI scanning (gitleaks, trufflehog).
- Per-environment isolation — dev/staging/prod credentials are entirely separate trust domains.
5.6 Encryption
- In transit: TLS 1.3 everywhere. HSTS preload on public domains. Internal traffic mTLS.
- At rest: disk encryption (LUKS / cloud provider managed). PG transparent data encryption where available, or full-disk encryption at the volume level.
- Field-level encryption for sensitive PII (national ID number, bank account number, tax ID): envelope encryption with per-tenant data keys, master key in HSM-backed KMS. Application sees plaintext only when needed for the request; logs and DB backups carry ciphertext.
- Tokenization for any card data (if/when we touch cards) — we should aim to never store PANs and route through a PCI-DSS-compliant vendor.
5.7 API security
- WAF in front (Cloudflare or equivalent where data-residency permits, otherwise self-hosted ModSecurity / NAXSI).
- Rate limiting at gateway: per-IP, per-user, per-endpoint. Lower limits for auth endpoints; aggressive lockout on credential stuffing.
- Idempotency-Key required on all POSTs to money endpoints.
- Request signing for partner webhooks: HMAC-SHA256 with timestamp + nonce; reject if timestamp > 5 minutes old.
- CORS: allowlist of origins, not *.
- CSRF: double-submit cookie pattern on the employer web app; not relevant for API tokens.
- Input validation: schema-driven (Zod / class-validator) at the controller boundary. Never trust the client.
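The webhook signing rule (HMAC-SHA256 with timestamp + nonce, five-minute window) can be sketched with Node's built-in crypto module. The canonical string `timestamp.nonce.body`, the hex encoding, and the function names are assumptions made for this example; each partner's real format would be defined by its adapter. The sketch also rejects timestamps more than five minutes in the future, which goes slightly beyond the text.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_SECONDS = 300; // 5 minutes

interface SignedWebhook {
  timestamp: number; // epoch seconds
  nonce: string;
  body: string;
  signature: string; // hex-encoded HMAC-SHA256
}

type WebhookVerdict = "ok" | "stale" | "bad_signature" | "replayed";

function signWebhook(secret: string, timestamp: number, nonce: string, body: string): string {
  return createHmac("sha256", secret)
    .update(`${timestamp}.${nonce}.${body}`)
    .digest("hex");
}

function verifyWebhook(
  secret: string,
  msg: SignedWebhook,
  nowSeconds: number,
  seenNonces: Set<string>,
): WebhookVerdict {
  if (Math.abs(nowSeconds - msg.timestamp) > MAX_SKEW_SECONDS) return "stale";
  const expected = Buffer.from(signWebhook(secret, msg.timestamp, msg.nonce, msg.body), "hex");
  const given = Buffer.from(msg.signature, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return "bad_signature";
  }
  // Nonces are recorded only for authentic messages, so an attacker
  // cannot burn nonces with forged requests.
  if (seenNonces.has(msg.nonce)) return "replayed";
  seenNonces.add(msg.nonce);
  return "ok";
}
```

In production the nonce set would live in a shared store such as Redis with a TTL slightly longer than the skew window, since older messages are already rejected as stale.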
5.8 Anti-fraud
- Velocity rules: N transfers per minute/hour/day; total amount caps.
- Device fingerprinting: track device IDs; flag new devices for step-up.
- Geolocation anomaly: sudden country change → challenge.
- Behavioral patterns: time-of-day, typical recipients; flag deviations.
- New-payee delay: first transfer to a new external account requires a 30-minute hold + MFA confirmation.
- Chargeback / dispute workflow: every disputed transaction reversible via the reversal mechanism.
5.9 OWASP fintech-specific risks (and our defenses)
| Risk | Defense |
|---|---|
| BOLA (Broken Object-Level Auth) | Mandatory tenant + ownership check on every resource access, enforced via base repository |
| Mass assignment | DTO whitelisting, never spread request body into entities |
| SSRF (via webhook URLs, document fetching) | Egress allowlist, block private IP ranges, dedicated egress proxy |
| SQL injection | Parameterized queries everywhere, no string concat for SQL |
| Insufficient logging | Structured logs, audit log, centralized aggregation, alerting |
| Misconfiguration | IaC + drift detection, security baselines codified |
| Vulnerable dependencies | Renovate / Dependabot, daily image scans (Trivy), SBOM published |
5.10 Insider threat & operational security
- No engineer has direct write access to production databases. Period.
- Read-only break-glass access requires an approval ticket, time-boxed (4h), fully audited, with PII redaction at the proxy layer.
- Dual control / maker-checker on: manual ledger adjustments, KYC overrides, fraud-flag overrides, role assignments to platform roles.
- Mandatory just-in-time access for production via SSO + approval workflow (Teleport or equivalent).
- Production data never copied to dev/staging. Synthetic data only.
5.11 Compliance readiness
- NBE (National Bank of Ethiopia): payment licensing, capital requirements, data residency, reporting obligations. The architecture provides an in-country data plane, an audit trail, and reconciliation reports designed to match NBE templates.
- Data Protection Proclamation (Ethiopia) — DPIA per product launch, lawful basis tracked per data category, DSR (data subject request) workflow.
- PCI-DSS readiness even if we don't store cards initially — tokenization mindset, scope reduction, segmented network.
- ISO 27001 / SOC2 trajectory — aim for SOC2 Type 1 by month 18, Type 2 by month 24. Build controls into the engineering workflow now (least privilege, change management, code review, audit logs) — bolting them on later is more expensive.
6. Infrastructure & DevOps
6.1 Kubernetes & topology
- Managed Kubernetes where available (AWS EKS in Cape Town, Azure AKS in South Africa). For regulated data, in-country Kubernetes on bare metal or local cloud (Ethio Telecom DC, Safaricom Ethiopia DC) — depends on partnership.
- Cluster layout:
  - `prod` (in-country, regulated data plane)
  - `prod-edge` (regional cloud, public APIs, observability, non-regulated analytics)
  - `staging`
  - `dev`
- Namespaces per bounded context. Network policies default-deny; explicit allow lists.
- Pod Security Standards: restricted. No privileged containers, no host network, no host paths.
6.2 Containers & images
- All services packaged as OCI images built from distroless or chiselled Ubuntu base images.
- Multi-stage Dockerfiles: build-time deps separated from runtime image.
- Image signing with cosign; admission controller (Kyverno / Gatekeeper) rejects unsigned images in prod.
- Vulnerability scanning at build (Trivy) and at registry (Harbor / ECR scan); CVSS ≥ 7.0 blocks promotion.
- SBOM generated and stored per build.
6.3 CI/CD
- Source: GitHub (or GitLab self-hosted if data residency requires).
- CI: GitHub Actions / Jenkins / GitLab CI — pick one, stick with it. Recommendation: GitHub Actions for speed of setup.
- Pipeline stages:
  1. Lint + type-check
  2. Unit tests
  3. Integration tests (against real PG, real Redis in CI)
  4. Build image
  5. Security scan (Trivy, Semgrep, gitleaks)
  6. Push to registry
  7. Deploy to staging via ArgoCD
  8. Smoke tests + contract tests
  9. Manual promotion to prod (year 1) → automated promotion with canary (year 2)
- GitOps with ArgoCD: the cluster state is whatever is in the `infra/` Git repo. No `kubectl apply` in production.
- Trunk-based development with short-lived feature branches; merge to main behind feature flags.
- Feature flags: Unleash (open source) self-hosted. Every new feature ships behind a flag. Kill-switches on every money-moving feature.
6.4 Observability
- Metrics: Prometheus + Grafana. Per-service RED metrics (Rate, Errors, Duration). Business metrics first-class (payroll runs/day, EWA draws/day, settlement breaks).
- Logs: structured JSON. Loki (or Elastic) for aggregation. Strict log schema; PII redacted at emission.
- Traces: OpenTelemetry SDK in every service, Tempo (or Jaeger) as backend. Mandatory trace propagation through Kafka headers.
- Errors: Sentry for application errors with PII scrubbing.
- Alerting: Alertmanager → PagerDuty (or local equivalent). Alerts have runbooks; alerts without runbooks get deleted.
- SLO-driven: define SLOs per service (e.g. wallet balance read p99 < 200ms, EWA draw success rate ≥ 99.5%). Alert on burn rate, not on individual symptoms.
- Audit/security events routed to a separate, tamper-resistant store with longer retention.
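To make "alert on burn rate" concrete: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes a 30-day SLO period and a two-window paging rule; the function names and the rule itself are illustrative assumptions, not a prescribed alerting policy.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO.
// A burn rate of 1 spends the budget exactly over the SLO period.
function burnRate(sloTarget: number, failed: number, total: number): number {
  if (total === 0) return 0;
  return failed / total / (1 - sloTarget);
}

// Fraction of the period's whole error budget consumed by a window that
// burned at this rate. Assumption: a 30-day (720 h) SLO period by default.
function budgetConsumed(rate: number, windowHours: number, periodHours: number = 30 * 24): number {
  return (rate * windowHours) / periodHours;
}

// Hypothetical paging rule: both a long and a short window must exceed the
// threshold, so a blip that has already recovered does not page anyone.
function shouldPage(longWindowRate: number, shortWindowRate: number, threshold: number): boolean {
  return longWindowRate >= threshold && shortWindowRate >= threshold;
}

// EWA draw SLO of 99.5%: 20 failures in 1000 draws is a 2% error ratio
// against a 0.5% budget, i.e. a burn rate of roughly 4.
const exampleRate = burnRate(0.995, 20, 1000);
console.log(exampleRate.toFixed(2), budgetConsumed(exampleRate, 1).toFixed(4));
```

A burn rate near 4 sustained for a full 30-day period would spend the budget about four times over, which is why the rate, rather than any single failed request, is the thing worth alerting on.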
6.5 Autoscaling
- HPA on CPU + custom metrics (queue depth, request rate).
- Cluster autoscaler for nodes.
- DB scaling: vertical first; read replicas for read scaling; partitioning for write hot-spots (ledger by month).
- Kafka: horizontal via partitions; size partitions for 1 year of growth, not next quarter.
6.6 Disaster recovery
- RPO target: 5 minutes for ledger; 1 hour for non-financial data.
- RTO target: 4 hours full-system recovery.
- Backups:
  - PG: continuous WAL archiving to in-country object storage + daily base backup.
  - Encrypted at rest with keys in HSM/KMS.
  - Backups tested monthly by restoring to an isolated environment and running reconciliation queries. An untested backup is not a backup.
- DR plan:
  - Active-passive across two in-country DCs from year 1.
  - Active-active in year 3 if regulator and topology permit.
  - Annual DR drill; failover under load tested at least once.
6.7 IaC
- Terraform for cloud resources, Helm + Kustomize for Kubernetes manifests.
- All infra in code. Manual changes flagged by drift detection.
- Atlantis or Terraform Cloud for plan/apply with PR-based review.
6.8 Uptime strategy
- 99.9% (= ~43 min/month downtime) target for year 1. 99.95% by year 3.
- Aggressive use of feature flags + canary deploys to limit blast radius.
- Database migrations: forward-only, online via tools like `pg_repack` for heavy changes; never block writes for more than a few seconds.
- Pre-launch chaos drills monthly: kill pods, sever DB, fail a partner.
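The availability targets above translate into downtime budgets by simple arithmetic. A small sketch, assuming a 30-day month; `allowedDowntimeMinutes` and `monthsOfBudget` are illustrative helper names.

```typescript
// Minutes of downtime permitted per period at a given availability target.
function allowedDowntimeMinutes(availability: number, days: number = 30): number {
  return (1 - availability) * days * 24 * 60;
}

// How many months of downtime budget a single incident consumes.
function monthsOfBudget(incidentMinutes: number, availability: number): number {
  return incidentMinutes / allowedDowntimeMinutes(availability);
}

console.log(allowedDowntimeMinutes(0.999).toFixed(1));  // 43.2
console.log(allowedDowntimeMinutes(0.9995).toFixed(1)); // 21.6
console.log(monthsOfBudget(240, 0.999).toFixed(1));     // 5.6
```

The last line is worth noting: if an incident ran for the full 4-hour RTO from §6.6 and all of it counted as downtime, it would use about 5.6 months of the 99.9% budget, so full-system recovery has to be a rare event for the target to hold.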
7. Database & Data Architecture
7.1 Why PostgreSQL is the only acceptable primary store for money
- ACID + true MVCC.
- Mature replication and PITR.
- `NUMERIC`, `JSONB`, `RANGE`, `TIMESTAMPTZ`, exclusion constraints, row-level security, deferrable constraints: every feature we need.
- Massive operational ecosystem.
We do not use MongoDB, DynamoDB, or any eventually-consistent store as the system of record for money. They have legitimate uses elsewhere; not here.
7.2 Database-per-service vs shared
- Year 1: one PG cluster for the monolith (one logical database, schema-per-context). Strong transactional boundaries inside contexts.
- Year 1: separate PG cluster for the Ledger Service. Isolated trust domain.
- As services are carved out, each gets its own database. Never allow another service to read your DB directly. Public API or events only.
7.3 Schema design principles
- Every row has `id` (UUID v7, time-ordered), `created_at`, `updated_at`, `created_by`, `updated_by`.
- Every tenant-scoped row has `tenant_id`. Row-level security policies enforce tenant isolation as defense in depth.
- Soft delete via `deleted_at` for non-financial tables; financial tables have no delete at all.
- Foreign keys everywhere. Yes, even at scale.
- `CHECK` constraints for enums (or a `pg_enum`).
- Money amounts as `NUMERIC(20, 0)` in minor units.
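On the application side, integer minor units map naturally to `bigint`, since `NUMERIC(20, 0)` can exceed the safe integer range of a JavaScript `number`. The sketch below shows a split that never loses or invents a minor unit, plus display formatting; `allocate` and `formatMinor` are hypothetical helpers, not an existing library API.

```typescript
// Split a total across weights so the shares always sum to the total.
// Leftover minor units go to the first shares, one each.
function allocate(totalMinor: bigint, weights: bigint[]): bigint[] {
  const weightSum = weights.reduce((a, b) => a + b, 0n);
  if (totalMinor < 0n || weightSum <= 0n || weights.some((w) => w < 0n)) {
    throw new Error("total must be >= 0 and weights non-negative with a positive sum");
  }
  const shares = weights.map((w) => (totalMinor * w) / weightSum); // bigint division truncates
  // The leftover is always smaller than the number of weights.
  const leftover = Number(totalMinor - shares.reduce((a, b) => a + b, 0n));
  return shares.map((s, i) => (i < leftover ? s + 1n : s));
}

// Render minor units for display only; arithmetic stays in bigint.
function formatMinor(amountMinor: bigint, exponent: number): string {
  const sign = amountMinor < 0n ? "-" : "";
  const digits = (amountMinor < 0n ? -amountMinor : amountMinor).toString().padStart(exponent + 1, "0");
  if (exponent === 0) return sign + digits;
  return `${sign}${digits.slice(0, -exponent)}.${digits.slice(-exponent)}`;
}

console.log(allocate(100n, [1n, 1n, 1n]).join(", ")); // 34, 33, 33
console.log(formatMinor(1234567n, 2));                // 12345.67
```

Splitting 100 minor units three ways with floating point would give 33.33 each and silently drop a unit; the integer version hands the remainder to a deterministic recipient instead.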
7.4 Read replicas
- One streaming replica per PG cluster for read-heavy queries (reports, list endpoints).
- Replica lag monitored; reads that must be consistent (e.g. balance immediately after a transfer) go to primary.
- Reporting service consumes from replica + ClickHouse projection.
7.5 Partitioning
- Ledger entries partitioned by month (`PARTITION BY RANGE (posted_at)`). Old partitions detached and archived after 24 months online retention (still queryable via lazy attach).
- Audit log similarly partitioned.
- Other large tables (transactions, notifications) partitioned by month once they exceed ~100M rows.
7.6 Redis
- Session cache (token introspection cache, MFA challenge state).
- Rate limiting counters (token bucket).
- Idempotency key cache in front of DB.
- Hot-path projections (wallet balance, EWA accrual) with strict TTL.
- NOT a system of record. Never depend on Redis for correctness — it is a cache. Code must function (slower) if Redis is down.
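The token-bucket counters mentioned above follow a simple refill-then-spend rule. A minimal in-memory sketch, assuming an injected clock for testability; in the setup described here the bucket state would live in Redis per key, and `TokenBucket` is an illustrative class name.

```typescript
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an idle client gets its burst
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends tokens if the request is allowed.
  tryConsume(nowMs: number, cost: number = 1): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    // Refill for the elapsed time, never above capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```

Capacity sets the allowed burst and the refill rate sets the sustained limit, so an auth endpoint can get a small bucket with a slow refill while read endpoints get a larger one.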
7.7 Event streaming — Kafka or Redpanda
- Kafka for events. Redpanda is a good single-binary alternative if we want lower ops overhead at startup.
- Topics named `domain.entity.event` (e.g. `payroll.run.disbursed`).
- Schema-registry-backed (Confluent or Karapace) with Avro or Protobuf, never raw JSON for cross-team contracts.
- Exactly-once is hard and expensive — we don't aim for it. Instead: at-least-once delivery + idempotent consumers.
- Retention: 7 days online for most topics, 30 days for audit/financial event topics, then archive to object storage.
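At-least-once delivery means every consumer will eventually see a duplicate, so the handler has to make redelivery harmless. A minimal sketch of an idempotent consumer, assuming each event carries a unique `eventId`; the event shape and class name are hypothetical, and the in-memory set stands in for a table with a unique constraint.

```typescript
// Hypothetical payload for a topic such as payroll.run.disbursed.
interface DisbursedEvent {
  eventId: string;
  employeeId: string;
  amountMinor: bigint;
}

class IdempotentConsumer {
  // Assumption: in production this is a PG table with a unique key on
  // event_id, written in the same transaction as the side effect.
  private readonly processed = new Set<string>();
  private readonly totals = new Map<string, bigint>();

  handle(event: DisbursedEvent): "applied" | "duplicate" {
    if (this.processed.has(event.eventId)) return "duplicate";
    const current = this.totals.get(event.employeeId) ?? 0n;
    this.totals.set(event.employeeId, current + event.amountMinor);
    this.processed.add(event.eventId);
    return "applied";
  }

  total(employeeId: string): bigint {
    return this.totals.get(employeeId) ?? 0n;
  }
}
```

The important property is that the dedupe record and the effect commit together; if they can diverge, a crash between the two brings the double-apply problem straight back.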
7.8 Event sourcing & CQRS
- Not full event sourcing across the system. Costs are high, hiring complexity is real.
- Yes for the ledger journal — by definition, the journal is an immutable event log. The balance projection is a CQRS read model over it. This is the natural fit and it is the only place we apply it.
- CQRS-lite elsewhere: read models in ClickHouse for reporting, refreshed via CDC. Write model in PG.
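The journal-plus-projection idea can be shown in a few lines: entries are append-only, every entry must balance, and balances are derived by folding over the journal. This is a sketch under stated assumptions (balances computed as credits minus debits, as for credit-normal wallet accounts; all type and account names are illustrative), not the Ledger Service's actual model.

```typescript
type Side = "debit" | "credit";
interface Line { account: string; side: Side; amountMinor: bigint }
interface JournalEntry { id: string; lines: Line[] }

function sumSide(entry: JournalEntry, side: Side): bigint {
  return entry.lines.filter((l) => l.side === side).reduce((acc, l) => acc + l.amountMinor, 0n);
}

class Journal {
  private readonly entries: JournalEntry[] = [];

  // Append-only: an entry is accepted only if debits equal credits.
  post(entry: JournalEntry): void {
    if (entry.lines.some((l) => l.amountMinor <= 0n)) throw new Error("amounts must be positive");
    if (sumSide(entry, "debit") !== sumSide(entry, "credit")) {
      throw new Error(`entry ${entry.id} is unbalanced`);
    }
    this.entries.push(entry);
  }

  // Corrections are new entries with the sides swapped, never edits.
  reverse(originalId: string, reversalId: string): void {
    const original = this.entries.find((e) => e.id === originalId);
    if (!original) throw new Error(`unknown entry ${originalId}`);
    this.post({
      id: reversalId,
      lines: original.lines.map((l) => ({ ...l, side: l.side === "debit" ? "credit" : "debit" })),
    });
  }

  // Read model: balance per account as credits minus debits (assumption).
  balances(): Map<string, bigint> {
    const result = new Map<string, bigint>();
    for (const entry of this.entries) {
      for (const line of entry.lines) {
        const delta = line.side === "credit" ? line.amountMinor : -line.amountMinor;
        result.set(line.account, (result.get(line.account) ?? 0n) + delta);
      }
    }
    return result;
  }
}
```

Because every entry balances, the projected balances across all accounts always sum to zero, which gives a cheap invariant to assert in reconciliation queries and backup-restore tests.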
7.9 ClickHouse for analytics
- All operational events (transactions, logins, EWA draws, payroll runs) replicated to ClickHouse via CDC.
- Used for: BI dashboards, regulator reports, risk feature offline store.
- Never queried on the user's request path.
7.10 Object storage
- In-country S3-compatible (MinIO, Wasabi-equivalent, or local cloud provider) for documents (payslips, ID scans, contracts, audit cold storage).
- Object lock / WORM for audit and compliance artifacts.
8. Language & Framework Decisions
We optimize for three things, in this order: correctness for money, velocity for startup, hireability in Ethiopia / East Africa.
8.1 Summary recommendation
| Service / Layer | Language / Framework | Why |
|---|---|---|
| API Gateway | Kong (Lua) or Envoy | Mature, configurable, not our code |
| Identity & Auth | NestJS (TypeScript) | Same stack as monolith; rich ecosystem (Passport, Argon2, jose) |
| DemozPay Core (monolith) | NestJS (TypeScript) | Velocity, hireability, opinionated DI, decorators map cleanly to DDD modules |
| Payroll engine internals | NestJS + a TS calculation library | Determinism via pure functions; reuse with Node workers |
| Wallet (façade over Ledger) | NestJS | Thin orchestration over the Ledger Service |
| Ledger Service | Go | Predictable latency, GC-friendly footprint, easy static binary, clean concurrency for batch posting; small surface area justifies a separate language |
| Integration Gateway | Go | Same — long-lived TCP / HTTP connections, low-overhead workers, ergonomic concurrency, easy to build per-partner adapters |
| Notification Worker | Go OR NestJS | Either works; pick what the team is faster in. NestJS if the team is TS-heavy. |
| Risk / Fraud (year 2 carve-out) | Python (FastAPI) | ML ecosystem (sklearn, XGBoost, LightGBM, feature engineering); only place Python earns its slot |
| Reporting / BI | SQL + dbt; Python for any orchestration | Stick to SQL — the team can hire for it easily |
| Mobile app | Flutter (or React Native) | One codebase, two stores |
| Employer web | Next.js (React, TS) | Already evident in your repo |
8.2 Reasoning, by candidate
NestJS / TypeScript — recommended primary stack.
- Pros: large Ethiopian developer pool already on JS/TS; strong typing (sufficient if `strict: true` is enforced); rich ecosystem (TypeORM/Prisma, BullMQ, class-validator, OpenAPI generation); DI and modular structure map naturally to bounded contexts; Next.js stack is already in use.
- Cons: Node's single-threaded model is unfriendly to CPU-bound work; ecosystem encourages npm bloat; not the best for ultra-low-latency money path.
- Verdict: primary stack for product surfaces, NOT for the ledger.
Go — recommended for ledger, integration gateway, settlement, high-throughput workers.
- Pros: predictable performance, small binaries, excellent stdlib for networking and concurrency, easy to operate (single binary, no runtime install), strong fit for "many adapters in one process," fewer dependencies = smaller attack surface.
- Cons: more verbose than TS; smaller hiring pool locally; team must learn it.
- Verdict: use Go where correctness or latency cost-per-mistake is highest. Not everywhere.
Java / Spring Boot — not recommended for the startup phase.
- Pros: best-in-class for traditional banking core systems; massive ecosystem; battle-tested.
- Cons: heaviness, slow startup, opinionated patterns that increase friction at startup velocity, larger memory footprint per pod, harder to hire mid/senior Java fintech engineers locally.
- Verdict: skip. If you ever build a true core-banking system, revisit. Not for v1.
Rust — not recommended now.
- Pros: performance, safety, growing ecosystem.
- Cons: ramp-up cost is enormous for a startup; hiring is extremely hard locally; over-engineered for our actual perf requirements.
- Verdict: skip year 1–2. Revisit only if a specific high-throughput, safety-critical component justifies it (it probably won't).
Python — only where ML / data work is the primary value.
- Pros: ML ecosystem unmatched; data engineering productivity.
- Cons: production runtime characteristics weaker than Go/Java; type-system add-ons (mypy) still imperfect; not great for transactional services.
- Verdict: Risk Service (year 2), data engineering, ML training. Not for transactional services.
Kotlin — interesting but not justified.
- Pros: better Java; coroutines.
- Cons: hiring pool tiny in Ethiopia; doesn't add anything over Go for our use cases.
- Verdict: skip.
8.3 The "two-language ceiling" rule
We deliberately limit production languages to two in year 1 (TypeScript + Go), three by year 2 (+ Python for Risk). Every additional language is an additional CI pipeline, library, on-call skillset, and security review. Resist the urge to add Rust because someone read a blog.
9. API Design
9.1 External vs internal
- External APIs (mobile, web, partners): REST/JSON with OpenAPI 3.1 specs as source of truth.
- Internal service-to-service: gRPC on the critical path (ledger calls, integration gateway calls). HTTP/JSON for everything else.
- Async: Kafka events with Protobuf or Avro schemas.
9.2 REST conventions
- URL versioning: `/v1/...`. New major version → new path; old version supported for a deprecation window (12 months minimum).
- Resource naming: plural, kebab-case (`/v1/payroll-runs`).
- Standard verbs only. No `POST /v1/users/{id}/do-thing`; instead `POST /v1/users/{id}/actions/do-thing` for verbs that don't map to CRUD.
- Pagination: cursor-based. `?cursor=...&limit=...`. Offset pagination is a trap at scale.
- Filtering: explicit query params; no generic query languages exposed.
- Errors: RFC 7807 (Problem Details). Stable `type` URI per error class; `traceId` included.
- Timestamps: ISO 8601 with timezone (`2026-05-23T10:15:30.123Z`).
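Cursor-based pagination can be sketched as an opaque cursor that encodes the sort key of the last row returned. The example below assumes rows are ordered by `(createdAt, id)` and that timestamps share one UTC format so string comparison matches time order; the cursor encoding and function names are illustrative.

```typescript
import { Buffer } from "node:buffer";

interface Row { id: string; createdAt: string }
interface Page { items: Row[]; nextCursor: string | null }
type SortKey = [string, string]; // [createdAt, id]

function encodeCursor(row: Row): string {
  return Buffer.from(JSON.stringify([row.createdAt, row.id])).toString("base64url");
}

function decodeCursor(cursor: string): SortKey {
  return JSON.parse(Buffer.from(cursor, "base64url").toString("utf8")) as SortKey;
}

function compareKeys(a: SortKey, b: SortKey): number {
  if (a[0] !== b[0]) return a[0] < b[0] ? -1 : 1;
  if (a[1] !== b[1]) return a[1] < b[1] ? -1 : 1;
  return 0;
}

// In SQL this is roughly:
//   WHERE (created_at, id) > ($1, $2) ORDER BY created_at, id LIMIT $3
function listPage(rows: Row[], limit: number, cursor?: string): Page {
  const sorted = [...rows].sort((a, b) => compareKeys([a.createdAt, a.id], [b.createdAt, b.id]));
  const after = cursor ? decodeCursor(cursor) : null;
  const remaining = after
    ? sorted.filter((r) => compareKeys([r.createdAt, r.id], after) > 0)
    : sorted;
  const items = remaining.slice(0, limit);
  const last = items[items.length - 1];
  const nextCursor = last !== undefined && remaining.length > limit ? encodeCursor(last) : null;
  return { items, nextCursor };
}
```

Including `id` as a tie-breaker matters: two rows with the same timestamp would otherwise be skipped or repeated at a page boundary. Unlike offset pagination, the query cost stays flat however deep the client pages.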
9.3 Required headers
- `Authorization: Bearer ...` for authenticated calls.
- `Idempotency-Key: <uuid>` REQUIRED for all `POST` to money endpoints; accepted for all other POSTs.
- `X-Request-ID` propagated through every service for tracing.
- `Accept-Language` for i18n.
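A sketch of how a money endpoint might honor `Idempotency-Key`: the first request executes and its result is stored; a retry with the same key and payload replays the stored result; the same key with a different payload is rejected. The status codes for the error cases, the in-memory store, and the function names are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

interface HttpResult { status: number; body: string }
interface StoredResult { payloadHash: string; result: HttpResult }

// Assumption: in production this is a DB table keyed by (tenant, key) with
// a unique constraint, so two concurrent first requests cannot both execute.
const idempotencyStore = new Map<string, StoredResult>();

function handleMoneyPost(
  key: string | undefined,
  payload: string,
  execute: () => HttpResult,
): HttpResult & { replayed: boolean } {
  if (!key) {
    return { status: 400, body: "Idempotency-Key header is required", replayed: false };
  }
  const payloadHash = createHash("sha256").update(payload).digest("hex");
  const existing = idempotencyStore.get(key);
  if (existing) {
    if (existing.payloadHash !== payloadHash) {
      return { status: 422, body: "Idempotency-Key reused with a different payload", replayed: false };
    }
    return { ...existing.result, replayed: true }; // replay, do not re-execute
  }
  const result = execute();
  idempotencyStore.set(key, { payloadHash, result });
  return { ...result, replayed: false };
}
```

Hashing the payload catches the case where a client bug reuses a key for a different transfer, which would otherwise silently return the wrong stored response.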
9.4 OpenAPI spec-first
- Specs live in the repo at `apis/<service>/v1/openapi.yaml`.
- Server stubs and client SDKs generated from specs (NestJS via `nestjs-swagger`, Go via `oapi-codegen`).
- Lint with Spectral; CI blocks merges that break the spec without a version bump.
9.5 gRPC for internal
- `.proto` files in `apis/proto/`. Buf for linting + breaking-change detection.
- Backward-compatible changes only on a minor version. Major (breaking) requires a new package name.
- mTLS + token forwarding in metadata.
9.6 Webhooks (incoming and outgoing)
- Outgoing to employers / partners:
  - HMAC-SHA256 signature in `X-DemozPay-Signature` header over `timestamp + body`.
  - Replay protection: timestamp + nonce, reject if > 5 min skew.
  - Retry policy: exponential backoff with jitter, max ~24h, then dead-letter.
  - Webhook delivery log persisted per attempt; UI for partners to view + replay.
- Incoming from banks / wallets:
  - Per-partner signature verification (each partner has its own scheme; normalized inside the adapter).
  - Idempotent processing (partner-provided reference + nonce).
  - All raw payloads stored for audit and dispute defense.
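The outgoing retry policy (exponential backoff with jitter, bounded at roughly 24 hours, then dead-letter) can be expressed as a delay schedule. The sketch below uses "equal jitter" (half fixed, half random) so no delay is ever zero; the base delay, per-delay cap, and function name are assumptions.

```typescript
// Returns the delays (ms) between delivery attempts. When the schedule is
// exhausted, the total retry budget is spent and the event is dead-lettered.
function retrySchedule(
  baseMs: number,
  maxDelayMs: number,
  budgetMs: number,
  random: () => number = Math.random,
): number[] {
  if (baseMs < 2 || maxDelayMs < 2) throw new Error("delays must be at least 2 ms");
  const delays: number[] = [];
  let elapsed = 0;
  for (let attempt = 0; ; attempt += 1) {
    // Exponential growth, capped per delay.
    const ceiling = Math.min(maxDelayMs, baseMs * Math.pow(2, attempt));
    // Equal jitter: half the ceiling is fixed, the other half is random.
    const delay = Math.floor(ceiling / 2 + random() * (ceiling / 2));
    if (elapsed + delay > budgetMs) return delays;
    delays.push(delay);
    elapsed += delay;
  }
}

// Example: 1 s base, 1 h cap per delay, 24 h total budget.
const schedule = retrySchedule(1000, 60 * 60 * 1000, 24 * 60 * 60 * 1000);
console.log(`${schedule.length} retries before dead-letter`);
```

Jitter spreads retries from many failed deliveries across time, so a partner endpoint coming back from an outage is not hit by every queued webhook in the same second.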
9.7 SDKs
- Year 1: only generated TS SDK for the Next.js admin app and a generated Dart SDK for Flutter. Don't manually maintain SDKs.
- Year 2+: published partner SDK (TS + Python + Go) if there is real partner demand.
10. Financial Integrations
10.1 Architecture pattern: Anti-Corruption Layer per partner
Every external partner's API is messy in its own special way. The Integration Gateway has one adapter per partner that translates between the partner's domain (statuses, ID formats, error codes, retry semantics) and our domain. The rest of the system speaks only the canonical internal API.
                      ┌─────────────────────────────────────────────┐
                      │             INTEGRATION GATEWAY             │
                      │                                             │
     Internal gRPC    │ ┌────────────────┐   ┌───────────────┐      │  External
     (canonical) ───► │ │ Canonical API  │   │ Adapter A     │ ───► │ ──► Partner A
                      │ │ + Orchestration│   │ (CBE)         │      │
                      │ │ + Circuit Br.  │   ├───────────────┤      │
                      │ │ + Retry        │   │ Adapter B     │ ───► │ ──► Partner B
                      │ │ + Idempotency  │   │ (Telebirr)    │      │
                      │ │ + Audit        │   ├───────────────┤      │
                      │ │ + Webhooks     │   │ Adapter C     │ ───► │ ──► Partner C
                      │ └────────────────┘   │ ...           │      │
                      │                      └───────────────┘      │
                      └─────────────────────────────────────────────┘
10.2 Resilience patterns (mandatory per adapter)
- Circuit breaker (e.g. `sony/gobreaker`): open after N consecutive failures or X% error rate over a window; half-open probes; close on success. Per-partner state.
- Retry with exponential backoff + jitter; cap at 3–5 attempts; only retry on idempotent + retryable errors (5xx, network timeouts). Never retry on 4xx that indicates a business reject.
- Timeout budget per call: e.g. 10s connect, 30s total. Set both, not just one.
- Bulkhead: per-partner worker pool so a slow partner doesn't starve fast ones.
- Idempotency: every outbound call carries our internal idempotency key + a per-partner-required field. Re-issues use the same key.
- Dead-letter queue for permanently failed payouts; ops queue + alert.
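The circuit-breaker state machine described above (closed, open, half-open probe) is small enough to sketch directly. This is a TypeScript illustration of the state transitions only, using consecutive failures as the trip condition and a synchronous call for brevity; it does not mirror the API of `sony/gobreaker` or any other library, and the class name is hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAtMs = 0;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;

  constructor(maxFailures: number, cooldownMs: number) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
  }

  call<T>(nowMs: number, fn: () => T): T {
    if (this.state === "open") {
      // Fail fast while cooling down; the partner is not called at all.
      if (nowMs - this.openedAtMs < this.cooldownMs) throw new Error("circuit open");
      this.state = "half-open"; // cooldown over: let one probe through
    }
    const probing = this.state === "half-open";
    try {
      const result = fn();
      this.state = "closed";
      this.consecutiveFailures = 0;
      return result;
    } catch (err) {
      this.consecutiveFailures += 1;
      // A failed probe, or too many failures in a row, opens the circuit.
      if (probing || this.consecutiveFailures >= this.maxFailures) {
        this.state = "open";
        this.openedAtMs = nowMs;
      }
      throw err;
    }
  }

  currentState(): BreakerState {
    return this.state;
  }
}
```

One breaker instance per partner keeps the state isolated, so a failing partner trips only its own circuit while calls to healthy partners continue.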
10.3 Webhook processing (the most failure-prone surface)
- Webhook endpoints exposed via a separate ingress, behind WAF, with per-partner IP allowlist if the partner provides stable egress IPs.
- Signature verification first, before parsing.
- Always return 200 OK quickly once the signature is verified: accept the payload, write it to a durable queue, process async. Partners that get 5xx will retry, and at high volume that causes thundering herds.
- Idempotent processing: deduplicate by `(partner, partner_reference)`.
- All payloads, headers, signatures stored verbatim for at least 90 days. Dispute defense.
10.4 Reconciliation flow
Day N+1, 02:00:
1. Fetch partner statement (SFTP, API, or manual file ingest)
2. Parse into canonical statement model
3. For each partner record: look up our internal record by (partner, reference)
- Match + same amount + same status → OK
- Mismatch / missing → break
4. For each internal record without a partner record → unmatched break
5. Produce report: matched, breaks, totals
6. Auto-resolve trivial breaks (timing, status mappings)
7. Open ops ticket for non-trivial breaks with SLA (e.g. resolve within 48h)
Every reconciliation run is itself audited. No manual ledger adjustment is allowed except via a workflow that references a recon break ID.
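Steps 3 to 5 of the flow above amount to a two-way match on reference. A minimal sketch for a single partner's statement, assuming both sides are already parsed into the same canonical shape; the record fields, break kinds, and function name are illustrative assumptions.

```typescript
interface StatementRecord {
  reference: string;
  amountMinor: bigint;
  status: "settled" | "failed" | "pending";
}

type BreakKind = "amount_mismatch" | "status_mismatch" | "missing_internal" | "missing_partner";

interface ReconReport {
  matched: string[];
  breaks: { reference: string; kind: BreakKind }[];
}

function reconcile(partner: StatementRecord[], internal: StatementRecord[]): ReconReport {
  const internalByRef = new Map<string, StatementRecord>();
  for (const record of internal) internalByRef.set(record.reference, record);

  const report: ReconReport = { matched: [], breaks: [] };
  const seenRefs = new Set<string>();

  // Step 3: every partner record must match an internal record exactly.
  for (const theirs of partner) {
    seenRefs.add(theirs.reference);
    const ours = internalByRef.get(theirs.reference);
    if (!ours) {
      report.breaks.push({ reference: theirs.reference, kind: "missing_internal" });
    } else if (ours.amountMinor !== theirs.amountMinor) {
      report.breaks.push({ reference: theirs.reference, kind: "amount_mismatch" });
    } else if (ours.status !== theirs.status) {
      report.breaks.push({ reference: theirs.reference, kind: "status_mismatch" });
    } else {
      report.matched.push(theirs.reference);
    }
  }

  // Step 4: internal records the partner never reported.
  for (const ours of internal) {
    if (!seenRefs.has(ours.reference)) {
      report.breaks.push({ reference: ours.reference, kind: "missing_partner" });
    }
  }
  return report;
}
```

Typing each break by kind is what makes step 6 possible: a rule can auto-resolve a `status_mismatch` where the partner still shows `pending`, while amount mismatches always go to the ops queue.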
10.5 What we hide behind the gateway
- Per-partner credentials (Vault-issued).
- Per-partner retry / timeout config.
- Per-partner rate limits and quotas.
- Per-partner schema differences.
- Per-partner status code → canonical status mapping.
- Per-partner webhook signature scheme.
If a partner is replaced (e.g. switching mobile-money providers), the rest of DemozPay doesn't change.
11. Product Engineering Strategy
11.1 MVP (months 0–4)
Build only what is needed to make payroll + EWA real with one employer.
- Tenancy & Identity (employer + employee accounts, MFA).
- Workforce (employees, departments, salaries — CSV import).
- Payroll (simple structure: gross → tax → pension → net; one pay cycle; manual approve & disburse).
- Ledger (production-grade from day 1; this is not where the MVP cuts corners).
- Wallet (employee wallet view, transaction history).
- EWA (basic draw, fixed cap, repayment auto-netted at payroll).
- One bank/wallet integration (e.g. Telebirr OR CBE — whoever signs first).
- Notifications (SMS for OTP and payroll; email for employer reports).
- Admin console (support: lookup users, view transactions, manual reversal with dual control).
- Audit log + basic reconciliation report.
That's the MVP. Roughly 4 months with the right team.
11.2 What we DO NOT build in MVP
- Loans, BNPL, Equb. (Don't build credit before you have salary truth.)
- ML risk models. (Use rules-based risk.)
- Multi-currency.
- Multi-region.
- Public partner API.
- Mobile app native features beyond what employees need to draw EWA and view balance.
- Anything for "future products" — every line of code is a liability.
11.3 Sequencing (months 4–18)
Month 4–6 Hardening: SOC2 controls, real DR drill, second partner
Month 6–9 Loans v1 (salary-backed term loan), settlement service, KYC vendor live
Month 9–12 BNPL v1, risk engine v1 (rules + a simple model), expanded payroll
(commissions, bonuses, overtime, multi-pay-cycle)
Month 12–15 Equb, expanded admin tooling, regulator reporting automation
Month 15–18 Carve-outs (KYC, Risk, Reporting), regional expansion preparation
11.4 Scaling roadmap
| Phase (users) | Architectural moves | Focus |
|---|---|---|
| 0 → 10k | 1 cluster, monolith + 3 satellites, vertical PG, single AZ | Focus on correctness and ops hygiene |
| 10k → 100k | Read replica, partition ledger, Kafka, CDC to ClickHouse, multi-AZ | Carve KYC, Risk, Reporting |
| 100k → 1M | Carve Payroll, Lending; introduce service mesh; Redis cluster; horizontal PG via Citus/sharding by tenant if write-bound | Expand DR to active-passive across DCs |
| 1M+ | Active-active across regions where legally possible; per-region data planes with cross-region settlement | Mature platform team owns this transition |
11.5 How we avoid over-engineering
Three rules:
- No abstraction without 3 use cases. No "future-proof" interfaces.
- Boring tech first. Postgres before Cassandra. NestJS before Rust. Cron before Temporal (until you genuinely need workflows).
- Carve out only on trigger (§2.4). Not on schedule, not on hype.
12. Team Structure & Operating Model
12.1 Founding team (months 0–6, ~8–12 engineers)
- Engineering Lead / CTO (you).
- 2 backend engineers (TS/NestJS) — product surfaces.
- 1 backend engineer (Go) — ledger + integration gateway. Senior; this is the most critical hire.
- 1 mobile engineer (Flutter).
- 1 web engineer (Next.js).
- 1 DevOps / SRE — Kubernetes, CI/CD, observability.
- 1 security engineer / lead — can be part-time or fractional consultant initially, but engaged from week 1.
- 1 QA / test engineer — biased toward automation.
- 1 product manager.
- 1 designer.
- 1 fintech-ops lead — owns the partner relationships, reconciliation reviews, regulator interaction.
12.2 Squad model (months 6–18, ~25–30 engineers)
- Payroll Squad — payroll engine, employer admin, workforce.
- Money Squad — ledger, wallet, settlement, reconciliation.
- Lending Squad — EWA, loans, BNPL, collections.
- Identity & Compliance Squad — auth, KYC, audit, regulator reports.
- Platform Squad — CI/CD, Kubernetes, observability, developer experience.
- Data & Risk Squad — analytics, ML, fraud.
- Mobile Squad.
- Web Squad.
Each squad: ~4–6 engineers, 1 PM, 1 designer, embedded QA, embedded security buddy.
12.3 Cross-cutting functions
- SRE (year 1: 1 person; year 2: 3–5 person team). On-call rotations across all squads with platform SRE leading.
- Security (year 1: 1 lead + external auditor; year 2+: dedicated team of 3+ covering AppSec, IAM, GRC).
- Data Engineering (year 2: dedicated; year 1: shared by platform + risk).
- Fintech Operations (recon, ops, fraud queue, customer escalations) — 2–3 people from launch, scaling with users.
- Compliance / Legal — fractional → in-house as licensing matures.
12.4 Engineering practices
- Trunk-based development, feature flags, CI gates.
- Code review mandatory, with 2 approvers for changes to ledger / auth / payroll calculation / integration gateway.
- Pair programming encouraged, mandatory for critical changes (anything in the "money-touching" set).
- Postmortems blameless, with action items tracked and reviewed monthly.
- Architecture Decision Records (ADRs) for every non-trivial decision. PR-reviewed. Stored in `docs/adr/`.
- Weekly architecture review for cross-squad changes.
13. Implementation Roadmap
13.1 First 90 days — milestones
| Week | Milestone |
|---|---|
| 1–2 | Repo layout, CI/CD scaffolding, Kubernetes dev cluster, ArgoCD, Vault, base Postgres + Redis. Skeleton NestJS monolith + Go ledger service with gRPC contract. |
| 3–4 | Identity & Auth complete (registration, login, MFA, RBAC). Tenancy bootstrap. Audit log shared library. |
| 5–6 | Workforce module: employees, departments, CSV import. Payroll model + tax-rule engine (deterministic, versioned). |
| 7–8 | Ledger production-ready: posting, idempotency, balance projection, reversal. Wallet façade. First end-to-end "fake disbursement" run. |
| 9–10 | First partner integration in the Integration Gateway (one bank or one wallet). Webhook receiver. Reconciliation v1. |
| 11–12 | EWA flow end-to-end. Notification worker. Admin backoffice v1 (lookup, manual reversal with dual control). |
| 13 | Internal soft launch with 1 employer, ~50 employees. |
13.2 Days 90–180
- Second partner live.
- Settlement & Reconciliation Service formalized.
- SOC2 readiness controls codified.
- KYC vendor integrated, identity proofing flow end-to-end.
- Load testing to 10× projected first-year volume.
- DR drill #1.
13.3 Days 180–365
- Loans v1.
- Risk engine v1 (rules + first model).
- BNPL pilot.
- Carve out Risk Service (Python).
- Regional cloud edge for analytics + non-regulated workloads.
14. Appendix
14.1 Architecture Decision Records to write on day 1
- ADR-001: Modular monolith + 3 satellites (this document).
- ADR-002: Ledger as separate service in Go.
- ADR-003: PostgreSQL as primary store of record.
- ADR-004: Outbox + Kafka for cross-context events.
- ADR-005: Idempotency-Key as required header for money endpoints.
- ADR-006: Balance derived from journal, projection table cached.
- ADR-007: mTLS + JWT forwarding (zero trust internal).
- ADR-008: Vault for secrets; no env-var secrets in prod.
- ADR-009: ArgoCD + GitOps; no `kubectl apply` in prod.
- ADR-010: Two-language ceiling year 1 (TypeScript + Go).
14.2 Anti-patterns to actively avoid
- ❌ Storing `wallet.balance` as a column updated by application code.
- ❌ Sharing a database across services after a service is carved out.
- ❌ Letting any non-Ledger service issue DELETE on financial rows.
- ❌ Optional `Idempotency-Key` headers.
- ❌ Synchronous chains of >3 service hops on a user-facing request path.
- ❌ Treating Kafka delivery as transactional with DB writes (use Outbox).
- ❌ Long-lived service credentials in env vars.
- ❌ Production database access by engineers (any time, for any reason, without break-glass).
- ❌ Manual ledger adjustments without dual control and a recon-break reference.
- ❌ Multi-region active-active before active-passive is proven.
- ❌ Rewriting from scratch when the modular monolith starts feeling slow — carve, don't rewrite.
14.3 The one-page "what is DemozPay engineering" summary
A modular NestJS monolith for product surfaces. A Go ledger service that is the source of truth for money, isolated by language, database, and review process. A Go integration gateway that owns every outbound bank/wallet call with circuit breakers and reconciliation. A separate notification worker. PostgreSQL everywhere it matters. Kafka via the outbox pattern for fan-out. Kubernetes deployments via GitOps with Vault-issued secrets and mTLS between services. JWTs with step-up MFA for users. Idempotency-Key required on every money endpoint. Double-entry, immutable journal. Daily reconciliation against every partner. Two languages in year one, three in year two. Carve out services only on trigger, never on schedule. Boring choices everywhere, except where boredom would silently corrupt the ledger — and there, we are paranoid.
Document owner: CTO / Founding Engineering
Version 1.0 (to be revised after the first 90 days and quarterly thereafter)