Demoz Pay — Core Banking, Microservices, KYC & Resilience Plan
⚠️ Superseded / historical (archived 2026-05-26). DemozPay is a modular monolith with selective Go extraction (see docs/adr/ADR-001-modular-monolith.md and ADR-010-two-language-ceiling.md), not microservices-first. The "microservice catalogue" / decomposition sections below do not reflect the current architecture; treat them as exploratory only. Authoritative: docs/adr/ + docs/architecture/restructure-2026-05.md. The Ethiopia banking / FI / wallet integration, KYC, and regulatory research here is still useful reference, which is why this is archived, not deleted.
Author: Engineering · Audience: Demoz Pay engineering + leadership · Status: Draft v1 · Last updated: 2026-05-23
This document is the canonical reference for how Demoz Pay will integrate with Ethiopian banks, financial institutions (FIs), and wallet operators; how the platform will be decomposed into microservices; how user identity and KYC will be handled; and how the system will achieve fintech-grade security and near-zero-downtime availability.
Table of contents
- Executive summary
- Glossary
- Current state — what we already have
- How core banking actually works in Ethiopia
- Target architecture — overview
- Microservice catalogue
- Bank / FI / Wallet integration layer
- User identity & KYC
- Security architecture (defence-in-depth)
- Data consistency, ledger & idempotency
- High availability & zero-downtime deployments
- Observability & operations
- Regulation & compliance
- Migration plan (strangler fig from monolith)
- Risk register
- Open decisions
1. Executive summary
Demoz Pay is moving from a single-NestJS-monolith + single-Postgres setup toward a domain-driven microservice platform that can plug into many banks, MFIs and wallet operators, ship money safely, and stay up under real load.
The plan rests on five pillars:
- A single internal "money API" — every product (payroll, EWA, BNPL, loans, equb) talks to one normalised payments interface. Each bank / wallet / FI sits behind an adapter that translates that interface to the provider's specifics.
- Domain-owned services, each with its own database, each reachable only via its public API. Cross-service consistency is enforced through events and sagas, never through direct cross-database writes.
- A double-entry ledger (already modelled in Prisma) as the system of record for all value movement. Wallet balances are derived, never authoritative.
- Defence in depth for security: network, application, data, identity, secret, and audit layers each enforce policy independently. PII and money never leave Ethiopia (data residency).
- Operational discipline: blue-green deploys, expand-contract DB migrations, multi-AZ active-active, OpenTelemetry-instrumented, SLO-driven, with runbooks for every critical path.
We do not rewrite the monolith in one go. We use the strangler fig pattern — new domains are extracted one by one, behind a stable API gateway, with feature flags and shadow traffic. The monolith is deprecated last, not first.
2. Glossary
| Term | Definition |
|---|---|
| Core banking system (CBS) | The transactional engine inside a bank that maintains accounts and posts movements. Common products: Temenos T24/Transact, Oracle Flexcube, Infosys Finacle, Path Solutions iMAL. |
| ISO 20022 | International messaging standard for financial transactions. Replacing SWIFT MT formats globally. |
| ISO 8583 | Older message standard, still used for card/POS transactions. |
| PSP | Payment Service Provider (Telebirr, M-Birr, HelloCash, Amole, CBE Birr, etc.). |
| PSD2 / Open Banking | Regulatory frameworks (EU, UK) requiring banks to expose customer-permissioned APIs. NBE has signalled similar in its 2025 directive. |
| NBE | National Bank of Ethiopia — the regulator. |
| Fayda | Ethiopia's national digital ID (FCN/National ID Number). |
| KYC | Know Your Customer — the identity-verification process. |
| AML / CFT | Anti-Money Laundering / Countering the Financing of Terrorism. |
| CDD / EDD | Customer Due Diligence (normal) / Enhanced Due Diligence (high-risk). |
| HSM | Hardware Security Module — tamper-resistant key store. |
| mTLS | Mutual TLS — both client and server present certificates. |
| Saga | A distributed transaction pattern using compensating actions. |
| Outbox | A reliable-events pattern using a DB table as a durable queue. |
| Strangler fig | Migration pattern: route piece by piece away from a monolith until the monolith is gone. |
| SLO / SLI | Service Level Objective / Indicator — measurable reliability targets. |
| RPO / RTO | Recovery Point Objective (acceptable data loss) / Recovery Time Objective (acceptable downtime). |
3. Current state — what we already have
3.1 Code
- Monorepo (Nx 22, pnpm) with one NestJS server and five Next.js frontend apps (admin, business, client, fi, bnpl-partner) plus a Docusaurus docs site.
- Server modules wired in AppModule: auth, business, employee, prisma. That's it: no payment, ledger, lending, BNPL, EWA or equb modules are implemented yet despite being modelled in Prisma.
- Single docker-compose.yml with one Postgres instance for the whole monorepo.
- No event bus, message broker, cache, secrets manager, observability stack, or HSM provisioned.
3.2 Prisma schema (already strong)
The schema is more advanced than the implementation. It already models:
- Identity: User, Role (RBAC), AdminProfile, UserRole enum, 2FA fields, AuditLog.
- Business: Business, Department, Employee, EmployeeAbsence.
- Double-entry ledger: LedgerAccount, Transaction, JournalEntry (with debit/credit, account types ASSET/LIABILITY/EQUITY/REVENUE/EXPENSE).
- Payroll: Payroll, PayrollEntry, PayrollStatus.
- Lending: BusinessLoan, EmployeeLoan, LoanPayment, LoanStatus.
- Embedded finance: EarlyWageAccess, BNPLPurchase, BNPLPayment, Equb, EqubMember, EqubPayout.
- Money: Wallet, WalletTransaction, WithdrawalRequest, BillPayment, Expense, SavingGoal.
- Counterparties: FinancialInstitution, BNPLPartner, Merchant.
- Settlement: SettlementBatch, SettlementRecord, SettlementType, SettlementStatus.
- Shared enums: KYCStatus, PaymentMethod, WithdrawalMethod.
3.3 Gaps
| Capability | Modelled? | Implemented? |
|---|---|---|
| RBAC | ✓ | partial |
| Bank/wallet adapters | ✗ | ✗ |
| KYC orchestration | enum only | ✗ |
| Ledger posting service | ✓ | ✗ |
| Settlement engine | ✓ | ✗ |
| Reconciliation jobs | ✗ | ✗ |
| Event bus | ✗ | ✗ |
| Idempotency layer | ✗ | ✗ |
| Audit immutability (WORM) | log model exists | not WORM |
| Secrets management | env files | env files |
| HSM / KMS | ✗ | ✗ |
| Multi-region / multi-AZ | ✗ | ✗ |
| Observability stack | ✗ | ✗ |
Conclusion: the Prisma schema is a solid blueprint, but implementation of the money paths is at zero. That means there is little existing money data to migrate; the main job is to build those paths right the first time while the monolith continues to cover identity and business administration.
4. How core banking actually works in Ethiopia
To integrate "with many banks and many financial institutions", we need a realistic view of what's on the other side of the wire.
4.1 The technology landscape
| Provider | Core banking system | Typical integration surface |
|---|---|---|
| CBE | Temenos T24 + custom | SOAP / file dropbox / SFTP batches; emerging REST APIs through CBE Birr |
| Awash, Dashen, Abyssinia | Oracle Flexcube | SOAP web services; sometimes REST gateways through fintech sandboxes |
| Wegagen, Hibret, NIB | Path Solutions iMAL | SFTP files; some REST via partners |
| Cooperative Bank of Oromia, Zemen | Infosys Finacle | SOAP; ISO 8583 for card rails |
| Microfinance (Omo, ACSI, AdCSI) | Various, sometimes Excel + manual | Often portal upload + manual reconciliation |
| Telebirr | Ethio Telecom — Huawei mobile money | REST API (HMAC + RSA signing); webhook callbacks |
| M-Birr, HelloCash, Amole, CBE Birr | Various | REST APIs; HMAC + JWT or RSA; mobile money rails |
| Card networks (where used) | Visa / Mastercard via card processors | ISO 8583 over MPLS; not your usual REST |
4.2 The patterns you actually encounter
- Synchronous REST/JSON — newest wallet providers and a few "open banking" sandboxes. Easiest, but throughput and uptime vary wildly.
- SOAP / XML — legacy core banking. Strong typing via WSDL, but you'll fight with date/timezone formats and exception channels.
- SFTP batch files: daily/intra-day file drops (often pipe-delimited or fixed-width). Settlement reports, bulk credit files, reversal files. Many bank disbursement rails still work this way in Ethiopia.
- ISO 8583 / ISO 20022 — card/Switch integration (ETHSwitch), high-value transfers (RTGS). You don't speak this directly; you integrate through a processor or the bank's API gateway.
- Webhook callbacks — providers push you success/failure updates asynchronously. Always treat as untrusted until verified.
4.3 What you need at the application boundary
For every provider, regardless of the protocol underneath, you need:
- Authentication — usually OAuth2 client credentials, HMAC-signed requests, or mutual TLS. Often combined.
- Idempotency — every payment send must carry a unique idempotency key that the provider honours. If they don't honour it, we guarantee it on our side by enforcing single-execution semantics on the adapter.
- Webhook signing — HMAC over the body with a shared secret, or detached JWS. Reject anything without a valid signature.
- Reversal & refund semantics — different per provider. Some support reversals within N hours; some require manual intervention.
- Settlement reports — daily files or API endpoints listing everything they consider final. We must reconcile our books to these reports daily before opening for new business each day.
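The webhook-signing requirement above can be sketched as follows. This is a minimal illustration, not any provider's actual scheme: it assumes HMAC-SHA256 over the raw body with a hex-encoded signature, and the function name, secret and payload are hypothetical. Real providers may use base64, detached JWS, or additional timestamp headers.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/**
 * Verify an HMAC-SHA256 signature over the raw webhook body.
 * Assumption: the provider sends the signature as a hex string.
 */
function verifyWebhookSignature(
  rawBody: Buffer,
  signatureHex: string,
  sharedSecret: string,
): boolean {
  const expected = createHmac("sha256", sharedSecret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Usage: sign a body the way a hypothetical provider would, then verify it.
const secret = "demo-shared-secret";
const body = Buffer.from(JSON.stringify({ attemptId: "att_1", status: "SUCCESS" }));
const goodSignature = createHmac("sha256", secret).update(body).digest("hex");

console.log(verifyWebhookSignature(body, goodSignature, secret)); // true
console.log(verifyWebhookSignature(Buffer.from("tampered"), goodSignature, secret)); // false
```

The signature is checked against the raw bytes, before any JSON parsing, because re-serialising the body can change it and invalidate the signature.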
4.4 Operational realities
- Bank APIs are not 24/7. Many sandbox/production endpoints are available only during banking hours, with downtime windows for batch jobs.
- Webhooks are not reliable. They may not arrive, may arrive twice, may arrive out of order. Always reconcile.
- "Same-day settlement" is conditional: funds settle the same business day only if the payment is initiated before cut-off; otherwise it is effectively T+1 (next business day).
- Limits and FX rules change by directive; we must build a config-driven limits engine, not hard-code thresholds.
5. Target architecture — overview
5.1 Logical view
┌────────────────────────────────────────┐
│ Edge / WAF │
│ Cloudflare-style: DDoS, bot, rate │
└──────────────────┬─────────────────────┘
│ TLS 1.3
┌──────────────────▼─────────────────────┐
│ API Gateway (Kong/Apollo) │
│ AuthN (JWT), AuthZ (OPA), Rate-limit │
└──────────────────┬─────────────────────┘
│ mTLS · internal
┌──────────────────────────────┼──────────────────────────────┐
│ │ │
┌──────▼──────┐ ┌────────────┐ ┌────▼─────┐ ┌────────────┐ ┌────▼──────┐
│ Identity │ │ Business │ │ Payments │ │ Lending │ │ Embed. │
│ & KYC │ │ Tenancy │ │ Orches. │ │ (Loans, │ │ Finance │
│ │ │ │ │ │ │ EWA) │ │ (BNPL, │
│ │ │ │ │ │ │ │ │ Equb) │
└────┬────────┘ └─────┬──────┘ └────┬─────┘ └─────┬──────┘ └────┬──────┘
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌────────────────┐ │ │
│ │ │ Ledger (DBL- │◄──────┴──────────────┘
│ │ │ entry, SoR) │
│ │ └───────┬────────┘
│ │ │
│ ▼ ▼
│ ┌────────────┐ ┌──────────────────┐
│ │ Payroll │ │ Banking-Adapter │
│ │ Engine │ │ Hub (per-prov.) │
│ └────────────┘ └────────┬─────────┘
│ │
│ ├── CBE adapter
│ ├── Telebirr adapter
│ ├── Awash adapter
│ ├── M-Birr adapter
│ └── ...n more
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Cross-cutting services (each its own service, called via events): │
│ Notification · Audit-Log · Reconciliation · Fraud · Compliance · │
│ Reporting · Document-Store · Webhook-Receiver │
└─────────────────────────────────────────────────────────────────────┘
5.2 Communication patterns
| Pattern | When | Why |
|---|---|---|
| REST/JSON via gateway | All client ↔ backend traffic | Stable contract, tooled |
| gRPC + mTLS | Service ↔ service synchronous | Strong typing, low overhead, mTLS by default |
| Async events (Kafka or NATS JetStream) | Anything that crosses domain boundaries (e.g. "PayrollApproved") | Decoupling, replay, durability |
| Outbox table → CDC → bus | Whenever a service emits an event after a DB write | Atomicity between state and event |
| Saga (orchestration) | Multi-service workflows (loan disbursal) | Compensating actions for partial failure |
| Scheduled jobs (cron/temporal) | Daily settlements, EOD reconciliation | Predictable batches |
| SFTP / file watchers | Legacy bank rails | Reality of Ethiopian banking |
5.3 Data ownership
- Each microservice owns its own Postgres schema, not its own database instance (cost-effective at our stage). Logical separation enforced with per-service DB users and row-level security.
- No service reads another service's tables. Only via API or event subscription.
- Common reference data (countries, currencies, business calendars) lives in a small reference-data service and is cached locally.
5.4 The ledger is the system of record
This is non-negotiable. Wallet balances, loan outstanding balances, EWA available balances — all are derived from the double-entry ledger. We never increment a balance directly. We post a journal entry; the read-side projects balances from journals.
This means:
- Bug in projection? Replay from journals.
- Audit asks "why is this number what it is"? We show the journals.
- Reconciliation breaks? We diff our journals against the bank's file.
6. Microservice catalogue
Each entry lists the service's purpose, the Prisma models it owns, its primary public API surface, the events it publishes, and its direct dependencies.
6.1 Identity & KYC service (svc-identity)
- Purpose: User registration, authentication, MFA, RBAC, KYC orchestration, sanctions screening, customer profile.
- Owns: User, Role, AdminProfile, KYC documents, KYC decisions, 2FA secrets, session tokens.
- Public API: POST /users, POST /sessions, POST /kyc/submit, GET /kyc/status, POST /mfa/verify.
- Publishes: UserRegistered, KYCSubmitted, KYCApproved, KYCRejected, UserDeactivated.
- Dependencies: Document-store (for ID images), Sanctions (Dow Jones / World-Check or local list), Fayda API (national ID verification).
- External calls: SMS gateway, email, Fayda, sanctions list.
6.2 Business tenancy service (svc-business)
- Purpose: Business onboarding, departments, employees, business KYC (KYB), commercial agreements, fee schedules.
- Owns: Business, Department, Employee, EmployeeAbsence.
- Public API: POST /businesses, POST /departments, POST /employees, GET /businesses/:id/employees.
- Publishes: BusinessOnboarded, EmployeeJoined, EmployeeOffboarded.
- Dependencies: Identity (to link employees to users), Document-store.
6.3 Ledger service (svc-ledger)
- Purpose: Authoritative double-entry ledger; the system of record for all value movement.
- Owns: LedgerAccount, Transaction, JournalEntry.
- Public API: POST /transactions (with full journal entries; must balance), GET /accounts/:id/balance, GET /transactions/:id, POST /transactions/:id/reverse.
- Publishes: TransactionPosted, TransactionReversed, BalanceUpdated.
- Invariants: every transaction's debit total = credit total. Rejects unbalanced posts at the DB layer.
- No external dependencies.
6.4 Payments orchestration service (svc-payments)
- Purpose: Single internal "money API". Routes outbound payments to the right adapter; receives webhook callbacks; updates the ledger; enforces idempotency.
- Owns: Payment intents, payment attempts, idempotency keys, webhook event log, routing rules.
- Public API: POST /payments/intents (with idempotency key), GET /payments/:id, POST /payments/:id/cancel.
- Publishes: PaymentInitiated, PaymentSucceeded, PaymentFailed, PaymentReversed.
- Dependencies: Banking-adapter hub, Ledger, Risk (fraud), Audit.
6.5 Banking-adapter hub (svc-bank-adapters)
- Purpose: One adapter per provider; uniform internal contract.
- Owns: Provider configs, encrypted credentials, routing rules, per-provider rate limits, circuit breaker state.
- Adapters (one per provider):
  - cbe-adapter: CBE Birr REST, SFTP for bulk.
  - telebirr-adapter: Telebirr REST + webhooks.
  - awash-adapter, dashen-adapter, abyssinia-adapter, cooperative-adapter, zemen-adapter, nib-adapter, …
  - mbirr-adapter, hellocash-adapter, amole-adapter, …
  - sftp-adapter: generic SFTP batch sender for legacy rails.
- Common adapter interface (gRPC contract):
  - Send(idempotencyKey, amount, currency, source, dest, ref) → AttemptId
  - GetStatus(AttemptId) → Status
  - Refund(AttemptId, amount, ref) → AttemptId
  - WebhookHandler(rawBody, signature) → InternalEvent
  - ListSettlement(fromDate, toDate) → SettlementBatch
- Publishes: ProviderPaymentSucceeded, ProviderPaymentFailed, ProviderSettlementReady.
- Dependencies: Secrets vault, Network egress to provider endpoints.
6.6 Settlement & reconciliation service (svc-settlement)
- Purpose: End-of-day batches, reconciliation, dispute resolution workflow.
- Owns: SettlementBatch, SettlementRecord, reconciliation diff reports, disputes.
- Public API: POST /settlements/run, GET /settlements/:id, GET /reconciliation/diff?date=….
- Publishes: SettlementCompleted, ReconciliationBreakFound.
- Dependencies: Banking-adapter hub, Ledger, Notifications (when break found, alert finance).
6.7 Payroll service (svc-payroll)
- Purpose: Payroll cycle management, calculations (PAYE, pension, withholdings), payslip generation, approval workflow, disbursement orchestration.
- Owns: Payroll, PayrollEntry.
- Public API: POST /payrolls, POST /payrolls/:id/approve, POST /payrolls/:id/disburse, GET /payrolls/:id/payslips.
- Publishes: PayrollDrafted, PayrollApproved, PayrollDisbursed.
- Dependencies: Business, Identity, Payments, Ledger, Tax-engine (could be embedded initially).
6.8 Lending service (svc-lending)
- Purpose: Business and employee loans, applications, underwriting decisions, repayment schedules.
- Owns: BusinessLoan, EmployeeLoan, LoanPayment.
- Public API: POST /loans/apply, POST /loans/:id/disburse, POST /loans/:id/repay.
- Publishes: LoanApplied, LoanApproved, LoanDisbursed, LoanRepayment, LoanWrittenOff.
- Dependencies: Identity (KYC), Business, Payments (for disbursement and repayment), Ledger, Risk (for underwriting).
6.9 Embedded finance service (svc-embedded-finance)
- Purpose: Early Wage Access, BNPL, Equb, Bill Payments — features that ride the same money rails but have their own business logic.
- Owns: EarlyWageAccess, BNPLPurchase, BNPLPayment, Merchant, Equb, EqubMember, EqubPayout, BillPayment.
- Public API: POST /ewa/request, POST /bnpl/purchases, POST /equbs, POST /bills/pay.
- Publishes: EWAGranted, BNPLApproved, EqubCycleClosed.
- Dependencies: Identity, Lending (for credit decisions), Payments, Ledger, Risk.
6.10 Risk & fraud service (svc-risk)
- Purpose: Real-time fraud scoring, velocity limits, sanctions re-screening on every transaction, AML pattern detection.
- Owns: Rules engine, fraud scores, decision logs, suspicious activity reports (SAR).
- Public API: POST /risk/evaluate (returns ALLOW / REVIEW / BLOCK in <100 ms).
- Publishes: HighRiskTransactionDetected, SARFiled.
- Dependencies: Identity (KYC tier), Ledger (transaction history), Sanctions list.
6.11 Notification service (svc-notify)
- Purpose: Multi-channel notifications (SMS, email, push, in-app). Templating, localisation (Amharic / English), rate-limiting, preference enforcement.
- Owns: Templates, delivery logs, user preferences.
- Public API: POST /notifications/send (event-driven, mostly).
- Subscribes to: every event that triggers a user-facing message.
6.12 Audit-log service (svc-audit)
- Purpose: WORM (write-once-read-many) audit log of every privileged action and every money movement. Append-only, integrity-hashed.
- Owns: AuditLog (extended with hash chain).
- Public API: POST /audit/events (internal only), GET /audit/events (regulators, internal review).
- Storage: Hot in Postgres, warm in S3 (immutable bucket with object lock), hash-chained per business per day.
- Subscribes to: every domain event.
6.13 Document-store service (svc-documents)
- Purpose: Encrypted storage for KYC documents, contracts, payslips. Pre-signed download URLs.
- Owns: Metadata index; binaries in encrypted object storage (S3-compatible, server-side encryption with KMS keys).
- Public API: POST /documents (multipart, virus-scanned), GET /documents/:id/url (short-lived signed URL).
6.14 Webhook receiver (svc-webhooks)
- Purpose: Public endpoint for provider callbacks. Validates signatures, drops invalid, persists raw payload, hands off to Payments via internal event.
- Owns: Raw webhook event log (immutable).
- Public API: POST /webhooks/:provider (one path per provider).
6.15 Reporting / read-models (svc-reports)
- Purpose: Subscribes to all domain events, builds projected read models for dashboards (admin / business / FI / BNPL-partner / client). No business logic; just denormalised views.
6.16 Reference-data service (svc-reference)
- Purpose: Banks list, branches, BIC codes, currencies, business calendars, holidays, fee schedules, tax tables, regulatory limits. Cached aggressively. Read-mostly.
7. Bank / FI / Wallet integration layer
7.1 The adapter pattern
We define one internal contract:
interface PaymentAdapter {
  /** Initiate an outbound payment */
  send(input: {
    idempotencyKey: string;
    amount: Decimal;
    currency: 'ETB';
    source: {                   // We are paying from…
      accountNumber: string;
      accountName?: string;
    };
    destination: {              // We are paying to…
      kind: 'BANK_ACCOUNT' | 'WALLET';
      identifier: string;       // account # or wallet phone
      accountName?: string;
      bankCode?: string;        // for BANK_ACCOUNT
    };
    reference: string;          // shows on the customer's statement
    metadata: Record<string, string>;
  }): Promise<{ attemptId: string; status: AttemptStatus }>;
  /** Query status */
  getStatus(attemptId: string): Promise<AttemptStatus>;
  /** Refund / reversal */
  refund(attemptId: string, amount: Decimal, reference: string)
    : Promise<{ refundId: string; status: AttemptStatus }>;
  /** Parse and validate an incoming webhook */
  parseWebhook(rawBody: Buffer, headers: Record<string, string>)
    : Promise<NormalisedProviderEvent>;
  /** Pull settlement data (daily) */
  pullSettlement(date: Date): Promise<NormalisedSettlement>;
  /** Provider-specific health check */
  health(): Promise<HealthResult>;
}
Every adapter implements this. The Payments service never knows the difference between Telebirr and CBE.
7.2 Adapter responsibilities
- Auth handshake (OAuth / HMAC / mTLS) → handled per adapter.
- Request shaping → translate normalised → provider format.
- Response normalisation → translate provider → internal.
- Idempotency → the adapter enforces single-execution even if the provider doesn't.
- Circuit breaker, retries with exponential back-off + jitter, bulkhead pool isolated per provider.
- Provider-specific error mapping to a small internal taxonomy.
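The retry behaviour listed above (exponential back-off with jitter) might look like the sketch below. It uses the "full jitter" variant, which is one common choice and an assumption here; the base delay, cap, attempt count and function names are illustrative rather than agreed values. Circuit breaking and bulkheads are left out.

```typescript
/**
 * Delay before retry number `attempt` (0-based): a random value in
 * [0, min(cap, base * 2^attempt)), i.e. exponential back-off with full jitter.
 * `random` is injectable so the function can be tested deterministically.
 */
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 10_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

/** Retry an operation, but only for errors the caller classifies as retryable. */
async function withRetries<T>(
  operation: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 4,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const lastAttempt = attempt >= maxAttempts - 1;
      if (lastAttempt || !isRetryable(err)) throw err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}

console.log(backoffDelayMs(3, 200, 10_000, () => 0.5)); // 800
```

Every retry of a payment send should carry the same idempotency key as the first attempt; otherwise a retry after a timeout could move money twice.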
7.3 Onboarding a new bank (the cookbook)
Estimate per new bank: 2–4 engineer-weeks depending on protocol.
- Legal & commercial — sign integration agreement, get sandbox credentials, agree on cut-off times and limits.
- Connectivity — provision IPs to allow-list, set up mTLS or API key vault entry. Confirm latency from our infra.
- Implement adapter behind the common interface.
- Contract tests in CI against a recorded sandbox.
- Shadow mode — for two weeks, route a copy of real payment intents through the new adapter without actually sending, and diff the result with the live adapter.
- Canary — route 1% of eligible traffic; expand to 10%, 50%, 100% over a week with manual gating between steps.
- Settlement reconciliation — run for ≥ 14 days with zero unresolved breaks before declaring GA.
- Runbook — every adapter ships with: cut-off times, contact numbers, common error mappings, rollback procedure.
7.4 Routing rules (when there are multiple providers)
The Payments service picks the adapter at runtime based on:
- Destination kind: wallet number → wallet rail; bank account → bank rail.
- Bank code or wallet operator prefix (extracted from account number / phone number).
- Health & capacity: prefer healthy adapters; downgrade to fallback when one is degraded.
- Cost tiering: cheapest acceptable rail for the amount and speed required.
- Cut-off times: if a same-day rail is past its cut-off, fall back to a 24/7 rail (wallet) or queue for next business day.
- Customer preference: if the user has pinned a provider.
Routing is config-driven (Prisma JSON on FinancialInstitution
and a new Provider table), reviewable in the admin portal, with a
full audit history of changes.
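A simplified sketch of how these routing rules could combine at runtime is shown below. The ProviderRoute shape, the provider IDs, fees and cut-off hours are all hypothetical; bank-code matching and the wallet fallback for missed cut-offs are omitted, so a null result here stands for "queue for the next business day".

```typescript
type DestinationKind = "BANK_ACCOUNT" | "WALLET";

interface ProviderRoute {
  providerId: string;        // hypothetical identifier, e.g. "telebirr"
  kind: DestinationKind;
  healthy: boolean;
  costEtb: number;           // illustrative flat fee
  cutOffHour: number | null; // local hour at which the rail closes; null = 24/7
}

interface RoutingRequest {
  kind: DestinationKind;
  localHour: number;
  pinnedProviderId?: string; // customer preference, if any
}

function pickRoute(routes: ProviderRoute[], req: RoutingRequest): ProviderRoute | null {
  // Keep only rails that match the destination, are healthy, and are still open.
  const eligible = routes.filter(
    (r) =>
      r.kind === req.kind &&
      r.healthy &&
      (r.cutOffHour === null || req.localHour < r.cutOffHour),
  );
  // A pinned provider wins if it is eligible.
  const pinned = eligible.find((r) => r.providerId === req.pinnedProviderId);
  if (pinned) return pinned;
  // Otherwise take the cheapest eligible rail.
  const cheapest = eligible.slice().sort((a, b) => a.costEtb - b.costEtb)[0];
  return cheapest ?? null;
}

const routes: ProviderRoute[] = [
  { providerId: "telebirr", kind: "WALLET", healthy: true, costEtb: 5, cutOffHour: null },
  { providerId: "cbe", kind: "BANK_ACCOUNT", healthy: true, costEtb: 2, cutOffHour: 16 },
  { providerId: "awash", kind: "BANK_ACCOUNT", healthy: true, costEtb: 3, cutOffHour: 17 },
];

console.log(pickRoute(routes, { kind: "BANK_ACCOUNT", localHour: 10 })?.providerId); // cbe
console.log(pickRoute(routes, { kind: "BANK_ACCOUNT", localHour: 16 })?.providerId); // awash
console.log(pickRoute(routes, { kind: "BANK_ACCOUNT", localHour: 18 })); // null
```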
8. User identity & KYC
8.1 Registration — what we collect
We use a tiered KYC model — common in mobile money — so that small wallets can open instantly while higher-value services require more verification.
| Tier | Limits (illustrative — confirm with NBE) | Required data |
|---|---|---|
| Tier 0 — Anonymous browse | No money operations | Phone number, email, password, accepted T&C |
| Tier 1 — Light wallet | Daily ≤ 5,000 ETB, monthly ≤ 30,000 ETB | + Full legal name, DOB, gender, Fayda ID, selfie liveness |
| Tier 2 — Standard | Daily ≤ 30,000, monthly ≤ 300,000 | + Address, occupation, source of funds, Fayda biometric match |
| Tier 3 — Business | Per agreement | + Business registration, TIN, beneficial owners, board resolution, signed mandates |
| Tier 3 — High value | Higher limits | + Enhanced Due Diligence (EDD), PEP screening, manual review |
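Because these limits are illustrative and change by directive, they belong in configuration rather than code. The sketch below shows one possible shape; the table and function names are hypothetical, amounts are whole ETB for brevity (production code would use a decimal type), and Tier 3 is deliberately left unconfigured because its limits are set per agreement.

```typescript
interface TierLimits {
  dailyEtb: number;
  monthlyEtb: number;
}

// Illustrative values copied from the table above; Tier 0 allows no money operations.
const TIER_LIMITS: Record<number, TierLimits | undefined> = {
  0: { dailyEtb: 0, monthlyEtb: 0 },
  1: { dailyEtb: 5_000, monthlyEtb: 30_000 },
  2: { dailyEtb: 30_000, monthlyEtb: 300_000 },
};

interface LimitDecision {
  allowed: boolean;
  reason?: string;
}

function checkLimits(
  tier: number,
  amountEtb: number,
  spentTodayEtb: number,
  spentThisMonthEtb: number,
  limits: Record<number, TierLimits | undefined> = TIER_LIMITS,
): LimitDecision {
  const tierLimits = limits[tier];
  // Fail closed when a tier has no configured limits.
  if (!tierLimits) return { allowed: false, reason: `no limits configured for tier ${tier}` };
  if (spentTodayEtb + amountEtb > tierLimits.dailyEtb) {
    return { allowed: false, reason: "daily limit exceeded" };
  }
  if (spentThisMonthEtb + amountEtb > tierLimits.monthlyEtb) {
    return { allowed: false, reason: "monthly limit exceeded" };
  }
  return { allowed: true };
}

console.log(checkLimits(1, 1_000, 4_500, 10_000)); // { allowed: false, reason: 'daily limit exceeded' }
console.log(checkLimits(2, 1_000, 4_500, 10_000)); // { allowed: true }
```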
8.2 Data captured at user registration
Minimum field set in User (extend the schema accordingly):
- id (cuid)
- phone (E.164, unique, primary identifier in Ethiopia)
- email (unique, optional at Tier 0, required at Tier 1+)
- password (Argon2id hash, not bcrypt; pepper from KMS)
- firstName, middleName, lastName (Tier 1+)
- dateOfBirth, gender (Tier 1+)
- nationalId (Fayda FCN, unique, encrypted at rest)
- idDocumentType (e.g. PASSPORT, KEBELE_ID; fallback if no Fayda)
- idDocumentRef (document service ID)
- addressLine1, region, subCity, kebele (Tier 2+)
- occupation, sourceOfFunds, expectedMonthlyVolume (Tier 2+)
- kycStatus (PENDING → IN_REVIEW → VERIFIED / REJECTED)
- kycTier (0/1/2/3)
- kycSubmittedAt, kycDecisionAt, kycDecisionBy
- riskRating (LOW / MEDIUM / HIGH; drives EDD)
- pepFlag (Politically Exposed Person)
- sanctionsMatchAt (last screen result timestamp)
- consents (JSON: T&C version, privacy policy version, marketing opt-in, with timestamps and IP)
- createdAt, updatedAt, deletedAt, lastLoginAt, failedLoginCount
- Soft-delete (deletedAt), not hard delete: regulators require retention.
For businesses (KYB), capture:
- Legal name, trade name, business type, TIN, VAT number, registration number, registration date, registering authority.
- Industry classification (NBE category).
- Registered address; physical operating address.
- Directors / authorised representatives — each with their own user record and Tier 2+ KYC.
- Beneficial owners (≥ 25% ownership) — each individually verified.
- Bank account(s) for settlement.
- Board resolution / mandate authorising platform use.
8.3 Registration flow
Phone +
Email + ──► OTP sent ──► OTP verified ──► Password set ──► T&C
Password to phone to phone Argon2id consent
Tier 0 wallet created (no funds)
│
▼
Identity service
publishes UserRegistered
Tier 1 upgrade:
Name, DOB, gender, Fayda
Selfie + liveness (server-side check)
Fayda biometric match (svc-identity → Fayda API)
Sanctions screen
Risk score
──► auto-VERIFY (low risk + clean Fayda) or QUEUE for review
Tier 2 upgrade:
Address proof, occupation, source-of-funds declaration
Re-screen + risk re-score
Manual or rule-based decision
Business (Tier 3):
KYB workflow — admin review mandatory, no auto-approve.
8.4 KYC controls
- Document collection through svc-documents. Server-side virus scan (ClamAV in a side-car), file-type sniff (not just extension), EXIF strip, size cap (25 MB).
- Liveness: challenge-response with random head movements, server-side analysis. Cap at 3 attempts per 24 hours.
- Sanctions screening — daily re-screen of every active user against the platform's curated list (NBE list + UN consolidated + OFAC if relevant).
- PEP screening — at onboarding and on update.
- Adverse media — at Tier 3 only.
- Re-KYC — driven by changes in risk rating, transaction patterns, age of last verification (every 12 months for HIGH risk, 36 months for LOW).
- Right to be forgotten — implemented as cryptographic shredding of PII keys (per-user KMS key) plus tombstoning the record. The ledger entries stay (regulatory retention) but are pseudonymised.
8.5 Authentication & session
- First factor: phone + password (Argon2id, 64 MB / 3 iterations / parallelism 4).
- Second factor: TOTP (always available) + WebAuthn (preferred on supported devices). SMS OTP allowed as fallback but rate-limited and never the only factor for sensitive actions.
- Sessions: short-lived access tokens (15 min) + rotating refresh tokens (7 days, single-use, refresh-token reuse triggers global session invalidation).
- Step-up auth required for: changing destination accounts, initiating payments > ETB 50k, adding a new beneficiary, disabling 2FA.
- Device binding: each device gets a per-device JWK fingerprint, bound to the refresh token.
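The single-use refresh-token rule with reuse detection can be illustrated with a small in-memory sketch. The class and method names are hypothetical, and several things a real implementation needs are left out as simplifying assumptions: token expiry (7 days), storing only token hashes, device binding, and persistence.

```typescript
import { randomUUID } from "node:crypto";

interface RefreshRecord {
  userId: string;
  used: boolean;
}

class RefreshTokenStore {
  private readonly tokens = new Map<string, RefreshRecord>();

  /** Issue a fresh opaque refresh token for a user. */
  issue(userId: string): string {
    const token = randomUUID();
    this.tokens.set(token, { userId, used: false });
    return token;
  }

  /**
   * Exchange a refresh token for a new one. Presenting a token that was
   * already spent is treated as theft: every session of that user is revoked.
   */
  rotate(token: string): string | null {
    const record = this.tokens.get(token);
    if (!record) return null;
    if (record.used) {
      this.revokeAll(record.userId);
      return null;
    }
    record.used = true; // keep the spent token so later reuse can be detected
    return this.issue(record.userId);
  }

  private revokeAll(userId: string): void {
    this.tokens.forEach((record, token) => {
      if (record.userId === userId) this.tokens.delete(token);
    });
  }
}

const sessions = new RefreshTokenStore();
const t1 = sessions.issue("user-1");
const t2 = sessions.rotate(t1);             // normal rotation: t1 is spent, t2 is live
console.log(t2 !== null);                   // true
console.log(sessions.rotate(t1));           // null: reuse detected, all sessions revoked
console.log(t2 && sessions.rotate(t2));     // null: t2 was revoked along with the rest
```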
9. Security architecture (defence-in-depth)
Defence in depth means every layer enforces policy independently. A breach of one does not give the attacker the next.
9.1 Network layer
- Cloudflare-class WAF + DDoS at the edge. Custom rules for known attack patterns; bot management; geo-fencing for admin endpoints.
- Public ingress only into the API gateway and the webhook receiver. No service is reachable from the internet directly.
- Private VPC. Each service in its own subnet. Egress through NAT with a deny-by-default allow-list.
- mTLS between every service. Certificates issued by an internal CA with 24-hour rotations; SPIFFE/SPIRE for identity.
- Egress allow-list for outbound calls to provider endpoints; logged and alerted on unknown egress attempts.
9.2 Application layer
- Input validation at the gateway via JSON schema + business validators in each service (Zod or class-validator).
- Output encoding. No HTML in JSON responses. CSP, HSTS, COOP/COEP headers set at the gateway.
- No raw SQL anywhere. Prisma everywhere; reviewed for type-unsafe extensions.
- Dependency scanning (Snyk / Dependabot) in CI. CVE deadline: Critical = 48 h, High = 7 d, Medium = 30 d.
- SAST (Semgrep + ESLint security plugins) in CI.
- SCA (Software Composition Analysis) on every build.
- Container scanning (Trivy) before image promotion.
9.3 Data layer
- Encryption at rest for all data stores. Postgres with TDE-equivalent (filesystem-level + per-column encryption for PII). Object storage encrypted with KMS.
- Encryption in transit — TLS 1.3 only on all hops. PSP-level TLS pinning where the provider supports it.
- Tokenisation of bank account numbers and Fayda IDs. The raw value is stored only in the Identity / Documents service; every other service references a token.
- Per-tenant key isolation: each business has its own data-encryption key, derived from a master key in KMS.
- Backups — encrypted, off-region, tested monthly via restore drills.
- No PII in logs. Structured logging with a redaction layer that drops fields tagged @sensitive.
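A redaction layer of this kind can be sketched as a recursive filter applied before a log record is serialised. The sketch assumes sensitive fields are identified by a fixed set of key names; how the @sensitive tag would actually be resolved (decorator metadata, schema annotation) is not specified here, and the key names are illustrative. It handles plain JSON-like values only.

```typescript
// Hypothetical list of field names treated as sensitive.
const SENSITIVE_FIELDS = new Set(["password", "nationalId", "phone", "accountNumber"]);

/** Return a deep copy of a JSON-like value with sensitive fields dropped. */
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const clean: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      if (!SENSITIVE_FIELDS.has(key)) clean[key] = redact(source[key]);
    }
    return clean;
  }
  return value;
}

const logRecord = {
  event: "kyc.submitted",
  user: { id: "u1", phone: "+251900000000" },
  docs: [{ nationalId: "0000", type: "FAYDA" }],
};

console.log(JSON.stringify(redact(logRecord)));
// {"event":"kyc.submitted","user":{"id":"u1"},"docs":[{"type":"FAYDA"}]}
```

Dropping the field entirely, rather than masking it, matches the wording above and avoids leaking value length or format.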
9.4 Identity & secret layer
- All secrets in HashiCorp Vault (or AWS Secrets Manager). Apps receive short-lived dynamic credentials at boot, refreshed every hour.
- Per-service service-account; JIT issuance.
- HSM-backed signing keys for: JWT signing, webhook signature verification, payment-message signing.
- Break-glass procedure with two-person rule on critical secrets.
- Workforce SSO via OIDC for the admin portal; SCIM-managed provisioning; MFA required.
9.5 Audit layer
- WORM audit log: every privileged action (transfer, KYC decision, role change, secret access) emits an event to svc-audit. Stored in Postgres and replicated to S3 with Object Lock (governance retention of 7 years, to meet regulatory retention).
- Hash chain: each daily batch of audit events is Merkle-rooted; the root is published to an internal append-only log signed with the HSM. Tampering is detectable retroactively.
- SIEM: stream audit + security events to a SIEM (Wazuh or Datadog Security Monitoring) with playbooks for: unusual login velocity, privileged-action anomalies, payment-routing changes, webhook signature failures.
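To make the tamper-evidence idea concrete, here is a minimal sketch of a linear hash chain over audit events. It is a simpler stand-in for the per-day Merkle root described above, not the same structure, and it omits HSM signing. The event fields and function names are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

interface AuditEvent {
  actor: string;
  action: string;
  at: string; // ISO-8601 timestamp
}

interface ChainedEvent extends AuditEvent {
  prevHash: string;
  hash: string;
}

const GENESIS = "0".repeat(64);

/** Hash of an event bound to the hash of its predecessor. */
function hashEvent(prevHash: string, e: AuditEvent): string {
  return createHash("sha256")
    .update(prevHash)
    .update(JSON.stringify([e.actor, e.action, e.at]))
    .digest("hex");
}

function appendEvent(chain: ChainedEvent[], e: AuditEvent): ChainedEvent[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...chain, { ...e, prevHash, hash: hashEvent(prevHash, e) }];
}

/** Recompute every link; any edited, removed or reordered event breaks the chain. */
function verifyChain(chain: ChainedEvent[]): boolean {
  let prev = GENESIS;
  for (const e of chain) {
    if (e.prevHash !== prev || e.hash !== hashEvent(prev, e)) return false;
    prev = e.hash;
  }
  return true;
}

let chain: ChainedEvent[] = [];
chain = appendEvent(chain, { actor: "admin-1", action: "kyc.approve:u1", at: "2026-05-23T09:00:00Z" });
chain = appendEvent(chain, { actor: "admin-2", action: "role.change:u2", at: "2026-05-23T09:05:00Z" });
console.log(verifyChain(chain)); // true

const tampered = chain.map((e, i) => (i === 0 ? { ...e, action: "kyc.reject:u1" } : e));
console.log(verifyChain(tampered)); // false
```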
9.6 Operational layer
- All admin actions in production are break-glass, time-boxed (< 60 min), require a ticket reference, and produce a recorded session.
- Code that touches money requires two-person review on PR.
- Production data access is forbidden by default; granted JIT for a specific ticket, capped at 2 hours, audited.
- Backups tested via restore drills monthly; failover drills quarterly.
10. Data consistency, ledger & idempotency
10.1 The double-entry rule
Every value movement is one Transaction with two or more
JournalEntry rows whose debits == credits. Examples:
Payroll disbursement of 10,000 ETB to employee E1:
DR Payroll Expense 10,000
CR Cash / Bank Settlement Account 10,000
Wallet credit on employee side:
DR Cash / Bank Settlement Account 10,000
CR Employee Wallet (E1) 10,000
Wallet balance = sum of credits − sum of debits on the wallet account (the wallet is a liability account, so a credit increases it, as in the example above). Read from the projection; reconcile to journals nightly.
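The double-entry rule and the wallet balance can be sketched as below. The type names, account codes, and helper functions are hypothetical. Amounts are integer minor units (santim) held as `number` for brevity; a real ledger would more likely use an arbitrary-precision integer or a database numeric type.

```typescript
type Side = "DR" | "CR";

interface JournalEntry {
  account: string; // hypothetical account code, e.g. "wallet:E1"
  side: Side;
  amountMinor: number; // integer minor units, always positive
}

interface Transaction {
  id: string;
  entries: JournalEntry[];
}

// Rejects a transaction unless it has two or more entries and debits == credits.
function assertBalanced(tx: Transaction): void {
  if (tx.entries.length < 2) {
    throw new Error("tx " + tx.id + ": needs at least two journal entries");
  }
  let debits = 0;
  let credits = 0;
  for (const e of tx.entries) {
    if (!Number.isSafeInteger(e.amountMinor) || e.amountMinor <= 0) {
      throw new Error("tx " + tx.id + ": amounts must be positive integers");
    }
    if (e.side === "DR") debits += e.amountMinor;
    else credits += e.amountMinor;
  }
  if (debits !== credits) {
    throw new Error("tx " + tx.id + ": debits " + debits + " != credits " + credits);
  }
}

// Wallets are credit-normal: a CR raises the balance, a DR lowers it.
function walletBalance(journal: Transaction[], account: string): number {
  let balance = 0;
  for (const tx of journal) {
    for (const e of tx.entries) {
      if (e.account !== account) continue;
      balance += e.side === "CR" ? e.amountMinor : -e.amountMinor;
    }
  }
  return balance;
}
```

Running the two example transactions above through `walletBalance` for E1's wallet gives 1,000,000 minor units, i.e. 10,000 ETB.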
10.2 Idempotency
Every state-changing public API requires Idempotency-Key
(client-generated UUID, 24-hour TTL). The gateway stores the first
response keyed by (client_id, idempotency_key) and returns the
same response on retries.
Internal services apply the same rule on their own gRPC interface using a request-ID header.
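A minimal in-memory sketch of the gateway behaviour described above: the first response is stored under (client_id, idempotency_key) and replayed on retries until the TTL lapses. The class and method names are hypothetical, the handler is synchronous for brevity, and a production store would be shared (for example a database table) and would also have to handle two concurrent requests carrying the same key.

```typescript
interface ApiResponse {
  status: number;
  body: string;
}

interface CachedResponse {
  response: ApiResponse;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 24-hour TTL from the text

class IdempotencyStore {
  private readonly cache = new Map<string, CachedResponse>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = DEFAULT_TTL_MS) {
    this.ttlMs = ttlMs;
  }

  // Runs `handler` at most once per (clientId, idempotencyKey) within the TTL.
  // Retries get the stored first response instead of a second execution.
  execute(
    clientId: string,
    idempotencyKey: string,
    nowMs: number,
    handler: () => ApiResponse,
  ): ApiResponse {
    const cacheKey = JSON.stringify([clientId, idempotencyKey]);
    const hit = this.cache.get(cacheKey);
    if (hit !== undefined && hit.expiresAt > nowMs) {
      return hit.response;
    }
    const response = handler();
    this.cache.set(cacheKey, { response, expiresAt: nowMs + this.ttlMs });
    return response;
  }
}
```

Scoping the cache key by client means two clients that happen to generate the same UUID never see each other's responses.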
10.3 Outbox pattern
When a service writes to its DB and must emit an event, it
writes the event to an outbox table in the same transaction. A
small relay process (Debezium or a custom worker) ships outbox rows
to the bus and marks them sent. This guarantees at-least-once
delivery without distributed transactions.
Consumers are designed idempotent (event-ID dedupe table) so "at-least-once" is safe.
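The outbox, relay, and idempotent consumer fit together roughly as in the sketch below. Everything here is in-memory and hypothetical (`ServiceDb`, `relayOnce`, the topic name); the single method call stands in for one database transaction, and the `crashBeforeMark` flag exists only to simulate a relay that dies after publishing.

```typescript
interface OutboxRow {
  eventId: string;
  topic: string;
  payload: string;
  sent: boolean;
}

class ServiceDb {
  readonly loans: string[] = [];
  readonly outbox: OutboxRow[] = [];

  // Stand-in for one DB transaction: the business row and the outbox row
  // are written together, so neither exists without the other.
  createLoan(loanId: string): void {
    this.loans.push(loanId);
    this.outbox.push({
      eventId: "loan-created:" + loanId,
      topic: "lending.loan.created", // hypothetical topic name
      payload: loanId,
      sent: false,
    });
  }
}

// Relay: publish first, mark sent second. A crash between the two steps
// means the row is published again next run, hence at-least-once delivery.
function relayOnce(
  db: ServiceDb,
  publish: (row: OutboxRow) => void,
  crashBeforeMark: boolean = false,
): number {
  let published = 0;
  for (const row of db.outbox) {
    if (row.sent) continue;
    publish(row);
    published += 1;
    if (crashBeforeMark) return published; // simulated crash
    row.sent = true;
  }
  return published;
}

// Consumer with an event-ID dedupe set, so redelivery is harmless.
class DedupingConsumer {
  private readonly seen = new Set<string>();
  readonly handled: string[] = [];

  handle(row: OutboxRow): void {
    if (this.seen.has(row.eventId)) return;
    this.seen.add(row.eventId);
    this.handled.push(row.payload);
  }
}
```

In a real service the dedupe set would be a table written in the same transaction as the consumer's own side effects.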
10.4 Saga (orchestration) for cross-service workflows
Example: Employee loan disbursement.
1. svc-lending : create EmployeeLoan (PENDING)
2. svc-risk : evaluate → APPROVE
3. svc-ledger : reserve funds (post pending journal)
4. svc-payments : initiate payment via adapter
5. (callback) : svc-payments updates → SUCCESS
6. svc-ledger : finalise journal (move from pending to posted)
7. svc-lending : mark loan DISBURSED, schedule repayments
If step 4 fails: orchestrator triggers compensating actions (release ledger reservation, mark loan as FAILED, notify customer).
Implementation: Temporal (preferred) or a NestJS saga library. Temporal gives durable execution, retry, and replay for free.
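To make the compensation flow concrete, here is a toy orchestrator, assuming each step is a pair of an action and a compensating action. It is not Temporal and has none of Temporal's durability; `SagaStep` and `runSaga` are hypothetical names, and steps are synchronous for brevity.

```typescript
interface SagaStep {
  name: string;
  action: () => void; // throws on failure
  compensate: () => void; // undoes the action; may be a no-op
}

interface SagaResult {
  ok: boolean;
  completed: string[];
  compensated: string[];
  error?: string;
}

// Runs steps in order. On the first failure, compensates every step that
// already completed, in reverse order, then reports the failure.
function runSaga(steps: SagaStep[]): SagaResult {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
    } catch (err) {
      const compensated: string[] = [];
      for (const done of completed.slice().reverse()) {
        done.compensate();
        compensated.push(done.name);
      }
      return {
        ok: false,
        completed: completed.map((s) => s.name),
        compensated,
        error: err instanceof Error ? err.message : String(err),
      };
    }
  }
  return { ok: true, completed: completed.map((s) => s.name), compensated: [] };
}
```

The sketch keeps its progress in memory, so a crash mid-saga loses track of what to compensate. Persisting that progress is exactly what a durable workflow engine adds.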
10.5 Reconciliation
Every morning, svc-settlement pulls each provider's daily
settlement report, joins to our ledger journals, and produces a
diff report:
- Matched (✓)
- Provider has, we don't → investigate (possible missed webhook)
- We have, provider doesn't → investigate (possible double-spend)
- Amount mismatch → investigate
Unresolved breaks block opening for new business in that provider channel until cleared. Disputes follow a manual workflow with a tracked audit trail.
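The four diff buckets can be sketched as a join on the payment reference. This assumes each reference appears at most once per side per day and that both sides are already normalised to minor units; the type and bucket names are hypothetical.

```typescript
interface SettlementLine {
  reference: string;
  amountMinor: number;
}

type Bucket = "matched" | "provider_only" | "ledger_only" | "amount_mismatch";

interface DiffRow {
  reference: string;
  bucket: Bucket;
}

// Joins the provider's settlement report to our ledger journals by reference.
function reconcile(provider: SettlementLine[], ledger: SettlementLine[]): DiffRow[] {
  const ledgerByRef = new Map<string, number>();
  for (const line of ledger) {
    ledgerByRef.set(line.reference, line.amountMinor);
  }

  const diff: DiffRow[] = [];
  const seenAtProvider = new Set<string>();
  for (const line of provider) {
    seenAtProvider.add(line.reference);
    const ours = ledgerByRef.get(line.reference);
    if (ours === undefined) {
      diff.push({ reference: line.reference, bucket: "provider_only" }); // possible missed webhook
    } else if (ours !== line.amountMinor) {
      diff.push({ reference: line.reference, bucket: "amount_mismatch" });
    } else {
      diff.push({ reference: line.reference, bucket: "matched" });
    }
  }
  for (const line of ledger) {
    if (!seenAtProvider.has(line.reference)) {
      diff.push({ reference: line.reference, bucket: "ledger_only" }); // we have it, provider doesn't
    }
  }
  return diff;
}

// Any unresolved break keeps the provider channel closed for new business.
function channelBlocked(diff: DiffRow[]): boolean {
  return diff.some((row) => row.bucket !== "matched");
}
```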
11. High availability & zero-downtime deployments
11.1 Availability targets (SLOs)
| Capability | Availability target | Latency p99 | RPO | RTO |
|---|---|---|---|---|
| Identity / auth | 99.95% | 300 ms | 1 min | 5 min |
| Payments init | 99.95% | 500 ms | 0 (durable) | 5 min |
| Webhook receipt | 99.99% | 200 ms | 0 (durable) | 1 min |
| Admin portal | 99.9% | 1 s | 5 min | 30 min |
| Reporting | 99% (read-only) | 2 s | 1 hour | 4 hours |
We don't promise more than the underlying bank rails can deliver on the customer journey. The bank rail's downtime is communicated in-product.
11.2 Topology
- Multi-AZ active-active in the primary region (Addis or nearest cloud region with NBE-acceptable data residency).
- Postgres: synchronous replication to one in-region standby + async to a DR region. Connection pooling via PgBouncer (transaction pooling) at each service.
- Redis: clustered, multi-AZ, primary + replicas; used only for caches and rate-limit counters — never as a source of truth.
- Kafka/NATS: 3-node cluster across AZs, replication factor 3, min in-sync replicas 2.
- Object storage: multi-AZ by default; cross-region replication for audit / documents.
11.3 Deploy strategy
- Blue-green for stateless services (gateway, all microservices). Cut over via the gateway after the new colour passes synthetic checks.
- Canary for risky changes: 1% → 10% → 50% → 100% with auto-rollback on SLO burn.
- Feature flags (Unleash or OpenFeature) on every new payment flow, so production exposure can be reduced to zero in seconds.
11.4 Zero-downtime database migrations
Always expand → migrate → contract, never break-and-replace:
- Expand — add new columns/tables, nullable, optional. Deploy.
- Backfill — populate new columns in chunks with throttling.
- Migrate — switch reads/writes to new columns behind a flag.
- Validate in production with shadow reads.
- Contract — remove old columns in a later release.
Prisma migrations are committed; every migration reviewed for
locking behaviour (long lock = forbidden). Tools: pg_repack for
table rewrites, pgroll for safer multi-step migrations.
11.5 Graceful shutdown and back-pressure
- All services handle SIGTERM: stop accepting new requests, drain in-flight (capped at 30 s), then exit.
- Bulkheads — each downstream provider has its own connection pool, so a slow provider can't starve the rest.
- Circuit breakers (resilience4j-equivalent) per adapter, per endpoint. Open at 50% errors over 60 s, half-open after 30 s.
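A sketch of a breaker with the thresholds above (open at 50% errors over 60 s, half-open after 30 s). The clock is passed in so the behaviour is easy to test. The minimum sample count of 10 is an assumption added here so that one failed call on a quiet endpoint does not trip the breaker; the class and method names are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half_open";

interface Outcome {
  at: number; // epoch milliseconds
  ok: boolean;
}

const WINDOW_MS = 60000; // error rate is measured over the last 60 s
const ERROR_THRESHOLD = 0.5; // open at 50% errors
const HALF_OPEN_AFTER_MS = 30000; // allow a probe after 30 s
const MIN_SAMPLES = 10; // assumption, not from the plan

class CircuitBreaker {
  private outcomes: Outcome[] = [];
  private openedAt: number | null = null;
  private probeInFlight = false;

  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    if (this.probeInFlight || now - this.openedAt >= HALF_OPEN_AFTER_MS) return "half_open";
    return "open";
  }

  // Returns true if the call may proceed. Half-open admits a single probe.
  allow(now: number): boolean {
    if (this.openedAt === null) return true;
    if (this.probeInFlight) return false;
    if (now - this.openedAt >= HALF_OPEN_AFTER_MS) {
      this.probeInFlight = true;
      return true;
    }
    return false;
  }

  // Records the result of a call that `allow` let through.
  record(now: number, ok: boolean): void {
    if (this.probeInFlight) {
      this.probeInFlight = false;
      if (ok) {
        this.openedAt = null; // probe succeeded: close and start fresh
        this.outcomes = [];
      } else {
        this.openedAt = now; // probe failed: reopen and restart the timer
      }
      return;
    }
    if (this.openedAt !== null) return; // late result from before opening
    this.outcomes = this.outcomes.filter((o) => now - o.at < WINDOW_MS);
    this.outcomes.push({ at: now, ok });
    const failures = this.outcomes.filter((o) => !o.ok).length;
    if (this.outcomes.length >= MIN_SAMPLES && failures / this.outcomes.length >= ERROR_THRESHOLD) {
      this.openedAt = now;
    }
  }
}
```

One instance per adapter endpoint keeps a failing endpoint from tripping the breaker for healthy ones on the same provider.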
11.6 Disaster recovery
- DR region with async replication.
- Failover tested every quarter (game-day).
- Backups (PITR Postgres) every 5 min, retained 30 days hot + 7 years cold.
- Documented runbook for every "what if X fails" — region, AZ, Postgres, Kafka, a provider, the gateway.
12. Observability & operations
12.1 The three signals
- Logs: structured JSON, correlation ID per request (W3C traceparent). Loki or OpenSearch. PII-redacted.
- Metrics: Prometheus, scraped by service-mesh. Default RED metrics (Rate / Errors / Duration) per endpoint, USE metrics per resource. Custom business metrics: payments initiated, payments succeeded, settlement matched %, KYC decisions per hour.
- Traces: OpenTelemetry, sampled at 10% normally, 100% for payments. Backend: Tempo / Honeycomb.
12.2 SLO-driven on-call
- Every critical service publishes its SLO (availability + latency) to a dashboard. Burn-rate alerts page on-call when error budget burns > 14× normal.
- Pages are rare and respected. If a page is wrong, the postmortem fixes the alert.
- Runbooks linked in every alert.
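The burn-rate figure can be read as the observed error rate divided by the error budget (1 minus the SLO target). The sketch below assumes a single measurement window and reads "14× normal" as a burn rate above 14; the function names are hypothetical, and window lengths are left to the alerting design.

```typescript
const PAGE_BURN_RATE = 14; // page when the budget burns faster than 14x

// Burn rate = observed error rate / error budget, where budget = 1 - SLO.
// A burn rate of 1 spends the budget exactly over the SLO period.
function burnRate(failed: number, total: number, sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new Error("sloTarget must be strictly between 0 and 1");
  }
  if (total <= 0) return 0; // no traffic, nothing burned
  const errorBudget = 1 - sloTarget;
  return failed / total / errorBudget;
}

function shouldPage(failed: number, total: number, sloTarget: number): boolean {
  return burnRate(failed, total, sloTarget) > PAGE_BURN_RATE;
}
```

For a 99.9% SLO the budget is 0.1%, so a 1% error rate in the window is a burn rate of about 10 (no page) and a 2% error rate is about 20 (page).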
12.3 Synthetic monitoring
- Probe the full payment path (init → adapter sandbox → callback → ledger update) every 5 minutes from outside.
- Probe each adapter's health() every minute; degrade routing automatically on failure.
12.4 Blameless postmortems
For every customer-impacting incident: written within 5 business days, blameless, ships with at least one action item per root cause. Reviewed in engineering all-hands monthly.
13. Regulation & compliance
We are a fintech in Ethiopia handling money on behalf of others. This is a regulated environment.
13.1 NBE (National Bank of Ethiopia)
- Hold the relevant licence for the service category (Payment System Operator, Payment Instrument Issuer, etc. — confirm with legal).
- File the directive-required reports: transaction volumes, KYC metrics, suspicious activity reports.
- Data residency: customer data and transactional records remain inside Ethiopia unless an explicit exception is granted. Plan infra accordingly (in-country DC or NBE-approved cloud).
- Implement and test BCP (Business Continuity Plan) per NBE's expectations.
13.2 PCI DSS
- Scope: if any rail involves a card (PAN), even indirectly, PCI applies. Strongly prefer routing all card interactions through a PCI-certified processor so we are out-of-scope or SAQ-A scope.
- If in scope: tokenise PANs at the perimeter; never store PAN in our DBs; segment cardholder data environment (CDE) with separate VPC and stricter controls.
13.3 AML / CFT
- Documented AML programme with a designated officer.
- KYC tiers, transaction monitoring rules, SAR filing workflow.
- Customer Risk Rating updated continuously; EDD for HIGH.
- Sanctions screening: NBE list + UN consolidated + OFAC where applicable. Daily re-screen of the customer base.
- Record retention: 7 years post-relationship.
13.4 Privacy
- Lawful basis for processing — contractual necessity (payments) + legal obligation (KYC) + legitimate interest (fraud).
- Subject access requests served within 30 days.
- Privacy by design and default — KYC documents have a default retention of 7 years post-closure, automatically purged.
13.5 Information security management
- Target ISO/IEC 27001 certification within 18 months of launch.
- Document and operate to an ISMS — risk register, asset register, access control policy, change management policy, incident response policy, vendor management policy.
14. Migration plan (strangler fig from monolith)
We do not rewrite. We extract.
14.1 Phase 0 — Foundation (4–6 weeks)
Goal: nothing visible to users yet, but the floor is solid.
- Provision: VPC, multi-AZ Kubernetes (or Nomad), Postgres HA, Kafka/NATS, Redis, Vault, S3 (or Wasabi), KMS.
- Set up CI/CD with: lint, test, SAST, SCA, container scan, signed images (cosign), promotion gates per environment.
- Bring up observability stack: Prom / Grafana / Loki / Tempo / alerting.
- Add idempotency-key middleware to the existing monolith on every state-changing endpoint.
- Add an audit_log outbox + relay in the monolith.
- Add outbox and inbox tables to the monolith and emit domain events for the events we know we'll need.
14.2 Phase 1 — Identity & KYC (4–6 weeks)
Goal: identity is owned by a new service; the monolith calls it.
- Extract svc-identity with its own Postgres schema; replicate initial user data; switch monolith to call it for auth.
- Add the KYC tier model, document store integration, Fayda hook, sanctions screening.
- Migrate sessions / refresh tokens to the new service.
- Cut over with feature-flag, dual-write during a parallel window, then point all auth traffic to svc-identity.
14.3 Phase 2 — Ledger + Payments (8–10 weeks)
Goal: a single internal money API exists.
- Extract svc-ledger. The schema is already there; just stand it up and write the journal-posting service.
- Build svc-payments with idempotency, intent → attempt model, webhook ingestion.
- Build the adapter framework + two adapters first: Telebirr + CBE (the highest-volume rails).
- Build svc-webhooks as the public-facing receiver.
- Migrate any existing money flow in the monolith to call svc-payments. Most flows aren't yet implemented, which makes this easier than usual.
14.4 Phase 3 — Settlement & reconciliation (4 weeks)
- Build svc-settlement with daily batch + diff reporting.
- Operate it in shadow mode (no auto-settle) for 2 weeks; then enable auto-settle once breaks are < 0.05%.
14.5 Phase 4 — Lending, EWA, BNPL, Equb (parallelisable, ~3 months)
- Extract each product into svc-lending and svc-embedded-finance. Each one is straightforward once Payments + Ledger exist.
14.6 Phase 5 — Decommission monolith (2 weeks)
- The only thing left in the monolith should be the business / employee CRUD. Either keep it as svc-business or fold its remaining responsibilities into svc-identity / svc-business.
- Delete the monolith repo's old paths; tag a final image; archive.
14.7 New bank/wallet onboarding cadence
After Phase 3 is shipped, the goal is one new bank/wallet adapter every 2 weeks with the cookbook in §7.3.
15. Risk register
| # | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| 1 | Provider outage during peak | High | Medium | Multi-provider routing, fallback rail, in-product comms |
| 2 | Webhook signature secret leak | Low | High | HSM-backed, rotated quarterly, alert on misuse |
| 3 | Double-spend through retry | Medium | High | Idempotency keys; ledger reservation step |
| 4 | KYC data breach | Low | Critical | Tokenisation, per-tenant keys, no PII in logs, audit log |
| 5 | Reconciliation breaks pile up | Medium | High | Daily diff alerts, block channel on > N unresolved |
| 6 | DB migration brings down a service | Low | High | Expand-contract only, peer review, staging dress rehearsal |
| 7 | Saga compensation incomplete | Medium | High | Temporal-based durable orchestration; chaos test |
| 8 | Sanctions match missed | Low | Critical | Daily re-screen, fuzzy match, manual review queue |
| 9 | NBE regulatory change | Medium | Variable | Compliance officer monitors directives; config-driven limits |
| 10 | Insider misuse | Low | Critical | Two-person rule, JIT access, audit, rotation, anomaly alerts |
| 11 | DoS on webhook receiver | Medium | Medium | WAF, rate-limit, queue-back-pressure |
| 12 | Key data loss in DR scenario | Low | Critical | Multi-region async + tested restore + immutable audit S3 |
16. Open decisions
Items that need a call from leadership / architecture before detailed design can start.
- Cloud provider — AWS / Azure / local DC? Drives a lot of procurement and the data-residency story.
- Service mesh — Istio vs Linkerd vs just mTLS sidecars? Trade operational cost for traffic policy power.
- Event bus — Kafka (heavier, more featureful) vs NATS JetStream (lighter, simpler). Recommendation: NATS for now, migrate to Kafka if/when stream-processing demand grows.
- Workflow engine — Temporal (recommended) vs custom NestJS sagas. Temporal saves months of engineering.
- HSM / KMS — cloud KMS (AWS KMS / Azure Key Vault) vs local HSM. Cloud KMS is enough for v1; revisit for licensed signing requirements.
- National ID integration — Fayda API availability and SLA? Backup ID document flow for users without Fayda yet.
- Hosting — fully in-country (latency, data residency, harder ops) vs nearest cloud region (faster to ship, possible NBE pushback). Need legal opinion.
- Brand — finalise positioning so SEO / public copy stays
consistent. (See the seo/multi-rail-positioning branch on the landing repo.)
Appendix A — Suggested repo layout (future state)
Demoz-Pay/
├── apps/
│ ├── gateway/ # API gateway + auth/JWT/RL
│ ├── svc-identity/ # NestJS service
│ ├── svc-business/
│ ├── svc-ledger/
│ ├── svc-payments/
│ ├── svc-bank-adapters/ # one app, many adapter modules
│ ├── svc-settlement/
│ ├── svc-payroll/
│ ├── svc-lending/
│ ├── svc-embedded-finance/
│ ├── svc-risk/
│ ├── svc-notify/
│ ├── svc-audit/
│ ├── svc-documents/
│ ├── svc-webhooks/
│ ├── svc-reports/
│ ├── svc-reference/
│ ├── admin/ # Next.js frontends — unchanged
│ ├── business/
│ ├── client/
│ ├── fi/
│ ├── bnpl-partner/
│ └── docs/
├── libs/
│ ├── contracts/ # gRPC .proto + generated types
│ ├── events/ # Domain event schemas (Avro/JSON)
│ ├── sdk/ # Internal SDK to call services
│ ├── ui/ # shared frontend components
│ ├── domain/ # shared domain types & validators
│ └── observability/ # logger, tracer, metrics helpers
├── infra/
│ ├── terraform/
│ ├── kubernetes/
│ └── runbooks/
└── prisma/ # per-service schemas live in svc-*
Appendix B — Adapter contract (gRPC)
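syntax = "proto3";
package demoz.bankadapter.v1;
// Needed for google.protobuf.Empty, used by the Health RPC below.
import "google/protobuf/empty.proto";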
service BankAdapter {
rpc Send(SendRequest) returns (SendResponse);
rpc GetStatus(GetStatusRequest) returns (AttemptStatus);
rpc Refund(RefundRequest) returns (RefundResponse);
rpc PullSettlement(PullSettlementRequest) returns (SettlementBatch);
rpc Health(google.protobuf.Empty) returns (HealthResponse);
}
message SendRequest {
string idempotency_key = 1;
string amount_minor_units = 2; // e.g. "1500000" = ETB 15,000.00
string currency = 3; // "ETB"
Source source = 4;
Destination destination = 5;
string reference = 6;
map<string,string> metadata = 7;
}
message Destination {
enum Kind { BANK_ACCOUNT = 0; WALLET = 1; }
Kind kind = 1;
string identifier = 2; // acct # or phone (E.164)
string account_name = 3;
string bank_code = 4;
}
// (… rest elided for brevity, see libs/contracts when implemented …)
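Carrying amount_minor_units as a string keeps amounts exact across languages. Below is a small sketch of turning that string into a display amount without going through floating point. It assumes a currency with two decimal places (as with ETB and santim); `formatMinorUnits` is a hypothetical helper, not part of the contract.

```typescript
// Formats an integer minor-units string for display, e.g. for ETB.
// Assumption: the currency has exactly two decimal places.
function formatMinorUnits(amountMinorUnits: string, currency: string): string {
  if (!/^\d+$/.test(amountMinorUnits)) {
    throw new Error("amount_minor_units must be a non-negative integer string");
  }
  // Left-pad so there is always at least one major digit and two minor digits.
  const padded = amountMinorUnits.padStart(3, "0");
  const major = padded.slice(0, -2).replace(/^0+(?=\d)/, ""); // strip leading zeros
  const minor = padded.slice(-2);
  const grouped = major.replace(/\B(?=(\d{3})+(?!\d))/g, ","); // thousands separators
  return currency + " " + grouped + "." + minor;
}
```

With the value from the SendRequest comment, `formatMinorUnits("1500000", "ETB")` returns "ETB 15,000.00".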
Appendix C — Initial telemetry catalogue (must-have on day one)
| Metric | Type | Notes |
|---|---|---|
| http_requests_total{service,route,status} | counter | per-endpoint |
| http_request_duration_seconds{service,route} | histogram | p50/p95/p99 |
| db_query_duration_seconds{service,query} | histogram | watch p95 |
| bus_consumer_lag{consumer,topic} | gauge | <5 s healthy |
| payments_initiated_total{provider} | counter | per-rail |
| payments_succeeded_total{provider} | counter | per-rail |
| payments_failed_total{provider,reason} | counter | per-rail, per-reason |
| webhook_signature_invalid_total{provider} | counter | alert if >0 |
| kyc_decision_total{tier,outcome} | counter | business metric |
| settlement_diff_total{provider,bucket} | counter | matched / missing / extra |
| ledger_unbalanced_post_attempts_total | counter | must be 0 |
17. Polyglot tech stack — Node + Spring Boot + Go
Microservices win when you can pick the right language for the right problem. NestJS got us to MVP; for the production platform we'll deliberately mix three languages, each chosen for what it's genuinely good at.
17.1 Why polyglot
| Language | Strength | Weakness in our context |
|---|---|---|
| Node.js / NestJS | Web-facing services, BFF for our Next.js apps, JSON, fast iteration, shared types with frontend | Single-threaded; GC pauses under sustained heavy throughput; weaker for bank SOAP/ISO 8583 |
| Spring Boot (Kotlin or Java 21) | The Java ecosystem owns banking integration: CXF for SOAP, jPOS for ISO 8583, BouncyCastle for crypto, Drools for rules. Mature observability (Micrometer), threading, transactions | Heavier images, slower startup, more verbose code |
| Go | Lowest latency + highest throughput per CPU. Excellent gRPC support, goroutines for parallel fan-out, tiny static binaries, fast cold start (matters for K8s scale-out) | Smaller library ecosystem than JVM; not ideal for complex domain rules |
17.2 Service-to-language mapping (recommended)
| Service | Language | Why |
|---|---|---|
| svc-identity | NestJS | Already in our stack; mostly CRUD + JWT + integrations |
| svc-business | NestJS | CRUD-heavy; shares types with frontends |
| svc-payments | Go | Highest QPS path; needs predictable latency and tiny memory footprint; the fan-out + idempotency engine maps cleanly to goroutines and channels |
| svc-ledger | Go | Hot write path; deterministic; the journal-posting service must be the fastest service in the platform |
| svc-bank-adapters (hub) | Spring Boot (Kotlin) | Banks speak SOAP, ISO 8583, fixed-width files. jPOS, Apache CXF, Spring Integration, Spring Batch make these trivial in JVM and painful elsewhere |
| svc-settlement | Go | Batch processing of large files; concurrent diff jobs |
| svc-payroll | NestJS | Complex business logic; benefits from shared TS types with the business portal |
| svc-lending | Spring Boot | Underwriting rules engine (Drools), score cards, complex decision trees |
| svc-embedded-finance (EWA, BNPL, Equb) | NestJS | Product features with rich UI parity; co-evolves with the client app |
| svc-risk | Spring Boot | Real-time scoring with Drools; fits with svc-lending |
| svc-notify | NestJS | I/O bound (SMS/email/push); rich template ecosystem |
| svc-audit | Go | High-write append-only service; minimal logic; high throughput |
| svc-documents | NestJS | File handling, integrations (ClamAV, S3) |
| svc-webhooks | Go | Public-facing high-throughput endpoint; signature verification; needs to absorb traffic spikes |
| svc-reports | NestJS | Read-side projections; shares types with frontends |
| svc-reference | Go | Read-mostly with aggressive caching; tiny service |
This is a recommendation, not a hard rule. A team should not adopt a language just because the table says so. When we extract a service, the team building it picks the language with the engineering lead.
17.3 How three languages stay one platform
Different language, same contract — what makes polyglot survive is uniform contracts and uniform operations.
Contracts as the source of truth
- gRPC + Protocol Buffers for service-to-service. The .proto files live in libs/contracts/ and are codegen'd into:
  - TypeScript for Node services (ts-proto)
  - Kotlin/Java for Spring Boot (protoc-gen-grpc-java)
  - Go (protoc-gen-go-grpc)
- JSON Schema + AsyncAPI for events on the bus. Codegen'd into each language's event types.
- OpenAPI 3.1 for external (gateway-facing) APIs. Codegen'd into client SDKs.
The proto / schema files are reviewed like code — they're the
real interface. Implementation can change freely; the contract
cannot, except by versioning (v1, v2).
Operations as the great equaliser
Everything below is uniform across languages:
- Containers: every service ships as a Docker image; same base hardening rules; same vulnerability scan.
- Kubernetes: same Helm chart template; same liveness/readiness/startup probes; same SIGTERM handling.
- Observability: all three languages have first-class OpenTelemetry SDKs for traces + metrics + logs. Service mesh injects sidecars; no language-specific monitoring.
- Secrets: Vault Agent sidecar mounts secrets to a tmpfs the app reads at startup. Same for all three languages.
- CI/CD: per-language pipeline shape (build, test, scan, sign, promote) but the promotion gates are the same: SAST passes, vulns under threshold, container scan clean, image signed (cosign), staging soak time elapsed.
- Auth: same OIDC for human users; same SPIFFE/SPIRE workload identity for service-to-service mTLS — issued by the same internal CA regardless of language.
- Logging format: same JSON schema with mandatory fields (trace_id, span_id, service, env, level, msg, attrs). Each language uses its native logger configured to emit this shape.
Language-specific guardrails
Each language gets its own lint / format / test conventions, owned by a small enabling team or a designated maintainer:
| Language | Build | Lint | Test | Coverage gate |
|---|---|---|---|---|
| TypeScript | pnpm workspaces, nx | ESLint + Prettier | Jest + Playwright | 80% lines on svc-* |
| Kotlin | Gradle (Kotlin DSL) | ktlint + detekt | JUnit5 + Testcontainers | 80% lines |
| Go | Go modules per-service | golangci-lint (strict) | go test + Testcontainers-go | 80% lines |
Local development
Devs run only the services they're working on, plus a docker-compose.local.yml that provides Postgres, Kafka/NATS, Redis,
mock-providers. Other services run as stubbed remote via mock
servers (Prism for OpenAPI, protomock for gRPC). Nobody runs the
full platform locally.
17.4 When NOT to mix languages
Be deliberate:
- Don't pick a language to learn it. Production fintech is not a playground. The team writing a service must already be senior in its language.
- Don't fragment the team. If you have one engineer who can write Spring Boot today, that's a bus factor of one. Either hire/upskill before committing the service to JVM, or use a language the team already knows well.
- Don't optimise prematurely. A NestJS service can do plenty of throughput. Only move svc-payments and svc-ledger to Go when you have actual profiling data showing Node is the bottleneck.
- Don't pick exotic stacks (Rust, Elixir, Zig). They are beautiful, but the talent market in Addis is thin and the integration libraries for banking aren't there yet. Java / Node / Go cover us cleanly.
17.5 Hiring & team shape
| Role | Approx team size for Phase 1+2 |
|---|---|
| Platform / infra engineer | 1 senior |
| Backend engineers (Node) | 2 |
| Backend engineers (Go) | 2 senior — for svc-payments, svc-ledger, svc-webhooks, svc-audit |
| Backend engineer (JVM) | 1–2 senior — for svc-bank-adapters, svc-lending, svc-risk |
| Security engineer | 1 (can be fractional initially) |
| QA / SDET | 1 |
| SRE / on-call | 1 (rotating with backend engineers initially) |
Cross-training plan: every backend engineer should be able to read all three languages within 6 months. Writing them is optional.
18. Security parameters — what the bank questionnaire will ask
When a bank, MFI, or wallet operator gives us their security
questionnaire (and they all do), this is the answer set. We
maintain a separate, bank-handover-ready version of this in
SECURITY_CONTROLS.md — the file is
structured Q&A so it can be copy-pasted into vendor questionnaires
with minimal editing.
The headline controls we'll be asked about, and our position:
| Domain | Control | Position |
|---|---|---|
| Encryption at rest | AES-256-GCM, KMS-managed keys, per-tenant DEKs, automatic rotation | Built-in |
| Encryption in transit | TLS 1.3 only, modern cipher suites only, HSTS; certificate pinning to providers where supported (browser HPKP is deprecated) | Enforced |
| Authentication (user) | Phone + Argon2id + mandatory MFA (TOTP / WebAuthn) | Implemented |
| Authentication (service-to-service) | mTLS via SPIFFE/SPIRE, short-lived (24 h) certs | Enforced |
| Authorization | RBAC + OPA policy engine; least privilege | Enforced |
| Webhook signing | HMAC-SHA256 or detached JWS; constant-time verify; nonce + timestamp | Mandatory |
| Idempotency | UUIDv4 keys, 24 h TTL, response cached | Built-in |
| Audit log | WORM (Postgres + S3 Object Lock), 7-year retention, Merkle-rooted daily, SIEM-streamed | Implemented |
| Secrets | HashiCorp Vault + dynamic per-service secrets, no env files in production | Mandatory |
| Vulnerability mgmt | SAST + SCA + container scan in CI; Critical 48 h, High 7 d SLA | Process |
| Penetration testing | External independent test pre-launch and annually; segment-rescoped on major change | Scheduled |
| Bug bounty / responsible disclosure | security.txt + private programme via HackerOne or local equiv | Planned |
| Network | WAF + DDoS at edge, private VPC, no service public-by-default, egress allow-list | Architecture |
| Data residency | Customer + transaction data resident in Ethiopia (cloud region or in-country DC) | Policy |
| Backup & recovery | 5-min Postgres PITR, 30-d hot, 7-y cold, monthly restore drill | Tested |
| BCP / DR | Multi-AZ active-active, DR region async, quarterly failover drill, RPO 1 min / RTO 5 min on critical | Tested |
| Incident response | Documented playbook, 15-min ack SLA for SEV1, post-mortem within 5 business days | Process |
| Change management | All prod changes via PR + 2 reviewers + signed image promotion + canary | Enforced |
| Vendor / third-party risk | Vendor security review (signed DPA, SOC 2 if available), annual re-review | Process |
| Compliance roadmap | ISO 27001 within 18 months; SOC 2 Type II at scale; PCI DSS scope minimised through certified processors | Roadmap |
| Right to audit | Per contract, with reasonable notice | Available |
For the detailed control-by-control answers banks expect (the
ones that fit in their spreadsheet), see SECURITY_CONTROLS.md.
Sign-off
This plan is a draft. Before we commit engineering time, the following sign-offs are needed:
- Engineering lead — architecture, sequencing
- Security lead — threat model, controls
- Compliance / legal — NBE alignment, AML programme, data residency
- Product — phase ordering vs commercial commitments
- Finance — infra and tooling budget
Once signed off, this document becomes the canonical reference and every architectural deviation requires a one-page RFC referencing the section it changes.