LONG_TERM_IAM_ARCHITECTURE.md — DemozPay
Companion to AUTH_SYSTEM_REVIEW.md and AUTH_RISK_MATRIX.md. Covers: the 4-option comparison (Task 4), the keep/migrate/fork/replace/isolate decision (Task 6), and the ideal target IAM (Task 7).
4. The four options, compared
Scoring 1–5 (5 = best for a regulated multi-tenant fintech).
| Criterion | A — Keep + harden | B — Fork Better Auth | C — Custom from scratch | D — Enterprise IAM (Ory / Keycloak / Auth0) |
|---|---|---|---|---|
| Architecture fit (modular monolith, TS) | 5 | 4 | 4 | 3 (adds an external dependency / network hop) |
| Fintech suitability | 3 | 3 | 2 | 5 |
| Security posture | 3 | 3 | 2 | 5 |
| Auditability | 2 | 3 (you can add hooks) | 3 (you own it) | 5 |
| Tenancy support | 4 (org plugin + RLS) | 4 | 3 | 4 (Keycloak realms / Ory projects need mapping) |
| RBAC flexibility | 2 (3 coarse roles) | 3 | 5 (anything you build) | 5 (Ory Keto / Keycloak authz) |
| Scalability | 3 (DB sessions) | 3 | 3 | 5 |
| Operational complexity | 5 (lowest — it's in-process) | 3 | 2 | 2 (run/patch an IAM cluster) |
| Lock-in risk | 4 (mitigated by abstraction) | 3 (own fork) | 5 (none) | 2 (Auth0) / 4 (Ory/Keycloak OSS) |
| Migration complexity (to adopt) | 5 (already there) | 4 | 1 | 2 |
| Engineering burden (to reach target) | Low (weeks of hardening) | Med-High (own + maintain fork) | Very High (6–9 mo to parity) | Med-High (2–4 mo migration) |
| Regulator perception | Neutral | Neutral | Negative | Positive |
| Partner-bank perception | Neutral→Positive after hardening | Neutral | Negative | Positive |
| Recommended stage | Pilot + post-pilot | Avoid unless forced | Avoid | Scale-up / enterprise (triggered) |
Narrative per option
A — Keep Better Auth and harden around it. ✅ Recommended now. Lowest risk, fastest, in-process (no new infra). The catch: you must consciously harden the surrounding controls (S2S auth, admin MFA, rate limiting, OTP provider, auth-event log) — none of which Better Auth does for you, and none of which a different library would either. Pair with the isolation abstraction (§6) so A is not a dead end.
B — Fork Better Auth internally. ⚠️ Avoid unless forced. You'd fork only to (a) patch a CVE upstream won't, or (b) hold back a breaking change. Both are reactive triggers, not a strategy. A fork means you own security patches forever and lose easy upstream upgrades. Keep it in your back pocket as a contingency, not a plan.
C — Build custom auth from scratch. ❌ Avoid. Six-to-nine months to reach parity with what you already run, re-opening every account-takeover bug the ecosystem solved. "We rolled our own auth" is a negative signal to bank TPRM teams. The only thing you should custom-build is the thin abstraction glue and the auth-event log — never the primitives. See review §3.
D — Move to a dedicated IAM. ✅ The eventual destination, when triggered.
- Ory (Kratos + Hydra + Keto) — best architectural fit: API-first, self-hosted, Go (matches your two-language ceiling on the infra side), Keto gives real ABAC/ReBAC. Highest operational cost but lowest lock-in.
- Keycloak — mature, batteries-included, realms ≈ tenants, strong OIDC/SAML for employer SSO. JVM operational footprint; realm-per-tenant scaling needs care.
- Auth0 / WorkOS — fastest to enterprise SSO + SCIM, best for "land an enterprise employer this quarter." Highest lock-in and per-MAU cost — dangerous for a high-volume, low-ARPU emerging-market user base (salary workers). Consider WorkOS/Auth0 for the B2B employer-admin realm and self-hosted for the consumer/employee realm.
Decision triggers that should flip A → D (any one):
- First partner-bank security questionnaire demanding attested IAM / SSO / SCIM.
- First enterprise employer demanding SAML/OIDC SSO for their admins.
- Multi-region requirement (session federation pain, R14).
- MFA/identity-assurance requirements (e.g. NBE/regulator) beyond what plugins cover.
- RBAC needs exceed coarse roles (separation-of-duties across payroll/lending/reconciliation).
6. The decision: keep, migrate, fork, replace, or isolate?
Decision: ISOLATE NOW, KEEP for pilot + post-pilot, MIGRATE later only on a trigger. Do not fork. Do not rewrite.
Concretely:
- Keep Better Auth as the credential + session + org engine through pilot and early production.
- Isolate it behind a thin internal IdentityProvider port immediately. This is the single most important architectural move: it makes A reversible and de-risks every future option. It is consistent with the repo's own ports-and-adapters discipline ([[project_ewa_canonical_wiring]] pattern).
- Migrate to a dedicated IAM (likely Ory, plus WorkOS/Auth0 for enterprise-employer SSO) at the scale-up gate, only when a concrete trigger above fires.
- Do not fork — reactive contingency only.
- Do not rewrite — no requirement justifies the risk.
The isolation abstraction (build this now)
The goal: domain + application code never imports better-auth and never reads a Better-Auth-shaped object. They depend on your contracts. Today the leak points are session.middleware.ts (imports better-auth/node) and better-auth.factory.ts. Quarantine both behind:
packages/shared/identity/   ← NEW: vendor-neutral identity contracts (no better-auth import)
  IdentityProvider (port)       resolveSession(headers) → Principal | null
                                issueChallenge / verifyOtp / ...
  Principal                     { userId, email, emailVerified, tenantId?, mfaLevel, deviceId? }
  AuthorizationService (port)   can(principal, action, resource, ctx) → boolean (ABAC-ready)
  AuthEventSink (port)          record(authEvent) ← immutable audit, IDP-independent
apps/api/src/identity/auth/
  better-auth.identity-provider.ts   ← the ONLY file that imports better-auth (adapter)
  better-auth.factory.ts             (unchanged, hidden behind the adapter)
Everything else (SessionMiddleware, guards, controllers) depends on @demoz-pay/shared-identity, not on Better Auth. Swapping to Ory later = writing one new adapter (ory.identity-provider.ts) and flipping a binding. This converts the §D migration matrix from "rewrite" to "new adapter + reconciliation."
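The contracts listed above could look roughly like the following TypeScript. This is a minimal sketch under stated assumptions: the field types, the `MfaLevel` numbering, the `InMemoryIdentityProvider` stand-in, the `requirePrincipal` helper, and the `x-session-token` header are all hypothetical and exist only for illustration. The real adapter would wrap Better Auth inside better-auth.identity-provider.ts.

```typescript
// Vendor-neutral identity contracts (sketch). Nothing here imports better-auth.
// Field types and the MfaLevel numbering are assumptions, not the repo's types.
export type MfaLevel = 0 | 1 | 2;
export type Headers = Record<string, string | undefined>;

export interface Principal {
  userId: string;
  email: string;
  emailVerified: boolean;
  tenantId?: string;
  mfaLevel: MfaLevel;
  deviceId?: string;
}

export interface IdentityProvider {
  // Resolves to null when the request carries no valid session.
  resolveSession(headers: Headers): Promise<Principal | null>;
}

export interface AuthorizationService {
  can(principal: Principal, action: string, resource: { tenantId: string }, ctx?: unknown): boolean;
}

export interface AuthEventSink {
  record(authEvent: { eventType: string; actorUserId: string; result: string }): void;
}

// Hypothetical stand-in adapter. A real adapter would translate the vendor's
// session object into a Principal here, and only here.
export class InMemoryIdentityProvider implements IdentityProvider {
  private readonly sessions: Map<string, Principal>;

  constructor(sessions: Map<string, Principal>) {
    this.sessions = sessions;
  }

  async resolveSession(headers: Headers): Promise<Principal | null> {
    const token = headers["x-session-token"]; // header name is illustrative
    if (!token) return null;
    return this.sessions.get(token) ?? null;
  }
}

// Application code depends on the port only, so the binding can be swapped.
export async function requirePrincipal(idp: IdentityProvider, headers: Headers): Promise<Principal> {
  const principal = await idp.resolveSession(headers);
  if (!principal) throw new Error("UNAUTHENTICATED");
  return principal;
}
```

Because `requirePrincipal` only sees the `IdentityProvider` interface, replacing the in-memory stand-in with a Better Auth or Ory adapter would not change its callers.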
7. Ideal long-term fintech IAM architecture
Design principles: default-deny · least privilege · zero-trust between services · every privileged action is attributable and tamper-evident · identity assurance scales with money risk.
7.1 Identity provider architecture — multi-realm
Do not put salary-worker consumers, employer admins, and platform operators in one undifferentiated identity pool. Separate realms with separate policies:
┌─────────────────────────────────────────────────────────────────────────┐
│ REALM 1 — Consumer (employees) primary: phone-OTP, optional passkey │
│ assurance: phone-verified; step-up TOTP/passkey for money actions │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 2 — Employer / FI / Merchant admins email+password+MFA, or SSO │
│ enterprise tenants: OIDC/SAML federation + SCIM provisioning │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 3 — Platform operators (DemozPay staff) MFA MANDATORY + step-up │
│ separate IDP/realm, hardware-key preferred, all actions = audit + dual- │
│ control for high-risk ops (KYC override, manual ledger adjustment) │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 4 — Machines (services, partners) mTLS + workload identity / OAuth │
│ client-credentials; NO human credentials ever │
└─────────────────────────────────────────────────────────────────────────┘
Behind the IdentityProvider port, realms can be served by different engines (e.g. self-hosted Ory Kratos for Realm 1+3, WorkOS/Auth0 for Realm 2 enterprise SSO) without the domain code knowing.
7.2 Session architecture
- Consumer/admin web: opaque, server-side, revocable sessions (what Better Auth gives you); keep them. Add a short-TTL session cache (Redis) in front of the DB lookup to kill R9; cache holds {principal, expiresAt}, invalidated on logout/revoke.
- Idle + absolute timeouts: distinct (e.g. 30-min idle, 7-day absolute); money apps should not have week-long idle sessions.
- Step-up: a money-moving action requires a session whose mfaLevel/authTime satisfies a freshness + MFA policy; otherwise re-challenge. Encode as a claim on the Principal.
- "Log out everywhere" + device list: session rows already carry ipAddress/userAgent; expose management UI.
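The timeout, step-up, and cache rules above can be sketched as pure functions plus a tiny cache. The `SessionRecord` shape, the 5-minute step-up freshness window, and the numeric `mfaLevel` scale are assumptions for illustration; a `Map` stands in for Redis, and the cache stores the whole session record rather than only the principal.

```typescript
// Sketch of session validity and step-up checks. Names and limits are
// illustrative assumptions (30-min idle, 7-day absolute, 5-min step-up window).
export interface SessionRecord {
  principalId: string;
  createdAt: number; // epoch ms
  lastSeenAt: number; // epoch ms, updated on each authenticated request
  mfaLevel: number; // assumed scale: 0 = none, 1 = OTP, 2 = TOTP/passkey
  authTime: number; // epoch ms of the last successful (re-)authentication
}

const MINUTE_MS = 60_000;
export const IDLE_TIMEOUT_MS = 30 * MINUTE_MS;
export const ABSOLUTE_TIMEOUT_MS = 7 * 24 * 60 * MINUTE_MS;
export const STEP_UP_FRESHNESS_MS = 5 * MINUTE_MS;

// A session is live only if BOTH the idle and the absolute limits hold.
export function isSessionLive(s: SessionRecord, now: number): boolean {
  const idleOk = now - s.lastSeenAt <= IDLE_TIMEOUT_MS;
  const absoluteOk = now - s.createdAt <= ABSOLUTE_TIMEOUT_MS;
  return idleOk && absoluteOk;
}

// Step-up passes only with a strong enough AND recent enough authentication.
export function satisfiesStepUp(s: SessionRecord, requiredMfaLevel: number, now: number): boolean {
  return s.mfaLevel >= requiredMfaLevel && now - s.authTime <= STEP_UP_FRESHNESS_MS;
}

// Short-TTL cache in front of the DB lookup. A Map stands in for Redis here.
export class SessionCache {
  private readonly entries = new Map<string, { session: SessionRecord; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(sessionId: string, now: number): SessionRecord | undefined {
    const hit = this.entries.get(sessionId);
    if (!hit) return undefined;
    if (hit.expiresAt <= now) {
      this.entries.delete(sessionId);
      return undefined;
    }
    return hit.session;
  }

  put(sessionId: string, session: SessionRecord, now: number): void {
    this.entries.set(sessionId, { session, expiresAt: now + this.ttlMs });
  }

  // Call on logout/revoke so a revoked session is never served from cache.
  invalidate(sessionId: string): void {
    this.entries.delete(sessionId);
  }
}
```

Passing `now` explicitly keeps the checks deterministic in tests. The cache TTL bounds how long a revocation can go unnoticed if an `invalidate` call is ever missed.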
7.3 RBAC → ABAC model
Today: 3 coarse org roles. Target: roles for coarse gating + attribute/policy for fine decisions.
Principal{ userId, tenantId, realm, roles[], mfaLevel, deviceTrust, authTime }
Resource{ type, tenantId, ownerId, amount, status }
Policy(action, principal, resource, env) → permit | deny
e.g. permit "loan.disburse" IF principal.role in {owner,finance}
AND principal.tenantId == resource.tenantId
AND principal.mfaLevel >= 2
AND resource.amount <= principal.approvalLimit
AND NOT sameActor(principal, resource.requestedBy) ← separation of duties
Engine options: Ory Keto (ReBAC), OPA/Cedar (policy-as-code). Keep the AuthorizationService.can(...) port so the engine is swappable. Separation-of-duties (maker/checker) is a regulatory must for payroll disbursement and manual ledger adjustment — coarse roles can't express it.
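The "loan.disburse" rule above can be written as a small default-deny function. This is a sketch, not a policy engine: `approvalLimit` and `requestedBy` appear in the rule but not in the Principal/Resource shapes listed, so where they live here is an assumption, as is the `canDisburseLoan` name.

```typescript
// Sketch of the "loan.disburse" rule as a pure, default-deny policy function.
export type Decision = "permit" | "deny";

export interface Principal {
  userId: string;
  tenantId: string;
  roles: string[];
  mfaLevel: number;
  approvalLimit: number; // assumed to live on the principal, in minor units
}

export interface LoanResource {
  type: "loan";
  tenantId: string;
  amount: number;
  requestedBy: string; // userId of the maker (assumed field)
}

const DISBURSE_ROLES: ReadonlySet<string> = new Set(["owner", "finance"]);

export function canDisburseLoan(p: Principal, r: LoanResource): Decision {
  const hasRole = p.roles.some((role) => DISBURSE_ROLES.has(role));
  const sameTenant = p.tenantId === r.tenantId;
  const mfaOk = p.mfaLevel >= 2;
  const withinLimit = r.amount <= p.approvalLimit;
  const separateActor = p.userId !== r.requestedBy; // maker cannot be checker
  // Every condition must hold; anything else is a deny.
  return hasRole && sameTenant && mfaOk && withinLimit && separateActor ? "permit" : "deny";
}
```

The last condition is the one coarse roles cannot express: a finance user who created the request is denied even though the role check passes.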
7.4 Organization / employee–business relationship model
Keep the elegant invariant: Organization.id == Business.id, Member(user, org, role). Extend:
- An employee belongs to a Business via an Employee profile and may have a Member row (for self-service). Distinguish "employee of" (payroll subject) from "member of" (admin actor): they are different relationships and must not be conflated in authorization.
- A user may be a Member of several orgs (accountant serving multiple employers); activeOrganizationId gates the request tenant, RLS enforces at DB. This already works; preserve it.
7.5 Audit attribution model (compliance-grade)
Introduce an append-only, tamper-evident AuthEvent log, separate from mutable Session, written in the same spirit as the financial outbox/audit (ADR-008):
AuthEvent{ id, ts, actorUserId, onBehalfOf?, realm, tenantId?, eventType,
ip, deviceId, mfaLevel, result, requestId, prevHash, hash }
eventTypes: LOGIN, LOGOUT, MFA_ENROLL, MFA_CHALLENGE, ROLE_GRANT, ROLE_REVOKE,
ADMIN_ACTION, IMPERSONATION_START/END, PASSWORD_RESET, KEY_ROTATION
Hash-chain (prevHash) makes tampering detectable. This is the artifact a regulator/bank asks for and the current system lacks (R6).
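A minimal sketch of how the prevHash/hash chain detects tampering, using SHA-256 from Node's built-in crypto module. The field set is trimmed for brevity, and the canonicalization (a JSON array in fixed field order), the all-zero genesis value, and the function names are assumptions.

```typescript
import { createHash } from "node:crypto";

// Sketch of a hash-chained append-only log with a trimmed AuthEvent shape.
export interface AuthEvent {
  id: number;
  ts: string;
  actorUserId: string;
  eventType: string;
  result: string;
  prevHash: string;
  hash: string;
}

type NewEvent = Pick<AuthEvent, "ts" | "actorUserId" | "eventType" | "result">;

const GENESIS = "0".repeat(64); // assumed prevHash of the first entry

// Hash covers every field, including prevHash, in a fixed order.
function digest(e: Omit<AuthEvent, "hash">): string {
  const canonical = JSON.stringify([e.id, e.ts, e.actorUserId, e.eventType, e.result, e.prevHash]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Returns a new log with the event appended and linked to the previous entry.
export function append(log: readonly AuthEvent[], input: NewEvent): AuthEvent[] {
  const last = log[log.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const body = { id: log.length + 1, ...input, prevHash };
  return [...log, { ...body, hash: digest(body) }];
}

// Returns the index of the first inconsistent entry, or -1 if the chain is intact.
export function verifyChain(log: readonly AuthEvent[]): number {
  let prev = GENESIS;
  let index = 0;
  for (const entry of log) {
    const { hash, ...body } = entry;
    if (body.prevHash !== prev || digest(body) !== hash) return index;
    prev = hash;
    index += 1;
  }
  return -1;
}
```

A chain alone does not stop someone with write access from rewriting an entry and recomputing every later hash, so the latest hash should also be anchored somewhere the log writer cannot modify.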
7.6 Machine-to-machine auth (fixes R1)
- API ↔ ledger / integration-gateway: mTLS (SPIFFE/SPIRE workload identity or, minimally, a private CA issuing per-service certs) + a short-lived service token in gRPC metadata. Replace every credentials.createInsecure(). This is non-negotiable for "ledger is the money truth."
- Authorization on the ledger side: the ledger must verify which service is calling and reject unexpected callers (defense in depth; don't assume network isolation).
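The ledger-side check in the second bullet amounts to a default-deny allowlist keyed by the caller's verified workload identity. The sketch below is in TypeScript only to stay consistent with the other examples (the ledger itself is a Go service); the SPIFFE IDs, the RPC method names, and `authorizeCaller` are hypothetical, and the caller ID is assumed to come from the already-verified mTLS peer certificate.

```typescript
// Sketch: per-method allowlist of caller workload identities (default deny).
// All IDs and method names below are hypothetical examples.
const ALLOWED_CALLERS = new Map<string, ReadonlySet<string>>([
  ["ledger.PostEntry", new Set(["spiffe://demozpay.internal/api"])],
  [
    "ledger.GetBalance",
    new Set(["spiffe://demozpay.internal/api", "spiffe://demozpay.internal/integration-gateway"]),
  ],
]);

// callerId must be taken from the verified peer certificate, never from a
// header the caller can set freely.
export function authorizeCaller(method: string, callerId: string | undefined): boolean {
  if (!callerId) return false; // no verified identity: reject
  const allowed = ALLOWED_CALLERS.get(method);
  if (allowed === undefined) return false; // unknown method: reject
  return allowed.has(callerId);
}
```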
7.7 Bank-partner auth
- Inbound webhooks: keep HMAC, add a nonce/jti store (Redis with TTL = skew window) to make replays single-use (fixes R7). Per-partner signing keys in the secrets manager, rotatable.
- Outbound to banks: mTLS + per-partner client credentials (OAuth client-credentials or signed requests), partner credentials never in env files — in a secrets manager (§7.13) with rotation.
7.8 Webhook auth (general)
Standardize the ${timestamp}.${nonce}.${body} HMAC + nonce-store + per-sender key as the one webhook auth primitive (the code comment already anticipates reuse). Document it as a control in SECURITY_CONTROLS.md.
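A sketch of that primitive using Node's built-in crypto module. The 5-minute skew window, hex signature encoding, the `Verdict` names, and the `Map`-based nonce store (standing in for Redis with a TTL) are assumptions. The nonce TTL is set to twice the skew so that a message stamped at the far future edge of the window cannot be replayed after its nonce has expired.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of `${timestamp}.${nonce}.${body}` HMAC verification with single-use nonces.
const SKEW_MS = 5 * 60_000; // assumed acceptance window, applied in both directions
const NONCE_TTL_MS = 2 * SKEW_MS;

export type Verdict = "ok" | "stale" | "bad-signature" | "replay";

export interface WebhookMessage {
  timestamp: number; // epoch ms
  nonce: string;
  body: string;
  signature: string; // hex-encoded HMAC-SHA256
}

export function sign(key: string, timestamp: number, nonce: string, body: string): string {
  return createHmac("sha256", key).update(`${timestamp}.${nonce}.${body}`).digest("hex");
}

export class NonceStore {
  private readonly seen = new Map<string, number>(); // nonce -> expiry (epoch ms)

  // Returns true only the first time a nonce is presented within its TTL.
  claim(nonce: string, now: number): boolean {
    this.seen.forEach((expiry, n) => {
      if (expiry <= now) this.seen.delete(n);
    });
    if (this.seen.has(nonce)) return false;
    this.seen.set(nonce, now + NONCE_TTL_MS);
    return true;
  }
}

export function verifyWebhook(key: string, store: NonceStore, msg: WebhookMessage, now: number): Verdict {
  if (Math.abs(now - msg.timestamp) > SKEW_MS) return "stale";
  // "." is the field separator: a nonce containing it would let an attacker
  // re-split nonce and body under the same signature.
  if (msg.nonce.includes(".")) return "bad-signature";
  const expected = Buffer.from(sign(key, msg.timestamp, msg.nonce, msg.body), "hex");
  const given = Buffer.from(msg.signature, "hex");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return "bad-signature";
  // Claim the nonce only after the signature checks out, so unauthenticated
  // senders cannot burn nonces.
  return store.claim(msg.nonce, now) ? "ok" : "replay";
}
```

With a real Redis store, the claim step would need to be a single atomic set-if-absent-with-expiry operation so that two concurrent deliveries of the same message cannot both succeed.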
7.9 MFA architecture (fixes R8, R2)
- Wire the Better Auth TOTP plugin now (table already exists). Backup codes hashed.
- Mandatory MFA for Realm 3 (operators) and for employer/FI admins doing money operations.
- Step-up for high-risk consumer actions (large EWA, loan acceptance) even on phone-OTP accounts (re-OTP or passkey).
- Roadmap to WebAuthn/passkeys for admins (phishing-resistant) — the real win for a fintech.
7.10 KYC linkage
KYC is an identity-assurance attribute, not an auth gate: Principal.identityAssurance ∈ {none, phone, kyc-basic, kyc-full, sanctions-cleared}. Authorization policies reference it (e.g. loan disbursement requires kyc-full + sanctions-cleared). Keep KYC state on the domain side (Business.kycVerified, future per-employee KYC), surface it as a claim. Future AML/sanctions screening becomes another assurance attribute + an AuthEvent/audit record.
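Since the example policy requires kyc-full and sanctions-cleared together, one way to model assurance is as a set of held attributes rather than a single ordered level. The sketch below takes that reading, which is an assumption; the "ewa.request" action and its requirement are hypothetical, and "none" is represented by an empty set.

```typescript
// Sketch: identity assurance as a set of held attributes checked per action.
export type Assurance = "phone" | "kyc-basic" | "kyc-full" | "sanctions-cleared";

const REQUIRED_ASSURANCE = new Map<string, readonly Assurance[]>([
  ["loan.disburse", ["kyc-full", "sanctions-cleared"]], // from the text above
  ["ewa.request", ["phone"]], // hypothetical action and requirement
]);

// Default deny: an action with no declared requirement is not permitted.
export function meetsAssurance(action: string, held: ReadonlySet<Assurance>): boolean {
  const required = REQUIRED_ASSURANCE.get(action);
  if (required === undefined) return false;
  return required.every((attribute) => held.has(attribute));
}
```

Authentication still succeeds for a user who holds none of these attributes; only the actions that declare a requirement are gated.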
7.11 Device trust
Record deviceId (cookie/secure-storage bound) on first auth; treat new device as lower trust → step-up. Feed deviceTrust into ABAC. Especially valuable for the consumer phone-OTP realm where SIM-swap is a real Ethiopian-market threat.
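A minimal sketch of the new-device rule. The two-level trust scale, the in-memory `Map` (standing in for persistent storage), and the choice to promote a device only after a successful step-up are assumptions made for illustration.

```typescript
// Sketch: per-user device registry feeding a step-up decision.
export type DeviceTrust = "known" | "new";

export class DeviceRegistry {
  private readonly devicesByUser = new Map<string, Set<string>>();

  trustOf(userId: string, deviceId: string): DeviceTrust {
    return this.devicesByUser.get(userId)?.has(deviceId) ? "known" : "new";
  }

  // Assumed to be called only after the user passes a step-up on this device.
  promote(userId: string, deviceId: string): void {
    const devices = this.devicesByUser.get(userId) ?? new Set<string>();
    devices.add(deviceId);
    this.devicesByUser.set(userId, devices);
  }
}

// New devices trigger step-up for money actions; known devices do not.
export function requiresStepUp(trust: DeviceTrust, isMoneyAction: boolean): boolean {
  return isMoneyAction && trust === "new";
}
```

Promoting only after step-up means a SIM-swap attacker who passes phone-OTP on a fresh device still faces the second challenge before any money action.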
7.12 Admin / operator auth + support impersonation controls (fixes R2)
- Operators in Realm 3, MFA-mandatory, ideally hardware keys.
- No standing cross-tenant superuser. Replace the unconditional checkType==='PLATFORM' bypass with scoped, time-boxed, dual-controlled grants: an operator requests access to tenant X for a reason; it's approved (or auto-approved for low-risk read), expires, and is fully audited.
- Impersonation ("support as user"): explicit IMPERSONATION_START/END AuthEvents, banner in UI, read-mostly by default, write requires elevated approval, never silent. The current code has no impersonation concept; design it before someone builds an ad-hoc backdoor.
- Reconciliation operator controls: reconciliation/ledger-adjustment is the highest-risk operator surface. Maker/checker (dual control), every adjustment an AuthEvent + financial audit entry, scoped role distinct from general platform-admin.
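The scoped, time-boxed, dual-controlled grant described above could be checked along these lines. The `AccessGrant` shape, the read/write scope split, and the rule that only read grants may be auto-approved are a sketch under assumptions drawn from the bullets; none of the names exist in the current code.

```typescript
// Sketch: validating an operator's cross-tenant access grant (default deny).
export type Scope = "read" | "write";

export interface AccessGrant {
  operatorId: string;
  tenantId: string;
  scope: Scope;
  reason: string;
  approvedBy: string | null; // null = auto-approved (assumed valid for read only)
  expiresAt: number; // epoch ms
}

export function isGrantValid(
  grant: AccessGrant,
  operatorId: string,
  tenantId: string,
  needed: Scope,
  now: number,
): boolean {
  if (grant.operatorId !== operatorId || grant.tenantId !== tenantId) return false; // scoped
  if (now >= grant.expiresAt) return false; // time-boxed
  if (grant.reason.trim() === "") return false; // every grant must carry a reason
  if (grant.approvedBy === grant.operatorId) return false; // no self-approval
  if (grant.scope === "write" && grant.approvedBy === null) return false; // dual control
  if (needed === "write" && grant.scope !== "write") return false; // read grant cannot write
  return true;
}
```

A guard built on this check would replace the unconditional platform bypass, and each evaluation, whether permitted or denied, would be a natural point to emit an ADMIN_ACTION AuthEvent.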
7.13 Secrets-management interaction
- BETTER_AUTH_SECRET, BANK_WEBHOOK_SIGNING_KEY, per-partner keys, service mTLS certs → a secrets manager (Vault / cloud KMS), not env files. Rotation runbooks. Key rotation emits KEY_ROTATION AuthEvents.
- Session-signing/secret rotation must support overlap (old+new valid during rotation) to avoid mass logout.
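The overlap requirement can be illustrated with a generic key ring: new values are signed with the current key, while values signed with the previous key remain valid until a retirement time. This is a sketch of the general technique, not a description of how Better Auth handles its secret; the `KeyRing` shape, the key-ID prefix format, and the function names are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch: HMAC signing with a current key and a still-accepted previous key.
export interface KeyRing {
  current: { id: string; secret: string };
  previous?: { id: string; secret: string; retireAt: number }; // epoch ms
}

function mac(secret: string, payload: string): Buffer {
  return createHmac("sha256", secret).update(payload).digest();
}

// Token format (assumed): "<keyId>.<hex mac>". Key IDs must not contain ".".
export function signToken(ring: KeyRing, payload: string): string {
  return `${ring.current.id}.${mac(ring.current.secret, payload).toString("hex")}`;
}

export function verifyToken(ring: KeyRing, payload: string, token: string, now: number): boolean {
  const dot = token.indexOf(".");
  if (dot < 0) return false;
  const keyId = token.slice(0, dot);
  const given = Buffer.from(token.slice(dot + 1), "hex");

  // Pick the secret by key ID; the previous key is honored only until retireAt.
  let secret: string | undefined;
  if (keyId === ring.current.id) {
    secret = ring.current.secret;
  } else if (ring.previous && keyId === ring.previous.id && now < ring.previous.retireAt) {
    secret = ring.previous.secret;
  }
  if (secret === undefined) return false;

  const expected = mac(secret, payload);
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Setting `retireAt` to the rotation time plus the absolute session lifetime would let every pre-rotation session expire naturally instead of being logged out at once.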
7.14 Service-to-service auth (internal)
Every internal call (API↔Go services, future domain services) authenticates with workload identity (mTLS/SPIFFE). Zero implicit trust from "same network." The ledger, as money truth, authorizes its callers.
7.15 Zero-trust direction
Target end-state, incrementally:
- No network position grants trust; every hop authenticates (mTLS) and authorizes (policy).
- Every principal (human or machine) carries a verifiable identity + assurance level.
- Every privileged action is least-privilege, time-boxed, and tamper-evidently audited.
- Money-risk drives identity-assurance and MFA-freshness requirements (risk-adaptive auth).
Target architecture diagram
┌────────────── IDENTITY PLANE ──────────────┐
Consumers (phone-OTP) ──▶ Realm 1 ─┐ │
Employer/FI admins ─────▶ Realm 2 ─┤ IdentityProvider PORT │ ← domain code depends on PORT,
Operators (MFA+keys) ───▶ Realm 3 ─┤ (better-auth adapter today, │ never on the vendor
Services / partners ────▶ Realm 4 ─┘ ory/workos adapter later) │
└───────────────────┬────────────────────────┘
│ Principal{ id, tenant, roles, mfaLevel, assurance, device }
┌───────────────────▼────────────────────────┐
│ AuthN gate → AuthZ (ABAC/ReBAC policy) → │
│ tenant ALS → RLS (+ step-up on money risk) │
└───────────────────┬────────────────────────┘
│ every privileged action →
┌───────────────────▼────────────────────────┐
│ AuthEvent log (append-only, hash-chained) │ ← regulator/bank artifact
└─────────────────────────────────────────────┘
API ◀── mTLS + svc-token ──▶ ledger (Go) partner banks ◀── mTLS + HMAC(+nonce store) ──▶ gateway
secrets/keys ◀── Vault/KMS (rotation, KEY_ROTATION events) ──▶ all of the above
This is the destination. You do not build it all now — you build the ports now (§6), close the 🔴 risks (matrix R1–R4) next, and grow into the rest at the scale-up gate. Sequencing is in AUTH_MIGRATION_STRATEGY.md.