LONG_TERM_IAM_ARCHITECTURE.md — DemozPay

Companion to AUTH_SYSTEM_REVIEW.md and AUTH_RISK_MATRIX.md. Covers: the 4-option comparison (Task 4), the keep/migrate/fork/replace/isolate decision (Task 6), and the ideal target IAM (Task 7).

4. The four options, compared

Scoring 1–5 (5 = best for a regulated multi-tenant fintech).

| Criterion | A: Keep + harden | B: Fork Better Auth | C: Custom from scratch | D: Enterprise IAM (Ory / Keycloak / Auth0) |
|---|---|---|---|---|
| Architecture fit (modular monolith, TS) | 5 | 4 | 4 | 3 (adds an external dependency / network hop) |
| Fintech suitability | 3 | 3 | 2 | 5 |
| Security posture | 3 | 3 | 2 | 5 |
| Auditability | 2 | 3 (you can add hooks) | 3 (you own it) | 5 |
| Tenancy support | 4 (org plugin + RLS) | 4 | 3 | 4 (Keycloak realms / Ory projects need mapping) |
| RBAC flexibility | 2 (3 coarse roles) | 3 | 5 (anything you build) | 5 (Ory Keto / Keycloak authz) |
| Scalability | 3 (DB sessions) | 3 | 3 | 5 |
| Operational complexity | 5 (lowest: it's in-process) | 3 | 2 | 2 (run/patch an IAM cluster) |
| Lock-in risk | 4 (mitigated by abstraction) | 3 (own fork) | 5 (none) | 2 (Auth0) / 4 (Ory/Keycloak OSS) |
| Migration complexity (to adopt) | 5 (already there) | 4 | 1 | 2 |
| Engineering burden (to reach target) | Low (weeks of hardening) | Med-High (own + maintain fork) | Very High (6–9 mo to parity) | Med-High (2–4 mo migration) |
| Regulator perception | Neutral | Neutral | Negative | Positive |
| Partner-bank perception | Neutral→Positive after hardening | Neutral | Negative | Positive |
| Recommended stage | Pilot + post-pilot | Avoid unless forced | Avoid | Scale-up / enterprise (triggered) |

Narrative per option

A — Keep Better Auth and harden around it. ✅ Recommended now. Lowest risk, fastest, in-process (no new infra). The catch: you must consciously harden the surrounding controls (S2S auth, admin MFA, rate limiting, OTP provider, auth-event log) — none of which Better Auth does for you, and none of which a different library would either. Pair with the isolation abstraction (§6) so A is not a dead end.

B — Fork Better Auth internally. ⚠️ Avoid unless forced. You'd fork only to (a) patch a CVE upstream won't, or (b) hold back a breaking change. Both are reactive triggers, not a strategy. A fork means you own security patches forever and lose easy upstream upgrades. Keep it in your back pocket as a contingency, not a plan.

C — Build custom auth from scratch. ❌ Avoid. Six-to-nine months to reach parity with what you already run, re-opening every account-takeover bug the ecosystem solved. "We rolled our own auth" is a negative signal to bank TPRM teams. The only thing you should custom-build is the thin abstraction glue and the auth-event log — never the primitives. See review §3.

D — Move to a dedicated IAM. ✅ The eventual destination, when triggered.

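Ory (Kratos + Hydra + Keto): best architectural fit. API-first, self-hosted, Go (matches your two-language ceiling on the infra side). Keto gives real ReBAC (Zanzibar-style relationship checks); attribute-based rules would come from a policy engine such as OPA or Cedar (§7.3). Highest operational cost but lowest lock-in.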
Keycloak — mature, batteries-included, realms ≈ tenants, strong OIDC/SAML for employer SSO. JVM operational footprint; realm-per-tenant scaling needs care.
Auth0 / WorkOS — fastest to enterprise SSO + SCIM, best for "land an enterprise employer this quarter." Highest lock-in and per-MAU cost — dangerous for a high-volume, low-ARPU emerging-market user base (salary workers). Consider WorkOS/Auth0 for the B2B employer-admin realm and self-hosted for the consumer/employee realm.

Decision triggers that should flip A → D (any one):

First partner-bank security questionnaire demanding attested IAM / SSO / SCIM.
First enterprise employer demanding SAML/OIDC SSO for their admins.
Multi-region requirement (session federation pain, R14).
MFA/identity-assurance requirements (e.g. NBE/regulator) beyond what plugins cover.
RBAC needs exceed coarse roles (separation-of-duties across payroll/lending/reconciliation).

6. The decision: keep, migrate, fork, replace, or isolate?

Decision: ISOLATE NOW, KEEP for pilot + post-pilot, MIGRATE later only on a trigger. Do not fork. Do not rewrite.

Concretely:

Keep Better Auth as the credential + session + org engine through pilot and early production.
Isolate it behind a thin internal IdentityProvider port immediately (this is the single most important architectural move — it makes A reversible and de-risks every future option). This is consistent with the repo's own ports-and-adapters discipline ([[project_ewa_canonical_wiring]] pattern).
Migrate to a dedicated IAM (likely Ory, plus WorkOS/Auth0 for enterprise-employer SSO) at the scale-up gate, only when a concrete trigger above fires.
Do not fork — reactive contingency only.
Do not rewrite — no requirement justifies the risk.

The isolation abstraction (build this now)

The goal: domain + application code never imports better-auth and never reads a Better-Auth-shaped object. They depend on your contracts. Today the leak points are session.middleware.ts (imports better-auth/node) and better-auth.factory.ts. Quarantine both behind:

packages/shared/identity/           ← NEW: vendor-neutral identity contracts (no better-auth import)
  IdentityProvider (port)           resolveSession(headers) → Principal | null
                                    issueChallenge / verifyOtp / ...
  Principal                         { userId, email, emailVerified, tenantId?, mfaLevel, deviceId? }
  AuthorizationService (port)       can(principal, action, resource, ctx) → boolean   (ABAC-ready)
  AuthEventSink (port)              record(authEvent)   ← immutable audit, IDP-independent

apps/api/src/identity/auth/
  better-auth.identity-provider.ts  ← the ONLY file that imports better-auth (adapter)
  better-auth.factory.ts            (unchanged, hidden behind the adapter)

Everything else (SessionMiddleware, guards, controllers) depends on @demoz-pay/shared-identity, not on Better Auth. Swapping to Ory later = writing one new adapter (ory.identity-provider.ts) and flipping a binding. This converts the §D migration matrix from "rewrite" to "new adapter + reconciliation."
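
A minimal TypeScript sketch of the port and its call site may make the boundary concrete. The `Principal` fields and the `resolveSession` name come from the layout above; `InMemoryIdentityProvider`, `requirePrincipal`, the bearer-token header, and the `Promise` return type are assumptions made for illustration, standing in for the real Better Auth adapter.

```typescript
// Vendor-neutral contracts: nothing in this file imports better-auth.
interface Principal {
  userId: string;
  email: string;
  emailVerified: boolean;
  tenantId?: string;
  mfaLevel: number;
  deviceId?: string;
}

interface IdentityProvider {
  // Resolves to null when the request carries no valid session.
  resolveSession(headers: Record<string, string>): Promise<Principal | null>;
}

// Hypothetical stand-in adapter. A real adapter would wrap the vendor SDK
// and map its session object onto Principal.
class InMemoryIdentityProvider implements IdentityProvider {
  private readonly sessions: Map<string, Principal>;

  constructor(sessions: Map<string, Principal>) {
    this.sessions = sessions;
  }

  async resolveSession(headers: Record<string, string>): Promise<Principal | null> {
    const token = headers["authorization"]?.replace(/^Bearer /, "");
    if (!token) return null;
    return this.sessions.get(token) ?? null;
  }
}

// Application code depends on the port only, so swapping the engine
// means swapping the object passed in here.
async function requirePrincipal(
  idp: IdentityProvider,
  headers: Record<string, string>,
): Promise<Principal> {
  const principal = await idp.resolveSession(headers);
  if (principal === null) throw new Error("UNAUTHENTICATED");
  return principal;
}
```

The same in-memory adapter doubles as a test fake, which is a side benefit of keeping guards and controllers on the port.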

7. Ideal long-term fintech IAM architecture

Design principles: default-deny · least privilege · zero-trust between services · every privileged action is attributable and tamper-evident · identity assurance scales with money risk.

7.1 Identity provider architecture — multi-realm

Do not put salary-worker consumers, employer admins, and platform operators in one undifferentiated identity pool. Separate realms with separate policies:

┌─────────────────────────────────────────────────────────────────────────┐
│ REALM 1 — Consumer (employees)        primary: phone-OTP, optional passkey │
│   assurance: phone-verified; step-up TOTP/passkey for money actions        │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 2 — Employer / FI / Merchant admins   email+password+MFA, or SSO     │
│   enterprise tenants: OIDC/SAML federation + SCIM provisioning              │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 3 — Platform operators (DemozPay staff)   MFA MANDATORY + step-up     │
│   separate IDP/realm, hardware-key preferred, all actions = audit + dual-   │
│   control for high-risk ops (KYC override, manual ledger adjustment)        │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 4 — Machines (services, partners)   mTLS + workload identity / OAuth  │
│   client-credentials; NO human credentials ever                            │
└─────────────────────────────────────────────────────────────────────────┘

Behind the IdentityProvider port, realms can be served by different engines (e.g. self-hosted Ory Kratos for Realm 1+3, WorkOS/Auth0 for Realm 2 enterprise SSO) without the domain code knowing.

7.2 Session architecture

Consumer/admin web: opaque, server-side, revocable sessions (what Better Auth gives you) — keep. Add a short-TTL session cache (Redis) in front of the DB lookup to kill R9; cache holds {principal, expiresAt}, invalidated on logout/revoke.
Idle + absolute timeouts: distinct (e.g. 30-min idle, 7-day absolute) — money apps should not have week-long idle sessions.
Step-up: a money-moving action requires a session whose mfaLevel/authTime satisfies a freshness + MFA policy; otherwise re-challenge. Encode as a claim on the Principal.
"Log out everywhere" + device list: session rows already carry ipAddress/userAgent; expose management UI.

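The timeout and step-up rules above can be sketched as two pure checks. This is an illustration under stated assumptions, not Better Auth's API: `SessionRecord`, `StepUpPolicy`, and both function names are hypothetical, and the 30-minute / 7-day values are the examples quoted in the list.

```typescript
interface SessionRecord {
  createdAt: number;  // epoch ms when the session was issued
  lastSeenAt: number; // epoch ms of the most recent request
  authTime: number;   // epoch ms of the last active authentication
  mfaLevel: number;   // 0 = none; higher = stronger factor (assumed scale)
}

const IDLE_TIMEOUT_MS = 30 * 60 * 1000;
const ABSOLUTE_TIMEOUT_MS = 7 * 24 * 60 * 60 * 1000;

// A session is live only if it passes both the idle and the absolute limit.
function isSessionLive(s: SessionRecord, now: number): boolean {
  const idleOk = now - s.lastSeenAt <= IDLE_TIMEOUT_MS;
  const absoluteOk = now - s.createdAt <= ABSOLUTE_TIMEOUT_MS;
  return idleOk && absoluteOk;
}

interface StepUpPolicy {
  minMfaLevel: number;
  maxAuthAgeMs: number; // how recent the last authentication must be
}

// True when the caller must be re-challenged before a money-moving action.
function needsStepUp(s: SessionRecord, policy: StepUpPolicy, now: number): boolean {
  const fresh = now - s.authTime <= policy.maxAuthAgeMs;
  return !(fresh && s.mfaLevel >= policy.minMfaLevel);
}
```

Keeping both checks free of I/O means they can run against a cached `{principal, expiresAt}` entry as easily as against a database row.
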
7.3 RBAC → ABAC model

Today: 3 coarse org roles. Target: roles for coarse gating + attribute/policy for fine decisions.

Principal{ userId, tenantId, realm, roles[], mfaLevel, deviceTrust, authTime, approvalLimit }
Resource{ type, tenantId, ownerId, requestedBy, amount, status }
Policy(action, principal, resource, env) → permit | deny
  e.g. permit "loan.disburse" IF principal.roles ∩ {owner, finance} ≠ ∅
        AND principal.tenantId == resource.tenantId
        AND principal.mfaLevel >= 2
        AND resource.amount <= principal.approvalLimit
        AND NOT sameActor(principal, resource.requestedBy)   ← separation of duties

Engine options: Ory Keto (ReBAC), OPA/Cedar (policy-as-code). Keep the AuthorizationService.can(...) port so the engine is swappable. Separation-of-duties (maker/checker) is a regulatory must for payroll disbursement and manual ledger adjustment — coarse roles can't express it.
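
Before adopting an engine, the example policy can live behind the port as plain code. The sketch below is one possible in-process implementation of `can(...)`, written under assumptions: the `ctx` argument is omitted for brevity, `approvalLimit` and `requestedBy` are taken from the example rule rather than from any existing schema, and unknown actions are denied by default.

```typescript
interface Principal {
  userId: string;
  tenantId: string;
  roles: string[];
  mfaLevel: number;
  approvalLimit: number; // assumed attribute, taken from the example rule
}

interface Resource {
  type: string;
  tenantId: string;
  amount: number;
  requestedBy: string; // userId of the maker
}

const DISBURSE_ROLES = ["owner", "finance"];

// Default-deny: any action without an explicit rule returns false.
function can(principal: Principal, action: string, resource: Resource): boolean {
  switch (action) {
    case "loan.disburse":
      return (
        principal.roles.some((r) => DISBURSE_ROLES.includes(r)) &&
        principal.tenantId === resource.tenantId &&
        principal.mfaLevel >= 2 &&
        resource.amount <= principal.approvalLimit &&
        principal.userId !== resource.requestedBy // separation of duties
      );
    default:
      return false;
  }
}
```

The last condition is the maker/checker rule: the requester can hold the right role and still be refused for their own request.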

7.4 Organization / employee–business relationship model

Keep the elegant invariant: Organization.id == Business.id, Member(user, org, role). Extend:

An employee belongs to a Business via Employee profile and may have a Member row (for self-service). Distinguish "employee of" (payroll subject) from "member of" (admin actor) — they are different relationships and must not be conflated in authorization.
A user may be a Member of several orgs (accountant serving multiple employers); activeOrganizationId gates the request tenant, RLS enforces at DB. This already works — preserve it.

7.5 Audit attribution model (compliance-grade)

Introduce an append-only, tamper-evident AuthEvent log, separate from mutable Session, written in the same spirit as the financial outbox/audit (ADR-008):

AuthEvent{ id, ts, actorUserId, onBehalfOf?, realm, tenantId?, eventType,
           ip, deviceId, mfaLevel, result, requestId, prevHash, hash }
  eventTypes: LOGIN, LOGOUT, MFA_ENROLL, MFA_CHALLENGE, ROLE_GRANT, ROLE_REVOKE,
              ADMIN_ACTION, IMPERSONATION_START/END, PASSWORD_RESET, KEY_ROTATION

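Each event's hash covers its own fields plus prevHash, so editing or removing an earlier event breaks every later link and tampering becomes detectable. This is the artifact a regulator/bank asks for and the current system lacks (R6).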
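
A small sketch of how the chain could be built and verified, using Node's built-in `crypto` module. The event is reduced to four fields for brevity, and `GENESIS`, `hashEvent`, `appendEvent`, and `verifyChain` are hypothetical names; a real log would hash every AuthEvent field and persist to append-only storage.

```typescript
import { createHash } from "node:crypto";

interface AuthEventInput {
  ts: string;
  actorUserId: string;
  eventType: string;
  result: string;
}

interface AuthEvent extends AuthEventInput {
  prevHash: string;
  hash: string;
}

// Assumed sentinel used as prevHash for the first event.
const GENESIS = "0".repeat(64);

// Fields are serialized in a fixed order so the digest is deterministic.
function hashEvent(input: AuthEventInput, prevHash: string): string {
  const canonical = JSON.stringify([
    input.ts, input.actorUserId, input.eventType, input.result, prevHash,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Returns a new log with the event linked to the previous entry.
function appendEvent(log: AuthEvent[], input: AuthEventInput): AuthEvent[] {
  const last = log[log.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...log, { ...input, prevHash, hash: hashEvent(input, prevHash) }];
}

// Recomputes every link; any edited or removed earlier entry fails the check.
function verifyChain(log: AuthEvent[]): boolean {
  let prev = GENESIS;
  for (const e of log) {
    if (e.prevHash !== prev || e.hash !== hashEvent(e, prev)) return false;
    prev = e.hash;
  }
  return true;
}
```

A chain on its own does not reveal truncation of the newest entries; periodically recording the latest hash somewhere outside the log would cover that case.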

7.6 Machine-to-machine auth (fixes R1)

API ↔ ledger / integration-gateway: mTLS (SPIFFE/SPIRE workload identity or, minimally, a private CA issuing per-service certs) + a short-lived service token in gRPC metadata. Replace every credentials.createInsecure(). This is non-negotiable for "ledger is the money truth."
Authorization on the ledger side: the ledger must verify which service is calling and reject unexpected callers (defense in depth — don't assume network isolation).
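
The caller check on the ledger side can be as small as a per-method allowlist keyed by workload identity. The sketch below is written in TypeScript for consistency with the rest of this document even though the ledger itself is a Go service; the method names and SPIFFE IDs are hypothetical, and extracting the caller ID from the verified peer certificate is assumed to happen before this function is called.

```typescript
// Hypothetical allowlist: RPC method → workload identities allowed to call it.
const ALLOWED_CALLERS: Record<string, ReadonlySet<string>> = {
  "ledger.PostEntry": new Set(["spiffe://demozpay.internal/api"]),
  "ledger.GetBalance": new Set([
    "spiffe://demozpay.internal/api",
    "spiffe://demozpay.internal/integration-gateway",
  ]),
};

// callerId is the identity taken from the verified mTLS peer certificate,
// or null when the connection presented none.
function isCallerAllowed(method: string, callerId: string | null): boolean {
  if (callerId === null) return false;
  const allowed = ALLOWED_CALLERS[method];
  // Methods without an entry are rejected (default-deny).
  return allowed !== undefined && allowed.has(callerId);
}
```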

7.7 Bank-partner auth

Inbound webhooks: keep HMAC, add a nonce/jti store (Redis with TTL = skew window) to make replays single-use (fixes R7). Per-partner signing keys in the secrets manager, rotatable.
Outbound to banks: mTLS + per-partner client credentials (OAuth client-credentials or signed requests), partner credentials never in env files — in a secrets manager (§7.13) with rotation.

7.8 Webhook auth (general)

Standardize the ${timestamp}.${nonce}.${body} HMAC + nonce-store + per-sender key as the one webhook auth primitive (the code comment already anticipates reuse). Document it as a control in SECURITY_CONTROLS.md.
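
The primitive can be sketched with Node's built-in `crypto` module. Everything named here is an assumption for illustration: the 5-minute skew window, the hex signature encoding, and `InMemoryNonceStore`, which stands in for the Redis store. The signature is checked before the nonce is claimed so that unauthenticated senders cannot burn nonces.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed acceptance window; the real value is a policy decision.
const MAX_SKEW_MS = 5 * 60 * 1000;

interface WebhookMessage {
  timestamp: number; // sender clock, epoch ms
  nonce: string;
  body: string;
  signatureHex: string;
}

// In-memory stand-in for a Redis nonce store with per-key expiry.
class InMemoryNonceStore {
  private readonly expiryByNonce = new Map<string, number>();

  // Returns true only the first time a nonce is claimed before it expires.
  claim(nonce: string, expiresAt: number, now: number): boolean {
    this.expiryByNonce.forEach((exp, n) => {
      if (exp < now) this.expiryByNonce.delete(n);
    });
    if (this.expiryByNonce.has(nonce)) return false;
    this.expiryByNonce.set(nonce, expiresAt);
    return true;
  }
}

function signWebhook(key: string, timestamp: number, nonce: string, body: string): Buffer {
  return createHmac("sha256", key).update(`${timestamp}.${nonce}.${body}`).digest();
}

function verifyWebhook(
  key: string,
  msg: WebhookMessage,
  store: InMemoryNonceStore,
  now: number,
): boolean {
  // 1. Reject timestamps outside the skew window.
  if (Math.abs(now - msg.timestamp) > MAX_SKEW_MS) return false;
  // 2. Constant-time signature comparison (lengths must match first).
  const expected = signWebhook(key, msg.timestamp, msg.nonce, msg.body);
  const provided = Buffer.from(msg.signatureHex, "hex");
  if (provided.length !== expected.length) return false;
  if (!timingSafeEqual(provided, expected)) return false;
  // 3. Single use: keep the nonce for as long as its timestamp stays acceptable.
  return store.claim(msg.nonce, msg.timestamp + MAX_SKEW_MS, now);
}
```

Expiring each nonce at `timestamp + MAX_SKEW_MS` keeps it in the store for exactly as long as a replay of that message could still pass the timestamp check.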

7.9 MFA architecture (fixes R8, R2)

Wire the Better Auth TOTP plugin now (table already exists). Backup codes hashed.
Mandatory MFA for Realm 3 (operators) and for employer/FI admins doing money operations.
Step-up for high-risk consumer actions (large EWA, loan acceptance) even on phone-OTP accounts (re-OTP or passkey).
Roadmap to WebAuthn/passkeys for admins (phishing-resistant) — the real win for a fintech.

7.10 KYC linkage

KYC is an identity-assurance attribute, not an auth gate: Principal.identityAssurance ∈ {none, phone, kyc-basic, kyc-full, sanctions-cleared}. Authorization policies reference it (e.g. loan disbursement requires kyc-full + sanctions-cleared). Keep KYC state on the domain side (Business.kycVerified, future per-employee KYC), surface it as a claim. Future AML/sanctions screening becomes another assurance attribute + an AuthEvent/audit record.
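
One way to let policies reference assurance is to treat the values as an ordered scale. That ordering is an assumption of this sketch (the text lists the values but does not say that `sanctions-cleared` implies `kyc-full`), and the action names and required levels in `REQUIRED_ASSURANCE` are hypothetical apart from the loan-disbursement example.

```typescript
// Assumed ordering, weakest to strongest.
const ASSURANCE_ORDER = ["none", "phone", "kyc-basic", "kyc-full", "sanctions-cleared"] as const;
type IdentityAssurance = (typeof ASSURANCE_ORDER)[number];

function meetsAssurance(actual: IdentityAssurance, required: IdentityAssurance): boolean {
  return ASSURANCE_ORDER.indexOf(actual) >= ASSURANCE_ORDER.indexOf(required);
}

// Hypothetical per-action requirements, read by the authorization policy.
const REQUIRED_ASSURANCE: Record<string, IdentityAssurance> = {
  "ewa.request": "phone",
  "loan.disburse": "sanctions-cleared",
};

// Actions without a declared requirement are denied rather than waved through.
function assuranceAllows(action: string, actual: IdentityAssurance): boolean {
  const required = REQUIRED_ASSURANCE[action];
  if (required === undefined) return false;
  return meetsAssurance(actual, required);
}
```

If screening and KYC turn out to progress independently, a set of attained attributes would model that better than a single scale.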

7.11 Device trust

Record deviceId (cookie/secure-storage bound) on first auth; treat new device as lower trust → step-up. Feed deviceTrust into ABAC. Especially valuable for the consumer phone-OTP realm where SIM-swap is a real Ethiopian-market threat.

7.12 Admin / operator auth + support impersonation controls (fixes R2)

Operators in Realm 3, MFA-mandatory, ideally hardware keys.
No standing cross-tenant superuser. Replace the unconditional checkType==='PLATFORM' bypass with scoped, time-boxed, dual-controlled grants: an operator requests access to tenant X for a reason; it's approved (or auto-approved for low-risk read), expires, and is fully audited.
Impersonation ("support as user"): explicit IMPERSONATION_START/END AuthEvents, banner in UI, read-mostly by default, write requires elevated approval, never silent. The current code has no impersonation concept — design it before someone builds an ad-hoc backdoor.
Reconciliation operator controls: reconciliation/ledger-adjustment is the highest-risk operator surface. Maker/checker (dual control), every adjustment an AuthEvent + financial audit entry, scoped role distinct from general platform-admin.
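
The scoped, time-boxed, dual-controlled grant described above could be checked along these lines. `AccessGrant` and `isGrantUsable` are hypothetical names, and representing auto-approval as `approvedBy: null` is an assumption of the sketch rather than an existing schema.

```typescript
interface AccessGrant {
  operatorId: string;
  tenantId: string;
  reason: string;
  scope: "read" | "write";
  expiresAt: number;         // epoch ms
  approvedBy: string | null; // null = auto-approved (low-risk read only)
}

function isGrantUsable(
  grant: AccessGrant,
  operatorId: string,
  tenantId: string,
  wantsWrite: boolean,
  now: number,
): boolean {
  // Scoped: the grant covers one operator and one tenant.
  if (grant.operatorId !== operatorId || grant.tenantId !== tenantId) return false;
  // Time-boxed: expired grants are dead.
  if (now >= grant.expiresAt) return false;
  // Attributable: a reason is mandatory.
  if (grant.reason.trim() === "") return false;
  if (wantsWrite && grant.scope !== "write") return false;
  // Auto-approval is acceptable only for read scope.
  if (grant.approvedBy === null) return grant.scope === "read";
  // Dual control: the approver must be someone other than the requester.
  return grant.approvedBy !== grant.operatorId;
}
```

A guard that replaces the unconditional platform bypass would look up the operator's grants for the target tenant and call this check, emitting an AuthEvent either way.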

7.13 Secrets-management interaction

BETTER_AUTH_SECRET, BANK_WEBHOOK_SIGNING_KEY, per-partner keys, service mTLS certs → a secrets manager (Vault / cloud KMS), not env files. Rotation runbooks. Key rotation emits KEY_ROTATION AuthEvents.
Session-signing/secret rotation must support overlap (old+new valid during rotation) to avoid mass logout.
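
Overlap during rotation amounts to verifying against both the current and the previous key while signing only with the current one. The sketch below illustrates the idea with a generic HMAC and Node's built-in `crypto` module; `KeyRing`, `rotate`, and `endOverlap` are hypothetical names, and this is not a description of how Better Auth handles its own secret.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface KeyRing {
  current: string;   // used for all new signatures
  previous?: string; // still accepted for verification during the overlap
}

function sign(value: string, key: string): string {
  return createHmac("sha256", key).update(value).digest("hex");
}

function matches(value: string, signatureHex: string, key: string): boolean {
  const expected = Buffer.from(sign(value, key), "hex");
  const provided = Buffer.from(signatureHex, "hex");
  return provided.length === expected.length && timingSafeEqual(provided, expected);
}

// Accepts signatures made with either key while the overlap lasts.
function verifyWithOverlap(value: string, signatureHex: string, ring: KeyRing): boolean {
  if (matches(value, signatureHex, ring.current)) return true;
  return ring.previous !== undefined && matches(value, signatureHex, ring.previous);
}

// Start of rotation: the old key moves to "previous".
function rotate(ring: KeyRing, newKey: string): KeyRing {
  return { current: newKey, previous: ring.current };
}

// End of the overlap window: the old key stops being accepted.
function endOverlap(ring: KeyRing): KeyRing {
  return { current: ring.current };
}
```

Sessions signed before the rotation keep verifying until `endOverlap` runs, which is what avoids the mass logout.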

7.14 Service-to-service auth (internal)

Every internal call (API↔Go services, future domain services) authenticates with workload identity (mTLS/SPIFFE). Zero implicit trust from "same network." The ledger, as money truth, authorizes its callers.

7.15 Zero-trust direction

Target end-state, incrementally:

No network position grants trust; every hop authenticates (mTLS) and authorizes (policy).
Every principal (human or machine) carries a verifiable identity + assurance level.
Every privileged action is least-privilege, time-boxed, and tamper-evidently audited.
Money-risk drives identity-assurance and MFA-freshness requirements (risk-adaptive auth).

Target architecture diagram

                         ┌────────────── IDENTITY PLANE ──────────────┐
 Consumers (phone-OTP) ──▶ Realm 1 ─┐                                  │
 Employer/FI admins ─────▶ Realm 2 ─┤  IdentityProvider PORT           │  ← domain code depends on PORT,
 Operators (MFA+keys) ───▶ Realm 3 ─┤  (better-auth adapter today,     │     never on the vendor
 Services / partners ────▶ Realm 4 ─┘   ory/workos adapter later)      │
                         └───────────────────┬────────────────────────┘
                                             │ Principal{ id, tenant, roles, mfaLevel, assurance, device }
                         ┌───────────────────▼────────────────────────┐
                         │ AuthN gate → AuthZ (ABAC/ReBAC policy) →     │
                         │ tenant ALS → RLS  (+ step-up on money risk)  │
                         └───────────────────┬────────────────────────┘
                                             │ every privileged action →
                         ┌───────────────────▼────────────────────────┐
                         │ AuthEvent log (append-only, hash-chained)   │  ← regulator/bank artifact
                         └─────────────────────────────────────────────┘
   API ◀── mTLS + svc-token ──▶ ledger (Go)        partner banks ◀── mTLS + HMAC(+nonce store) ──▶ gateway
   secrets/keys ◀── Vault/KMS (rotation, KEY_ROTATION events) ──▶ all of the above

This is the destination. You do not build it all now — you build the ports now (§6), close the 🔴 risks (matrix R1–R4) next, and grow into the rest at the scale-up gate. Sequencing is in AUTH_MIGRATION_STRATEGY.md.
