
LONG_TERM_IAM_ARCHITECTURE.md — DemozPay

Companion to AUTH_SYSTEM_REVIEW.md and AUTH_RISK_MATRIX.md. Covers: the 4-option comparison (Task 4), the keep/migrate/fork/replace/isolate decision (Task 6), and the ideal target IAM (Task 7).


4. The four options, compared

Scoring 1–5 (5 = best for a regulated multi-tenant fintech).

| Criterion | A: Keep + harden | B: Fork Better Auth | C: Custom from scratch | D: Enterprise IAM (Ory / Keycloak / Auth0) |
|---|---|---|---|---|
| Architecture fit (modular monolith, TS) | 5 | 4 | 4 | 3 (adds an external dependency / network hop) |
| Fintech suitability | 3 | 3 | 2 | 5 |
| Security posture | 3 | 3 | 2 | 5 |
| Auditability | 2 | 3 (you can add hooks) | 3 (you own it) | 5 |
| Tenancy support | 4 (org plugin + RLS) | 4 | 3 | 4 (Keycloak realms / Ory projects need mapping) |
| RBAC flexibility | 2 (3 coarse roles) | 3 | 5 (anything you build) | 5 (Ory Keto / Keycloak authz) |
| Scalability | 3 (DB sessions) | 3 | 3 | 5 |
| Operational complexity | 5 (lowest: it's in-process) | 3 | 2 | 2 (run/patch an IAM cluster) |
| Lock-in risk | 4 (mitigated by abstraction) | 3 (own fork) | 5 (none) | 2 (Auth0) / 4 (Ory/Keycloak OSS) |
| Migration complexity (to adopt) | 5 (already there) | 4 | 1 | 2 |
| Engineering burden (to reach target) | Low (weeks of hardening) | Med-High (own + maintain fork) | Very High (6–9 mo to parity) | Med-High (2–4 mo migration) |
| Regulator perception | Neutral | Neutral | Negative | Positive |
| Partner-bank perception | Neutral→Positive after hardening | Neutral | Negative | Positive |
| Recommended stage | Pilot + post-pilot | Avoid unless forced | Avoid | Scale-up / enterprise (triggered) |

Narrative per option

A — Keep Better Auth and harden around it. ✅ Recommended now. Lowest risk, fastest, in-process (no new infra). The catch: you must consciously harden the surrounding controls (S2S auth, admin MFA, rate limiting, OTP provider, auth-event log) — none of which Better Auth does for you, and none of which a different library would either. Pair with the isolation abstraction (§6) so A is not a dead end.

B — Fork Better Auth internally. ⚠️ Avoid unless forced. You'd fork only to (a) patch a CVE upstream won't, or (b) hold back a breaking change. Both are reactive triggers, not a strategy. A fork means you own security patches forever and lose easy upstream upgrades. Keep it in your back pocket as a contingency, not a plan.

C — Build custom auth from scratch. ❌ Avoid. Six-to-nine months to reach parity with what you already run, re-opening every account-takeover bug the ecosystem solved. "We rolled our own auth" is a negative signal to bank TPRM teams. The only thing you should custom-build is the thin abstraction glue and the auth-event log — never the primitives. See review §3.

D — Move to a dedicated IAM. ✅ The eventual destination, when triggered.

  • Ory (Kratos + Hydra + Keto) — best architectural fit: API-first, self-hosted, Go (matches your two-language ceiling on the infra side), Keto gives real ABAC/ReBAC. Highest operational cost but lowest lock-in.
  • Keycloak — mature, batteries-included, realms ≈ tenants, strong OIDC/SAML for employer SSO. JVM operational footprint; realm-per-tenant scaling needs care.
  • Auth0 / WorkOS — fastest to enterprise SSO + SCIM, best for "land an enterprise employer this quarter." Highest lock-in and per-MAU cost — dangerous for a high-volume, low-ARPU emerging-market user base (salary workers). Consider WorkOS/Auth0 for the B2B employer-admin realm and self-hosted for the consumer/employee realm.

Decision triggers that should flip A → D (any one):

  1. First partner-bank security questionnaire demanding attested IAM / SSO / SCIM.
  2. First enterprise employer demanding SAML/OIDC SSO for their admins.
  3. Multi-region requirement (session federation pain, R14).
  4. MFA/identity-assurance requirements (e.g. NBE/regulator) beyond what plugins cover.
  5. RBAC needs exceed coarse roles (separation-of-duties across payroll/lending/reconciliation).

6. The decision: keep, migrate, fork, replace, or isolate?

Decision: ISOLATE NOW, KEEP for pilot + post-pilot, MIGRATE later only on a trigger. Do not fork. Do not rewrite.

Concretely:

  • Keep Better Auth as the credential + session + org engine through pilot and early production.
  • Isolate it behind a thin internal IdentityProvider port immediately (this is the single most important architectural move — it makes A reversible and de-risks every future option). This is consistent with the repo's own ports-and-adapters discipline ([[project_ewa_canonical_wiring]] pattern).
  • Migrate to a dedicated IAM (likely Ory, plus WorkOS/Auth0 for enterprise-employer SSO) at the scale-up gate, only when a concrete trigger above fires.
  • Do not fork — reactive contingency only.
  • Do not rewrite — no requirement justifies the risk.

The isolation abstraction (build this now)

The goal: domain + application code never imports better-auth and never reads a Better-Auth-shaped object. They depend on your contracts. Today the leak points are session.middleware.ts (imports better-auth/node) and better-auth.factory.ts. Quarantine both behind:

packages/shared/identity/            ← NEW: vendor-neutral identity contracts (no better-auth import)
    IdentityProvider (port)          resolveSession(headers) → Principal | null
                                     issueChallenge / verifyOtp / ...
    Principal                        { userId, email, emailVerified, tenantId?, mfaLevel, deviceId? }
    AuthorizationService (port)      can(principal, action, resource, ctx) → boolean (ABAC-ready)
    AuthEventSink (port)             record(authEvent) ← immutable audit, IDP-independent

apps/api/src/identity/auth/
    better-auth.identity-provider.ts ← the ONLY file that imports better-auth (adapter)
    better-auth.factory.ts           (unchanged, hidden behind the adapter)

Everything else (SessionMiddleware, guards, controllers) depends on @demoz-pay/shared-identity, not on Better Auth. Swapping to Ory later = writing one new adapter (ory.identity-provider.ts) and flipping a binding. This converts the §D migration matrix from "rewrite" to "new adapter + reconciliation."
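As a rough illustration of the port-and-adapter split described above, the sketch below defines the Principal and IdentityProvider contracts from the listing and one adapter that maps a vendor-shaped session onto them. The VendorSession shape, the VendorLookup function type, the VendorIdentityProvider class, and the fixed mfaLevel of 0 are assumptions made for this example; they are not Better Auth's real types or API.

```typescript
// Vendor-neutral contracts, as they might live in packages/shared/identity/.
export interface Principal {
  userId: string;
  email: string;
  emailVerified: boolean;
  tenantId?: string;
  mfaLevel: number;
  deviceId?: string;
}

export interface IdentityProvider {
  resolveSession(headers: Record<string, string>): Promise<Principal | null>;
}

// HYPOTHETICAL vendor session shape, for illustration only.
export interface VendorSession {
  user: { id: string; email: string; emailVerified: boolean };
  session: { activeOrganizationId?: string | null };
}

// Pure mapping: the single place where a vendor shape becomes a Principal.
export function toPrincipal(v: VendorSession): Principal {
  const orgId = v.session.activeOrganizationId;
  return {
    userId: v.user.id,
    email: v.user.email,
    emailVerified: v.user.emailVerified,
    mfaLevel: 0, // assumption: MFA is not wired yet, so every session is level 0
    ...(orgId ? { tenantId: orgId } : {}),
  };
}

// HYPOTHETICAL lookup signature; in the real adapter file this would wrap the vendor call.
export type VendorLookup = (
  headers: Record<string, string>,
) => Promise<VendorSession | null>;

// Adapter: the rest of the codebase only ever sees IdentityProvider.
export class VendorIdentityProvider implements IdentityProvider {
  private readonly lookup: VendorLookup;

  constructor(lookup: VendorLookup) {
    this.lookup = lookup;
  }

  async resolveSession(headers: Record<string, string>): Promise<Principal | null> {
    const vendorSession = await this.lookup(headers);
    return vendorSession ? toPrincipal(vendorSession) : null;
  }
}
```

Under this shape, a later Ory adapter would be a second class implementing IdentityProvider with its own mapping function, and callers would not change.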


7. Ideal long-term fintech IAM architecture

Design principles: default-deny · least privilege · zero-trust between services · every privileged action is attributable and tamper-evident · identity assurance scales with money risk.

7.1 Identity provider architecture — multi-realm

Do not put salary-worker consumers, employer admins, and platform operators in one undifferentiated identity pool. Separate realms with separate policies:

┌─────────────────────────────────────────────────────────────────────────┐
│ REALM 1 — Consumer (employees) primary: phone-OTP, optional passkey │
│ assurance: phone-verified; step-up TOTP/passkey for money actions │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 2 — Employer / FI / Merchant admins email+password+MFA, or SSO │
│ enterprise tenants: OIDC/SAML federation + SCIM provisioning │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 3 — Platform operators (DemozPay staff) MFA MANDATORY + step-up │
│ separate IDP/realm, hardware-key preferred, all actions = audit + dual- │
│ control for high-risk ops (KYC override, manual ledger adjustment) │
├─────────────────────────────────────────────────────────────────────────┤
│ REALM 4 — Machines (services, partners) mTLS + workload identity / OAuth │
│ client-credentials; NO human credentials ever │
└─────────────────────────────────────────────────────────────────────────┘

Behind the IdentityProvider port, realms can be served by different engines (e.g. self-hosted Ory Kratos for Realm 1+3, WorkOS/Auth0 for Realm 2 enterprise SSO) without the domain code knowing.

7.2 Session architecture

  • Consumer/admin web: opaque, server-side, revocable sessions (what Better Auth gives you) — keep. Add a short-TTL session cache (Redis) in front of the DB lookup to kill R9; cache holds {principal, expiresAt}, invalidated on logout/revoke.
  • Idle + absolute timeouts: distinct (e.g. 30-min idle, 7-day absolute) — money apps should not have week-long idle sessions.
  • Step-up: a money-moving action requires a session whose mfaLevel/authTime satisfies a freshness + MFA policy; otherwise re-challenge. Encode as a claim on the Principal.
  • "Log out everywhere" + device list: session rows already carry ipAddress/userAgent; expose management UI.
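The timeout and step-up rules above could be expressed as two small pure checks. This is a minimal sketch: the SessionRecord and StepUpPolicy shapes and the function names are hypothetical, and the 30-minute and 7-day values are simply the examples quoted in the list.

```typescript
// HYPOTHETICAL session record; all times are epoch milliseconds.
interface SessionRecord {
  createdAt: number; // when the session was first issued
  lastSeenAt: number; // last request observed on this session
  authTime: number; // last full (re)authentication
  mfaLevel: number; // 0 = none; higher = stronger factor
}

const IDLE_TIMEOUT_MS = 30 * 60 * 1000; // 30-minute idle limit (example value)
const ABSOLUTE_TIMEOUT_MS = 7 * 24 * 60 * 60 * 1000; // 7-day absolute limit (example value)

// A session is live only if it passes BOTH the idle and the absolute check.
function isSessionLive(s: SessionRecord, now: number): boolean {
  const idleOk = now - s.lastSeenAt <= IDLE_TIMEOUT_MS;
  const absoluteOk = now - s.createdAt <= ABSOLUTE_TIMEOUT_MS;
  return idleOk && absoluteOk;
}

// HYPOTHETICAL per-action policy: minimum MFA level and maximum auth age.
interface StepUpPolicy {
  minMfaLevel: number;
  maxAuthAgeMs: number;
}

// True when the caller must be re-challenged before a money-moving action.
function needsStepUp(s: SessionRecord, policy: StepUpPolicy, now: number): boolean {
  const mfaTooWeak = s.mfaLevel < policy.minMfaLevel;
  const authTooOld = now - s.authTime > policy.maxAuthAgeMs;
  return mfaTooWeak || authTooOld;
}
```

Keeping both checks free of I/O means the same functions can run against a cached session entry or a database row.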

7.3 RBAC → ABAC model

Today: 3 coarse org roles. Target: roles for coarse gating + attribute/policy for fine decisions.

Principal{ userId, tenantId, realm, roles[], mfaLevel, deviceTrust, authTime }
Resource{ type, tenantId, ownerId, amount, status }
Policy(action, principal, resource, env) → permit | deny
  e.g. permit "loan.disburse" IF principal.role in {owner,finance}
                              AND principal.tenantId == resource.tenantId
                              AND principal.mfaLevel >= 2
                              AND resource.amount <= principal.approvalLimit
                              AND NOT sameActor(principal, resource.requestedBy) ← separation of duties

Engine options: Ory Keto (ReBAC), OPA/Cedar (policy-as-code). Keep the AuthorizationService.can(...) port so the engine is swappable. Separation-of-duties (maker/checker) is a regulatory must for payroll disbursement and manual ledger adjustment — coarse roles can't express it.
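The "loan.disburse" rule above translates fairly directly into a default-deny function. The sketch below is one possible reading, not a chosen engine: the AbacPrincipal and LoanResource types are hypothetical, and the approvalLimit and requestedBy fields are assumed here because the rule references them even though the Principal and Resource shapes above do not list them.

```typescript
type Decision = "permit" | "deny";

// HYPOTHETICAL shapes: approvalLimit and requestedBy are assumed fields.
interface AbacPrincipal {
  userId: string;
  tenantId: string;
  roles: string[];
  mfaLevel: number;
  approvalLimit: number;
}

interface LoanResource {
  type: "loan";
  tenantId: string;
  amount: number;
  requestedBy: string;
}

// Default-deny: every condition must hold, otherwise the answer is "deny".
function decideLoanDisburse(p: AbacPrincipal, r: LoanResource): Decision {
  const hasRole = p.roles.some((role) => role === "owner" || role === "finance");
  const sameTenant = p.tenantId === r.tenantId;
  const mfaOk = p.mfaLevel >= 2;
  const withinLimit = r.amount <= p.approvalLimit;
  const differentActor = p.userId !== r.requestedBy; // separation of duties (maker/checker)

  return hasRole && sameTenant && mfaOk && withinLimit && differentActor
    ? "permit"
    : "deny";
}
```

An AuthorizationService.can(...) implementation could dispatch on the action name to functions like this one until a policy engine replaces them.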

7.4 Organization / employee–business relationship model

Keep the elegant invariant: Organization.id == Business.id, Member(user, org, role). Extend:

  • An employee belongs to a Business via Employee profile and may have a Member row (for self-service). Distinguish "employee of" (payroll subject) from "member of" (admin actor) — they are different relationships and must not be conflated in authorization.
  • A user may be a Member of several orgs (accountant serving multiple employers); activeOrganizationId gates the request tenant, RLS enforces at DB. This already works — preserve it.

7.5 Audit attribution model (compliance-grade)

Introduce an append-only, tamper-evident AuthEvent log, separate from mutable Session, written in the same spirit as the financial outbox/audit (ADR-008):

AuthEvent{ id, ts, actorUserId, onBehalfOf?, realm, tenantId?, eventType,
           ip, deviceId, mfaLevel, result, requestId, prevHash, hash }
eventTypes: LOGIN, LOGOUT, MFA_ENROLL, MFA_CHALLENGE, ROLE_GRANT, ROLE_REVOKE,
            ADMIN_ACTION, IMPERSONATION_START/END, PASSWORD_RESET, KEY_ROTATION

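Hash-chaining makes tampering detectable: each event's hash covers its own fields plus prevHash, so altering or removing an earlier event breaks verification of every later one. This is the artifact a regulator/bank asks for and the current system lacks (R6).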
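A minimal sketch of how such a chain might be appended to and verified, using SHA-256 from Node's crypto module. Only a subset of the AuthEvent fields is included, and the GENESIS value, the field ordering used for hashing, and the function names are assumptions made for this example.

```typescript
import { createHash } from "node:crypto";

// Subset of the AuthEvent fields, kept small for the example.
interface AuthEventInput {
  ts: string;
  actorUserId: string;
  eventType: string;
  result: string;
}

interface ChainedAuthEvent extends AuthEventInput {
  prevHash: string;
  hash: string;
}

// Assumed sentinel for the first event's prevHash.
const GENESIS = "0".repeat(64);

// Hash the event fields together with prevHash, in a fixed order so the digest is deterministic.
function hashEvent(e: AuthEventInput, prevHash: string): string {
  const payload = JSON.stringify([e.ts, e.actorUserId, e.eventType, e.result, prevHash]);
  return createHash("sha256").update(payload).digest("hex");
}

// Append-only: returns a new array and never mutates earlier entries.
function appendEvent(log: readonly ChainedAuthEvent[], e: AuthEventInput): ChainedAuthEvent[] {
  const last = log[log.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  return [...log, { ...e, prevHash, hash: hashEvent(e, prevHash) }];
}

// Recompute every link; any edited or removed earlier event makes this return false.
function verifyChain(log: readonly ChainedAuthEvent[]): boolean {
  let prev = GENESIS;
  for (const e of log) {
    if (e.prevHash !== prev || e.hash !== hashEvent(e, prev)) return false;
    prev = e.hash;
  }
  return true;
}
```

A chain like this only detects truncation of the tail if the latest hash is also anchored somewhere the writer cannot modify; that anchoring is outside this sketch.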

7.6 Machine-to-machine auth (fixes R1)

  • API ↔ ledger / integration-gateway: mTLS (SPIFFE/SPIRE workload identity or, minimally, a private CA issuing per-service certs) + a short-lived service token in gRPC metadata. Replace every credentials.createInsecure(). This is non-negotiable for "ledger is the money truth."
  • Authorization on the ledger side: the ledger must verify which service is calling and reject unexpected callers (defense in depth — don't assume network isolation).

7.7 Bank-partner auth

  • Inbound webhooks: keep HMAC, add a nonce/jti store (Redis with TTL = skew window) so each signed request is accepted at most once and replays are rejected (fixes R7). Per-partner signing keys live in the secrets manager and are rotatable.
  • Outbound to banks: mTLS + per-partner client credentials (OAuth client-credentials or signed requests), partner credentials never in env files — in a secrets manager (§7.13) with rotation.

7.8 Webhook auth (general)

Standardize the ${timestamp}.${nonce}.${body} HMAC + nonce-store + per-sender key as the one webhook auth primitive (the code comment already anticipates reuse). Document it as a control in SECURITY_CONTROLS.md.
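To make the primitive concrete, here is a sketch of signing and verifying the ${timestamp}.${nonce}.${body} string with HMAC-SHA256 and a nonce store. The in-memory NonceStore stands in for Redis, and the 5-minute skew, the hex encoding, and all function and field names are assumptions made for this example rather than the existing implementation.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const MAX_SKEW_MS = 5 * 60 * 1000; // assumed allowed clock skew

// HMAC-SHA256 over `${timestamp}.${nonce}.${body}`, hex-encoded.
function signWebhook(key: string, timestamp: number, nonce: string, body: string): string {
  return createHmac("sha256", key).update(`${timestamp}.${nonce}.${body}`).digest("hex");
}

// In-memory stand-in for a Redis nonce store with TTL.
class NonceStore {
  private readonly seen = new Map<string, number>(); // nonce -> expiry (epoch ms)

  // Returns true only the first time a nonce is claimed within its TTL.
  claim(nonce: string, now: number, ttlMs: number): boolean {
    for (const [n, expiry] of this.seen) {
      if (expiry <= now) this.seen.delete(n);
    }
    if (this.seen.has(nonce)) return false;
    this.seen.set(nonce, now + ttlMs);
    return true;
  }
}

interface WebhookMessage {
  timestamp: number; // epoch ms claimed by the sender
  nonce: string;
  body: string;
  signature: string; // hex HMAC
}

function verifyWebhook(key: string, store: NonceStore, msg: WebhookMessage, now: number): boolean {
  // 1. Reject anything outside the skew window.
  if (Math.abs(now - msg.timestamp) > MAX_SKEW_MS) return false;

  // 2. Constant-time signature comparison.
  const expected = Buffer.from(signWebhook(key, msg.timestamp, msg.nonce, msg.body), "hex");
  const given = Buffer.from(msg.signature, "hex");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return false;

  // 3. Claim the nonce last, so unsigned traffic cannot burn nonces.
  //    TTL spans the whole +/- skew window in which this timestamp stays acceptable.
  return store.claim(msg.nonce, now, 2 * MAX_SKEW_MS);
}
```

The nonce is claimed only after the signature check passes, so an unauthenticated sender cannot fill the store or invalidate a legitimate partner's nonce.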

7.9 MFA architecture (fixes R8, R2)

  • Wire the Better Auth TOTP plugin now (table already exists). Backup codes hashed.
  • Mandatory MFA for Realm 3 (operators) and for employer/FI admins doing money operations.
  • Step-up for high-risk consumer actions (large EWA, loan acceptance) even on phone-OTP accounts (re-OTP or passkey).
  • Roadmap to WebAuthn/passkeys for admins (phishing-resistant) — the real win for a fintech.

7.10 KYC linkage

KYC is an identity-assurance attribute, not an auth gate: Principal.identityAssurance ∈ {none, phone, kyc-basic, kyc-full, sanctions-cleared}. Authorization policies reference it (e.g. loan disbursement requires kyc-full + sanctions-cleared). Keep KYC state on the domain side (Business.kycVerified, future per-employee KYC), surface it as a claim. Future AML/sanctions screening becomes another assurance attribute + an AuthEvent/audit record.

7.11 Device trust

Record deviceId (cookie/secure-storage bound) on first auth; treat new device as lower trust → step-up. Feed deviceTrust into ABAC. Especially valuable for the consumer phone-OTP realm where SIM-swap is a real Ethiopian-market threat.

7.12 Admin / operator auth + support impersonation controls (fixes R2)

  • Operators in Realm 3, MFA-mandatory, ideally hardware keys.
  • No standing cross-tenant superuser. Replace the unconditional checkType==='PLATFORM' bypass with scoped, time-boxed, dual-controlled grants: an operator requests access to tenant X for a reason; it's approved (or auto-approved for low-risk read), expires, and is fully audited.
  • Impersonation ("support as user"): explicit IMPERSONATION_START/END AuthEvents, banner in UI, read-mostly by default, write requires elevated approval, never silent. The current code has no impersonation concept — design it before someone builds an ad-hoc backdoor.
  • Reconciliation operator controls: reconciliation/ledger-adjustment is the highest-risk operator surface. Maker/checker (dual control), every adjustment an AuthEvent + financial audit entry, scoped role distinct from general platform-admin.
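The scoped, time-boxed, dual-controlled grant described above might be modelled as a small record plus one check. This is a sketch under stated assumptions: the AccessGrant shape and the isGrantUsable function are hypothetical, and treating read grants as auto-approved is one reading of "auto-approved for low-risk read".

```typescript
type GrantScope = "read" | "write";

// HYPOTHETICAL grant record replacing a standing cross-tenant bypass.
interface AccessGrant {
  operatorId: string;
  tenantId: string;
  scope: GrantScope;
  reason: string;
  expiresAt: number; // epoch ms; grants are always time-boxed
  approvedBy?: string; // absent until a second person approves
}

function isGrantUsable(
  grant: AccessGrant,
  operatorId: string,
  tenantId: string,
  wanted: GrantScope,
  now: number,
): boolean {
  // Scoped: the grant names exactly one operator and one tenant.
  if (grant.operatorId !== operatorId || grant.tenantId !== tenantId) return false;
  // Time-boxed: expired grants are never honoured.
  if (now >= grant.expiresAt) return false;
  // A read grant cannot be used for writes.
  if (wanted === "write" && grant.scope !== "write") return false;
  // Dual control: write grants need an approver who is not the requester.
  if (grant.scope === "write") {
    return grant.approvedBy !== undefined && grant.approvedBy !== grant.operatorId;
  }
  // Assumption: low-risk read grants are auto-approved.
  return true;
}
```

Each grant request, approval, and use would also be recorded as an AuthEvent; that recording is left out here.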

7.13 Secrets-management interaction

  • BETTER_AUTH_SECRET, BANK_WEBHOOK_SIGNING_KEY, per-partner keys, service mTLS certs → a secrets manager (Vault / cloud KMS), not env files. Rotation runbooks. Key rotation emits KEY_ROTATION AuthEvents.
  • Session-signing/secret rotation must support overlap (old+new valid during rotation) to avoid mass logout.
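Rotation with overlap can be illustrated with a two-slot keyring: new values are signed with the current key, while values signed with the previous key still verify until that key is dropped. This is a generic HMAC sketch; the Keyring shape, the keyId.mac token format, and the function names are assumptions, and it does not describe how Better Auth itself handles its secret.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

interface SigningKey {
  id: string;
  secret: string;
}

// HYPOTHETICAL keyring: `previous` stays valid only during the overlap window.
interface Keyring {
  current: SigningKey;
  previous?: SigningKey;
}

function hmacHex(secret: string, value: string): string {
  return createHmac("sha256", secret).update(value).digest("hex");
}

// New tokens are always signed with the current key and carry its id.
function signWithCurrent(ring: Keyring, value: string): string {
  return `${ring.current.id}.${hmacHex(ring.current.secret, value)}`;
}

// Accepts tokens signed by the current key or, during overlap, the previous one.
function verifyWithKeyring(ring: Keyring, value: string, token: string): boolean {
  const dot = token.indexOf(".");
  if (dot < 0) return false;
  const keyId = token.slice(0, dot);
  const given = Buffer.from(token.slice(dot + 1), "hex");

  const candidates = ring.previous ? [ring.current, ring.previous] : [ring.current];
  const key = candidates.find((k) => k.id === keyId);
  if (!key) return false; // key already retired: the token is no longer valid

  const expected = Buffer.from(hmacHex(key.secret, value), "hex");
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Rotating then means promoting a new key to current, moving the old one to previous, and removing previous once the longest-lived signed value has expired.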

7.14 Service-to-service auth (internal)

Every internal call (API↔Go services, future domain services) authenticates with workload identity (mTLS/SPIFFE). Zero implicit trust from "same network." The ledger, as money truth, authorizes its callers.

7.15 Zero-trust direction

Target end-state, incrementally:

  • No network position grants trust; every hop authenticates (mTLS) and authorizes (policy).
  • Every principal (human or machine) carries a verifiable identity + assurance level.
  • Every privileged action is least-privilege, time-boxed, and tamper-evidently audited.
  • Money-risk drives identity-assurance and MFA-freshness requirements (risk-adaptive auth).

Target architecture diagram

                                   ┌────────────── IDENTITY PLANE ──────────────┐
Consumers (phone-OTP) ──▶ Realm 1 ─┐                                            │
Employer/FI admins ─────▶ Realm 2 ─┤ IdentityProvider PORT                      │ ← domain code depends on PORT,
Operators (MFA+keys) ───▶ Realm 3 ─┤ (better-auth adapter today,                │   never on the vendor
Services / partners ────▶ Realm 4 ─┘ ory/workos adapter later)                  │
                                   └───────────────────┬────────────────────────┘
                                                       │ Principal{ id, tenant, roles, mfaLevel, assurance, device }
                                   ┌───────────────────▼────────────────────────┐
                                   │ AuthN gate → AuthZ (ABAC/ReBAC policy) →   │
                                   │ tenant ALS → RLS (+ step-up on money risk) │
                                   └───────────────────┬────────────────────────┘
                                                       │ every privileged action →
                                   ┌───────────────────▼────────────────────────┐
                                   │ AuthEvent log (append-only, hash-chained)  │ ← regulator/bank artifact
                                   └────────────────────────────────────────────┘
API ◀── mTLS + svc-token ──▶ ledger (Go)      partner banks ◀── mTLS + HMAC(+nonce store) ──▶ gateway
secrets/keys ◀── Vault/KMS (rotation, KEY_ROTATION events) ──▶ all of the above

This is the destination. You do not build it all now — you build the ports now (§6), close the 🔴 risks (matrix R1–R4) next, and grow into the rest at the scale-up gate. Sequencing is in AUTH_MIGRATION_STRATEGY.md.