AUTH_MIGRATION_STRATEGY.md — DemozPay

Companion to AUTH_SYSTEM_REVIEW.md, AUTH_RISK_MATRIX.md, LONG_TERM_IAM_ARCHITECTURE.md. Two tracks: Track 1 — Better Auth stays (isolate + harden). Track 2 — Replace later (phased, triggered).

The chosen path is Track 1 now, Track 2 only on a trigger (triggers in LONG_TERM_IAM §4). Track 1 is also the prerequisite that makes Track 2 cheap — do it regardless.

Track 1 — Better Auth remains: isolate safely + avoid lock-in

1.1 What to abstract NOW (the lock-in firebreak)

Build packages/shared/identity with three vendor-neutral ports, and confine every better-auth import to the adapter layer (apps/api/src/identity/auth/*.adapter.ts):

| Port | Contract | Replaces today's coupling |
| --- | --- | --- |
| `IdentityProvider` | `resolveSession(headers) → Principal \| null`, `verifyOtp(...)`, `signOut(...)` | `session.middleware.ts` importing `better-auth/node`; `getSession()` calls |
| `AuthorizationService` | `can(principal, action, resource, ctx) → boolean` | hard-coded role checks in `org-role.guard.ts` |
| `AuthEventSink` | `record(authEvent)` (append-only, hash-chained) | (new: nothing today) |
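A minimal sketch of how the three ports might look in TypeScript. Only the port and method names come from the table; the fields on `Principal` and `AuthEvent`, the promise-returning signatures, and the `InMemoryIdentityProvider` stand-in are assumptions made for illustration, not the project's actual types.

```typescript
// Vendor-neutral identity ports. Field names are illustrative assumptions.
export interface Principal {
  userId: string;
  tenantId: string | null; // active organization, if any
  roles: readonly string[];
}

export interface AuthEvent {
  type: string; // e.g. "LOGIN", "ROLE_GRANT"
  actorId: string | null;
  occurredAt: Date;
  detail: Record<string, unknown>;
}

export type RequestHeaders = Record<string, string | undefined>;

export interface IdentityProvider {
  resolveSession(headers: RequestHeaders): Promise<Principal | null>;
  verifyOtp(phoneNumber: string, code: string): Promise<boolean>;
  signOut(sessionToken: string): Promise<void>;
}

export interface AuthorizationService {
  can(principal: Principal, action: string, resource: string, ctx?: Record<string, unknown>): boolean;
}

export interface AuthEventSink {
  record(event: AuthEvent): Promise<void>;
}

// Hypothetical in-memory adapter: shows that callers depend on the port only.
// A real adapter would wrap the vendor's session lookup instead of a Map.
export class InMemoryIdentityProvider implements IdentityProvider {
  private readonly sessions: Map<string, Principal>;

  constructor(sessions: Map<string, Principal>) {
    this.sessions = sessions;
  }

  async resolveSession(headers: RequestHeaders): Promise<Principal | null> {
    const token = headers["authorization"]?.replace(/^Bearer /, "");
    return token ? (this.sessions.get(token) ?? null) : null;
  }

  async verifyOtp(_phoneNumber: string, _code: string): Promise<boolean> {
    return false; // OTP verification is not modelled in this sketch
  }

  async signOut(sessionToken: string): Promise<void> {
    this.sessions.delete(sessionToken);
  }
}
```

Because domain code sees only `IdentityProvider`, replacing the vendor means providing another class with the same three methods and changing the DI binding.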

Rule to enforce in ESLint module boundaries: only apps/api/src/identity/auth/*.adapter.ts may import better-auth. Everything else imports @demoz-pay/shared-identity. This is the same boundary discipline ADR-011 already applies to domains — extend it to the vendor.
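One way the import rule could be expressed, assuming an ESLint flat config and the core `no-restricted-imports` rule. This is a sketch: the repo may enforce ADR-011 with a different boundary plugin, in which case the equivalent constraint belongs there.

```typescript
// eslint.config.ts (flat config), sketch only.
// Assumption: the core `no-restricted-imports` rule stands in for whatever
// boundary tooling already enforces ADR-011.
const restrictBetterAuth = {
  files: ["**/*.ts"],
  rules: {
    "no-restricted-imports": [
      "error",
      {
        patterns: [
          {
            group: ["better-auth", "better-auth/*"],
            message: "Import @demoz-pay/shared-identity instead; only *.adapter.ts may import better-auth.",
          },
        ],
      },
    ],
  },
};

// Later entries override earlier ones for matching files, so this re-opens
// the import for the adapter files only.
const allowInAdapters = {
  files: ["apps/api/src/identity/auth/*.adapter.ts"],
  rules: { "no-restricted-imports": "off" },
};

const config = [restrictBetterAuth, allowInAdapters];

export default config;
```

Turning the rule off for adapters also lifts any other restricted imports configured through the same rule, so a real config may prefer to repeat those restrictions in the adapter override.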

Result: a future swap is "write ory.identity-provider.ts, flip the DI binding" — not a domain rewrite.

1.2 Hardening backlog (close the 🔴/🟠 risks — see matrix)

Priority order = risk score:

R1 — S2S auth: mTLS + service token on all gRPC channels; remove createInsecure(). Ledger authorizes callers.
R3 — rate limiting + lockout on /api/auth/* (Better Auth rate-limit config + an edge/Express limiter, since the subtree is pre-Nest).
R4 — bring auth endpoints under shared controls: put a rate-limiter + WAF/observability in front of /api/auth/* at the Express/edge layer (it cannot ride Nest guards).
R2 — admin MFA + scope: wire TOTP, make MFA mandatory for platform admins, replace unconditional superuser bypass with scoped time-boxed grants + impersonation audit.
R8 — wire the TOTP plugin (table already exists).
R5 — real SMS provider (AfricasTalking/Ethio aggregator) behind the existing SmsSender port + OTP rate limit; alert on LoggingSmsSender in prod.
R6 — AuthEvent log (append-only, hash-chained) via AuthEventSink.
R7 — webhook nonce store (Redis, TTL = skew window).
R9 — session cache (short-TTL Redis) in the IdentityProvider adapter.
R10/R12 — guardrails: a unit/integration test asserting every non-@Public() money route carries an RBAC decorator and is in the middleware allow-list (or move to a default-applied middleware).
R11 — collapse the dual role model; R15 — password policy; R17 — delete legacy shared/auth/auth.ts.

None of these require leaving Better Auth. Most are additions around it.
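To make the R6 item concrete, here is a minimal sketch of an append-only, hash-chained event log. It keeps entries in memory and uses SHA-256 from Node's `crypto` module; the `AuthEvent` fields, the genesis value, and the canonical serialization are assumptions, and a production sink would persist entries in a store that forbids updates and deletes.

```typescript
import { createHash } from "node:crypto";

// Illustrative event shape; the real AuthEvent schema is not defined here.
interface AuthEvent {
  type: string;
  actorId: string | null;
  occurredAt: string; // ISO-8601 timestamp
  detail: Record<string, string>;
}

interface ChainedEntry {
  event: AuthEvent;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

// Serialize with a fixed field order and sorted detail keys so the same
// event always produces the same bytes.
function canonical(event: AuthEvent): string {
  const detail = Object.keys(event.detail).sort().map((key) => [key, event.detail[key]]);
  return JSON.stringify([event.type, event.actorId, event.occurredAt, detail]);
}

function hashEntry(prevHash: string, event: AuthEvent): string {
  return createHash("sha256").update(prevHash).update(canonical(event)).digest("hex");
}

class InMemoryAuthEventSink {
  private readonly entries: ChainedEntry[] = [];

  // Append only: each entry commits to the hash of the one before it.
  record(event: AuthEvent): void {
    const last = this.entries[this.entries.length - 1];
    const prevHash = last ? last.hash : GENESIS_HASH;
    this.entries.push({ event, prevHash, hash: hashEntry(prevHash, event) });
  }

  snapshot(): ChainedEntry[] {
    return this.entries.map((entry) => ({ ...entry, event: { ...entry.event } }));
  }
}

// Recompute every link; any edited, removed, or reordered entry breaks it.
function verifyChain(entries: ChainedEntry[]): boolean {
  let prev = GENESIS_HASH;
  for (const entry of entries) {
    if (entry.prevHash !== prev || entry.hash !== hashEntry(prev, entry.event)) return false;
    prev = entry.hash;
  }
  return true;
}
```

Running `verifyChain` on a schedule, and anchoring the latest hash somewhere outside the database, is what turns the chain into tamper evidence rather than just an extra column.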

Track 2 — Replace later: safest phased migration

Trigger-gated (LONG_TERM_IAM §4). When a trigger fires, execute the strangler pattern — never a big-bang cutover of money infrastructure.

Phase 0 — Prerequisite (done in Track 1)

Abstraction ports exist; domain code is vendor-neutral; AuthEvent log already independent of the IDP. Migration cannot safely start until this holds.

Phase 1 — Stand up the new IDP in shadow

Deploy the target IDP (e.g. Ory Kratos for consumer/operator realms; WorkOS/Auth0 for enterprise-employer SSO).
The new IDP runs read-only in shadow: no production traffic; identities are provisioned and synced from the existing User/Member records.
Write a second IdentityProvider adapter behind a feature flag that stays off in prod.

Phase 2 — Migrate credentials & identities

Passwords: do not force reset. Keep Account.password (bcrypt) readable; on first login against the new IDP, verify against the old hash, then rehash into the new store (lazy migration). Users never notice.
Phone/email + verification state: copy emailVerified, phoneNumberVerified.
Org/member: preserve Organization.id == Business.id; map Member(user, org, role) 1:1. Run a reconciliation job that diffs old vs new continuously and alerts on drift.
MFA: TOTP secrets migrate if format-compatible; otherwise prompt re-enrolment at next login (acceptable, security-positive).
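The lazy password migration described above can be sketched as follows. The store interfaces, function names, and return values are hypothetical, and the hash functions are injected so the sketch does not assume any particular bcrypt or new-IDP library.

```typescript
type VerifyFn = (plain: string, hash: string) => Promise<boolean>;
type HashFn = (plain: string) => Promise<string>;

// Hypothetical ports over the old Account.password column and the new store.
interface LegacyAccountStore {
  getLegacyHash(userId: string): Promise<string | null>;
}
interface NewCredentialStore {
  getHash(userId: string): Promise<string | null>;
  setHash(userId: string, hash: string): Promise<void>;
}

interface MigrationDeps {
  legacyStore: LegacyAccountStore;
  newStore: NewCredentialStore;
  verifyLegacy: VerifyFn; // e.g. a bcrypt compare
  verifyNew: VerifyFn;
  hashNew: HashFn;
}

type LoginResult = "ok" | "migrated" | "rejected";

async function loginWithLazyMigration(
  deps: MigrationDeps,
  userId: string,
  password: string,
): Promise<LoginResult> {
  // Already migrated: the new store is authoritative.
  const current = await deps.newStore.getHash(userId);
  if (current !== null) {
    return (await deps.verifyNew(password, current)) ? "ok" : "rejected";
  }

  // Not migrated yet: verify against the old hash, which stays read-only.
  const legacy = await deps.legacyStore.getLegacyHash(userId);
  if (legacy === null || !(await deps.verifyLegacy(password, legacy))) {
    return "rejected";
  }

  // The plaintext is only available at this moment, so rehash it now.
  await deps.newStore.setHash(userId, await deps.hashNew(password));
  return "migrated";
}
```

Accounts that never log in during the migration window keep only a legacy hash, so the legacy column cannot be dropped until those accounts are handled separately (for example with a reset on next login).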

Phase 3 — Dual-run sessions (backward compatibility)

SessionMiddleware's adapter dual-reads: accept a valid old Better Auth session OR a new IDP session. Both resolve to the same Principal.
New logins issue new sessions; old sessions remain valid until natural expiry (≤7 days). Expire, don't revoke — no mass re-login event.
Tenant context + RLS are unchanged — they live at the DB/ALS layer, decoupled from the IDP. This is the key reason the migration is low-risk: the money-isolation control never moves.
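A sketch of the dual-read adapter, assuming both IDPs are wrapped behind the same session-resolving interface. The class name, the `onLegacyRead` callback, and the choice to try the new IDP first are assumptions; the callback is there so the count of old-session reads can be observed before decommissioning.

```typescript
interface Principal {
  userId: string;
  tenantId: string | null;
  roles: readonly string[];
}

type RequestHeaders = Record<string, string | undefined>;

interface SessionResolver {
  resolveSession(headers: RequestHeaders): Promise<Principal | null>;
}

// Composite adapter: accepts a session from either IDP and returns the same
// Principal shape, so downstream guards cannot tell which one answered.
class DualReadIdentityProvider implements SessionResolver {
  private readonly newIdp: SessionResolver;
  private readonly legacyIdp: SessionResolver;
  private readonly onLegacyRead: (principal: Principal) => void;

  constructor(
    newIdp: SessionResolver,
    legacyIdp: SessionResolver,
    onLegacyRead: (principal: Principal) => void,
  ) {
    this.newIdp = newIdp;
    this.legacyIdp = legacyIdp;
    this.onLegacyRead = onLegacyRead;
  }

  async resolveSession(headers: RequestHeaders): Promise<Principal | null> {
    const fromNew = await this.newIdp.resolveSession(headers);
    if (fromNew !== null) return fromNew;

    const fromLegacy = await this.legacyIdp.resolveSession(headers);
    if (fromLegacy !== null) this.onLegacyRead(fromLegacy); // metric for cutover decisions
    return fromLegacy;
  }
}
```

Rollback is the same wiring in reverse: bind the legacy resolver alone and the sessions it issued remain valid.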

Phase 4 — Cut over by realm, smallest blast radius first

Order: operators (Realm 3, smallest, highest control) → employer/FI admins (Realm 2) → consumers (Realm 1, largest). Per realm: enable new-IDP logins, monitor, hold dual-read for one full session-TTL window, then disable old logins for that realm.
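The per-realm gating can be modelled as a small state check. This sketch assumes three stages per realm and a 7-day session TTL (the upper bound quoted in Phase 3); the realm identifiers, stage names, and the rule that one realm finishes before the next starts are illustrative assumptions.

```typescript
type Realm = "operators" | "employer-admins" | "consumers";

// Smallest blast radius first, following the Phase 4 ordering.
const CUTOVER_ORDER: readonly Realm[] = ["operators", "employer-admins", "consumers"];

// Assumed session TTL: 7 days.
const SESSION_TTL_MS = 7 * 24 * 60 * 60 * 1000;

type RealmState =
  | { stage: "legacy" } // old logins only
  | { stage: "dual"; newLoginsEnabledAtMs: number } // new logins on, dual-read held
  | { stage: "new-only" }; // old logins disabled

// Old logins may be disabled only after dual-read has been held for one
// full session-TTL window.
function mayDisableLegacyLogins(state: RealmState, nowMs: number): boolean {
  return state.stage === "dual" && nowMs - state.newLoginsEnabledAtMs >= SESSION_TTL_MS;
}

// Returns the next realm to start cutting over, or null if a realm is still
// in its dual-run window or every realm is done.
function nextRealmToStart(states: Record<Realm, RealmState>): Realm | null {
  for (const realm of CUTOVER_ORDER) {
    const stage = states[realm].stage;
    if (stage === "legacy") return realm;
    if (stage === "dual") return null; // assumption: finish one realm before the next
  }
  return null;
}
```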

Phase 5 — Decommission

After all realms cut over + one TTL window with zero old-session reads: remove the Better Auth adapter, drop unused columns/tables behind a migration (identity tables are not ADR-009 financial rows, so deletion is allowed — see schema.prisma:116).
Keep the AuthEvent history intact across the boundary (it was never IDP-coupled).

Session migration strategy (summary)

| Concern | Strategy |
| --- | --- |
| Existing sessions | Honoured until expiry (dual-read); never bulk-revoked |
| New sessions | Issued by new IDP from cutover |
| "Log out everywhere" | Works in both during dual-run |
| Rollback | Flip the feature flag back to old adapter; old sessions still valid |

Tenant migration strategy

There is none — by design. Tenancy = activeOrganizationId → ALS → RLS at the DB. It is IDP-independent. The new IDP only needs to populate Principal.tenantId. Do not couple the RLS layer to the IDP migration; keeping them orthogonal is what makes this safe.
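A sketch of why the tenancy path does not care which IDP produced the session: the only input is `Principal.tenantId`, carried through Node's `AsyncLocalStorage`. The helper names are hypothetical, and the step that hands the tenant id to the database for RLS is deliberately left out because it depends on the existing data layer.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Principal {
  userId: string;
  tenantId: string | null; // populated by whichever IDP adapter resolved the session
}

interface TenantContext {
  tenantId: string;
}

const tenantStorage = new AsyncLocalStorage<TenantContext>();

// Runs `fn` with the principal's tenant bound to the async call chain.
function runWithTenant<T>(principal: Principal, fn: () => T): T {
  if (principal.tenantId === null) {
    throw new Error("no active organization on principal");
  }
  return tenantStorage.run({ tenantId: principal.tenantId }, fn);
}

// Read side, used by the data layer before issuing queries. Fails closed:
// no tenant context means no query.
function currentTenantId(): string {
  const ctx = tenantStorage.getStore();
  if (ctx === undefined) {
    throw new Error("tenant context missing");
  }
  return ctx.tenantId;
}
```

Nothing in this path imports an auth library, which is the property the migration relies on.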

Migration risk matrix

See AUTH_RISK_MATRIX.md §D.

FINAL CTO RECOMMENDATION

Keep Better Auth. Isolate it behind a vendor-neutral identity port now. Spend your auth budget closing the five real gaps — not on a rewrite.

Better Auth is an acceptable foundation for pilot and early production, and the integration here is above-average: session-derived identity, headers rejected as identity, RLS-backed tenancy, single role source, fail-closed guards. The library is not the risk. The risk is the absence of controls that every auth approach needs and that no library hands you for free:

Service-to-service auth (gRPC is createInsecure() — the money tier is unauthenticated). Fix first.
Admin MFA + the elimination of standing cross-tenant superuser.
Rate limiting / brute-force protection on auth endpoints.
A production SMS provider for the primary (phone-OTP) market path.
An immutable, hash-chained auth-event audit log.

Rewriting auth (Option C) is not justified — six-to-nine months to reach parity while re-opening solved account-takeover bugs, and "we rolled our own auth" weakens, not strengthens, a bank/regulator review. Forking (Option B) is a contingency, not a plan. A dedicated IAM (Option D, likely Ory + WorkOS/Auth0 for enterprise SSO) is the eventual destination, adopted at the scale-up gate on a concrete trigger — and the isolation port built today makes that move a new adapter, not a rewrite.

"What I would do if this were my fintech"

Treat the gRPC createInsecure() finding as a sev-1 architectural defect and fix it before the next employer goes live. Unauthenticated cleartext RPC to the ledger is the thing that ends a partner-bank relationship.
Build the IdentityProvider / AuthEventSink ports in the next sprint. Cheap now, priceless later. It's the difference between "swap an adapter" and "rewrite auth under regulatory deadline pressure."
Make MFA non-negotiable for my own staff on day one, and design impersonation + dual-control before anyone builds an ad-hoc admin backdoor. Insider/operator compromise is the most likely way a fintech like this actually loses money.
Ship phone-OTP with a real provider + rate limiting, or not at all. A primary auth path that silently drops codes and has no brute-force protection is worse than a known-disabled feature.
Keep tenancy at the database (RLS) forever, decoupled from whatever IDP I run. That single decision is why an auth migration here can be low-risk. Never regress it.
Pick Ory for the consumer/operator realms and WorkOS/Auth0 only for enterprise-employer SSO when SSO/SCIM is demanded — never put millions of low-ARPU salary workers on per-MAU vendor pricing.

Recommended next 30 / 90 / 180-day auth roadmap

Next 30 days — stop the bleeding (the 🔴s)

R1: mTLS + service token on all API↔ledger / API↔gateway gRPC; remove every createInsecure(); ledger authorizes callers.
R3/R4: rate limiting + lockout on /api/auth/* at the Express/edge layer; basic WAF + observability on the auth subtree.
R2 (start): wire the TOTP plugin; make MFA mandatory for platform admins.
Abstraction (start): create packages/shared/identity ports; route SessionMiddleware through the IdentityProvider adapter; add the ESLint boundary (only *.adapter.ts imports better-auth).
Decision record: write ADR-014, "Identity foundation: Better Auth, isolated, with scale-up migration triggers."
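For the R3/R4 item, a minimal in-memory sketch of failure counting with lockout. The thresholds, the class name, and the choice of key (for example phone number plus client IP) are assumptions; a real deployment would keep this state in a shared store so that every API instance sees the same counters, and would sit alongside whatever rate limiting Better Auth itself is configured with.

```typescript
interface AttemptState {
  failures: number;
  windowStartMs: number;
  lockedUntilMs: number;
}

class LoginThrottle {
  private readonly state = new Map<string, AttemptState>();
  private readonly maxFailures: number;
  private readonly windowMs: number;
  private readonly lockoutMs: number;

  // Defaults are illustrative: 5 failures per minute locks the key for 15 minutes.
  constructor(maxFailures = 5, windowMs = 60_000, lockoutMs = 15 * 60_000) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.lockoutMs = lockoutMs;
  }

  // Check before processing a login or OTP attempt.
  isAllowed(key: string, nowMs: number): boolean {
    const s = this.state.get(key);
    return s === undefined || nowMs >= s.lockedUntilMs;
  }

  recordFailure(key: string, nowMs: number): void {
    const prev = this.state.get(key);
    // Start a fresh counting window if there is none or the old one expired.
    const s: AttemptState =
      prev !== undefined && nowMs - prev.windowStartMs < this.windowMs
        ? prev
        : { failures: 0, windowStartMs: nowMs, lockedUntilMs: prev ? prev.lockedUntilMs : 0 };
    s.failures += 1;
    if (s.failures >= this.maxFailures) {
      s.lockedUntilMs = nowMs + this.lockoutMs;
      s.failures = 0;
      s.windowStartMs = nowMs;
    }
    this.state.set(key, s);
  }

  // A successful login clears the counters for that key.
  recordSuccess(key: string): void {
    this.state.delete(key);
  }
}
```

Because /api/auth/* is mounted before Nest, the check has to run in an Express or edge middleware in front of that subtree rather than in a Nest guard.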

Next 90 days — production-grade controls

R2 (finish): replace unconditional platform-admin bypass with scoped, time-boxed, dual-controlled grants; design + ship impersonation with explicit start/end audit.
R5: real SMS provider behind SmsSender + OTP rate limiting; prod alert if LoggingSmsSender is active.
R6: append-only hash-chained AuthEvent log via AuthEventSink (LOGIN, MFA, ROLE_GRANT, ADMIN_ACTION, IMPERSONATION…).
R7: webhook nonce store (single-use within skew window).
R8: step-up MFA for high-risk money actions.
R10/R12: test/lint guardrail asserting every money route has RBAC + middleware coverage; reconsider default-applied middleware.
R11/R15/R17: collapse dual role model; strengthen password policy (breach check); delete legacy shared/auth/auth.ts.
Secrets: move BETTER_AUTH_SECRET, signing keys, partner keys to Vault/KMS with rotation runbooks.
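The R7 nonce store can be sketched in memory as below. The class and method names are hypothetical. Each nonce is kept until its signed timestamp would itself be rejected as stale, which is one way to realise "single-use within skew window"; with Redis, the equivalent would be an atomic set-if-absent with an expiry.

```typescript
class WebhookNonceStore {
  private readonly expiryByNonce = new Map<string, number>();
  private readonly skewMs: number;

  constructor(skewMs: number) {
    this.skewMs = skewMs;
  }

  // Returns true only for the first use of a nonce whose signed timestamp is
  // within the allowed clock skew. Call this after the signature is verified.
  accept(nonce: string, timestampMs: number, nowMs: number): boolean {
    if (Math.abs(nowMs - timestampMs) > this.skewMs) return false; // stale or future-dated

    this.evictExpired(nowMs);
    if (this.expiryByNonce.has(nonce)) return false; // replay

    // Keep the nonce for as long as this timestamp would still be accepted.
    this.expiryByNonce.set(nonce, timestampMs + this.skewMs);
    return true;
  }

  private evictExpired(nowMs: number): void {
    this.expiryByNonce.forEach((expiryMs, nonce) => {
      if (expiryMs < nowMs) this.expiryByNonce.delete(nonce);
    });
  }
}
```

Once an entry expires, a replay of the same request is rejected by the timestamp check instead, so the store never needs to hold a nonce longer than twice the skew.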

Next 180 days — scale & enterprise readiness

R9: Redis session cache behind the IdentityProvider adapter.
R14: evaluate multi-region session strategy; confirm whether a scale-up trigger has fired.
ABAC: introduce policy-as-code (OPA/Cedar or Ory Keto) behind AuthorizationService; implement maker/checker separation-of-duties for payroll disbursement, loan disbursement, and ledger adjustments.
Realms: formalize the 4-realm model; prepare enterprise-employer SSO (OIDC/SAML) + SCIM via WorkOS/Auth0 adapter.
Device trust + passkeys roadmap for admins (phishing-resistant MFA).
Migration readiness review: if a trigger fired, execute Track 2 Phase 1–2 (shadow IDP + lazy credential migration). If not, document the deferral and re-evaluate quarterly.
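As an illustration of the maker/checker item, a sketch of a `can` decision that enforces separation of duties. The action strings, the `checker` role name, and the context shape are hypothetical; a policy engine such as OPA, Cedar, or Ory Keto would express the same rule in its own language behind the `AuthorizationService` port.

```typescript
interface Principal {
  userId: string;
  tenantId: string;
  roles: readonly string[];
}

interface Resource {
  tenantId: string;
}

interface ApprovalContext {
  makerId: string; // user who created the request being approved
}

// Hypothetical action names for the three money movements listed above.
const DUAL_CONTROL_ACTIONS = new Set<string>([
  "payroll.disbursement.approve",
  "loan.disbursement.approve",
  "ledger.adjustment.approve",
]);

function can(
  principal: Principal,
  action: string,
  resource: Resource,
  ctx: ApprovalContext,
): boolean {
  if (principal.tenantId !== resource.tenantId) return false; // never cross tenants
  if (!DUAL_CONTROL_ACTIONS.has(action)) return false; // default deny for unknown actions
  if (!principal.roles.includes("checker")) return false; // hypothetical role name
  return ctx.makerId !== principal.userId; // the maker cannot approve their own request
}
```

The final line is the separation-of-duties rule; the first three keep the decision fail-closed, in line with the guards already in place.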

Cadence note: re-run this review at each gate (post-pilot, pre-scale, pre-enterprise-employer). The triggers in LONG_TERM_IAM §4 are the decision criteria — let events, not calendars, drive the A→D move.
