AUTH_MIGRATION_STRATEGY.md — DemozPay
Companion to AUTH_SYSTEM_REVIEW.md, AUTH_RISK_MATRIX.md, LONG_TERM_IAM_ARCHITECTURE.md. Two tracks: Track 1 (Better Auth stays: isolate + harden) and Track 2 (replace later: phased, triggered).
The chosen path is Track 1 now, Track 2 only on a trigger (triggers in LONG_TERM_IAM §4). Track 1 is also the prerequisite that makes Track 2 cheap — do it regardless.
Track 1 — Better Auth remains: isolate safely + avoid lock-in
1.1 What to abstract NOW (the lock-in firebreak)
Build packages/shared/identity with three vendor-neutral ports, and confine every better-auth import to a single adapter file:
| Port | Contract | Replaces today's coupling |
|---|---|---|
| IdentityProvider | resolveSession(headers) → Principal \| null, verifyOtp(...), signOut(...) | session.middleware.ts importing better-auth/node; getSession() calls |
| AuthorizationService | can(principal, action, resource, ctx) → boolean | hard-coded role checks in org-role.guard.ts |
| AuthEventSink | record(authEvent) (append-only, hash-chained) | (new; nothing today) |
Rule to enforce in ESLint module boundaries: only apps/api/src/identity/auth/*.adapter.ts may import better-auth. Everything else imports @demoz-pay/shared-identity. This is the same boundary discipline ADR-011 already applies to domains — extend it to the vendor.
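One way to express this boundary is ESLint's core `no-restricted-imports` rule in a flat config. This is a sketch, not the project's actual lint setup: the repository may already use a dedicated module-boundary plugin, the file globs are inferred from the rule stated above, and the TypeScript parser setup is omitted.

```typescript
// eslint.config.ts (flat config). Globs and message text are assumptions.
const restrictVendorImports = {
  "no-restricted-imports": [
    "error",
    {
      patterns: [
        {
          // Blocks both the package root and its subpaths (e.g. better-auth/node).
          group: ["better-auth", "better-auth/*"],
          message:
            "Import @demoz-pay/shared-identity instead; only *.adapter.ts may import better-auth.",
        },
      ],
    },
  ],
};

export default [
  // Default: no file may import the vendor SDK.
  { files: ["**/*.ts"], rules: restrictVendorImports },
  // Later entries win for matching files, so the adapter files are exempt.
  {
    files: ["apps/api/src/identity/auth/*.adapter.ts"],
    rules: { "no-restricted-imports": "off" },
  },
];
```

With this shape, a stray `import ... from "better-auth/node"` in domain code fails lint rather than surfacing in review.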
Result: a future swap is "write ory.identity-provider.ts, flip the DI binding" — not a domain rewrite.
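A minimal sketch of the port and the binding flip, under stated assumptions: the `Principal` fields, the bearer-token lookup, and `selectIdentityProvider` are all hypothetical, and `resolveSession` is kept synchronous for brevity (a real adapter would likely return a Promise and read a session cookie).

```typescript
interface Principal {
  userId: string;
  tenantId: string | null;
  roles: readonly string[];
}

type RequestHeaders = Record<string, string | undefined>;

// The port: domain code depends on this and never on a vendor SDK.
interface IdentityProvider {
  resolveSession(headers: RequestHeaders): Principal | null;
}

// Hypothetical stand-in for the single vendor adapter file. A real adapter
// would call the vendor SDK here; this one looks tokens up in a map.
class InMemoryIdentityProvider implements IdentityProvider {
  private readonly sessions: Map<string, Principal>;

  constructor(sessions: Map<string, Principal>) {
    this.sessions = sessions;
  }

  resolveSession(headers: RequestHeaders): Principal | null {
    const header = headers["authorization"];
    if (header === undefined || !header.startsWith("Bearer ")) return null;
    return this.sessions.get(header.slice("Bearer ".length)) ?? null;
  }
}

// "Flipping the DI binding": the active implementation is chosen by name,
// so swapping vendors changes configuration, not domain code.
function selectIdentityProvider(
  name: string,
  registry: Record<string, IdentityProvider>,
): IdentityProvider {
  const impl = registry[name];
  if (impl === undefined) throw new Error(`No IdentityProvider bound for "${name}"`);
  return impl;
}
```

A second adapter registered under another name would be selected the same way, which is what keeps the swap confined to one file plus one binding.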
1.2 Hardening backlog (close the 🔴/🟠 risks — see matrix)
Priority order = risk score:
- R1 (S2S auth): mTLS + service token on all gRPC channels; remove createInsecure(). Ledger authorizes callers.
- R3: rate limiting + lockout on /api/auth/* (Better Auth rate-limit config + an edge/Express limiter, since the subtree is pre-Nest).
- R4 (bring auth endpoints under shared controls): put a rate-limiter + WAF/observability in front of /api/auth/* at the Express/edge layer (it cannot ride Nest guards).
- R2 (admin MFA + scope): wire TOTP, make MFA mandatory for platform admins, replace unconditional superuser bypass with scoped time-boxed grants + impersonation audit.
- R8: wire the TOTP plugin (table already exists).
- R5: real SMS provider (AfricasTalking/Ethio aggregator) behind the existing SmsSender port + OTP rate limit; alert on LoggingSmsSender in prod.
- R6: AuthEvent log (append-only, hash-chained) via AuthEventSink.
- R7: webhook nonce store (Redis, TTL = skew window).
- R9: session cache (short-TTL Redis) in the IdentityProvider adapter.
- R10/R12 (guardrails): a unit/integration test asserting every non-@Public() money route carries an RBAC decorator and is in the middleware allow-list (or move to a default-applied middleware).
- R11: collapse the dual role model; R15: password policy; R17: delete legacy shared/auth/auth.ts.
None of these require leaving Better Auth. Most are additions around it.
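To illustrate what "append-only, hash-chained" could mean for the R6 AuthEvent log, here is a small in-memory sketch. The event fields, the all-zero genesis value, and the class name are assumptions; a production sink would persist entries and likely anchor the chain externally.

```typescript
import { createHash } from "node:crypto";

interface AuthEvent {
  type: string; // e.g. "LOGIN", "ROLE_GRANT"
  actorId: string;
  at: string; // ISO-8601 timestamp
}

interface ChainedAuthEvent extends AuthEvent {
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64); // assumed seed for the first entry

// Each hash covers the previous hash plus the event fields, so editing or
// removing any earlier entry invalidates every later one.
function hashEvent(prevHash: string, e: AuthEvent): string {
  return createHash("sha256")
    .update(JSON.stringify([prevHash, e.type, e.actorId, e.at]))
    .digest("hex");
}

class InMemoryAuthEventSink {
  private readonly log: ChainedAuthEvent[] = [];
  private lastHash = GENESIS_HASH;

  record(e: AuthEvent): ChainedAuthEvent {
    const entry: ChainedAuthEvent = { ...e, prevHash: this.lastHash, hash: hashEvent(this.lastHash, e) };
    this.log.push(entry);
    this.lastHash = entry.hash;
    return entry;
  }

  entries(): readonly ChainedAuthEvent[] {
    return this.log;
  }
}

// Recomputes the chain from the genesis value; false means tampering or a gap.
function verifyChain(entries: readonly ChainedAuthEvent[]): boolean {
  let prev = GENESIS_HASH;
  for (const e of entries) {
    if (e.prevHash !== prev || e.hash !== hashEvent(prev, e)) return false;
    prev = e.hash;
  }
  return true;
}
```

Because the chain depends only on event content, it stays verifiable regardless of which IDP produced the login.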
Track 2 — Replace later: safest phased migration
Trigger-gated (LONG_TERM_IAM §4). When a trigger fires, execute the strangler pattern — never a big-bang cutover of money infrastructure.
Phase 0 — Prerequisite (done in Track 1)
Abstraction ports exist; domain code is vendor-neutral; AuthEvent log already independent of the IDP. Migration cannot safely start until this holds.
Phase 1 — Stand up the new IDP in shadow
- Deploy target (e.g. Ory Kratos for consumer/operator realms; WorkOS/Auth0 for enterprise-employer SSO).
- New IDP runs read-only / shadow: no production traffic; identities provisioned/synced from existing User/Member.
- Write a second IdentityProvider adapter; behind a feature flag, off in prod.
Phase 2 — Migrate credentials & identities
- Passwords: do not force reset. Keep Account.password (bcrypt) readable; on first login against the new IDP, verify against the old hash, then rehash into the new store (lazy migration). Users never notice.
- Phone/email + verification state: copy emailVerified, phoneNumberVerified.
- Org/member: preserve Organization.id == Business.id; map Member(user, org, role) 1:1. Run a reconciliation job that diffs old vs new continuously and alerts on drift.
- MFA: TOTP secrets migrate if format-compatible; otherwise prompt re-enrolment at next login (acceptable, security-positive).
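The lazy password migration described in this phase can be sketched as follows. The hashing functions are injected rather than tied to a specific library, and every name here (`Hashers`, `LazyPasswordMigrator`) is hypothetical; real password verification is usually asynchronous, which this sketch ignores for brevity.

```typescript
type VerifyFn = (password: string, hash: string) => boolean;
type HashFn = (password: string) => string;

interface Hashers {
  verifyLegacy: VerifyFn; // e.g. a comparison against the old Account.password hash
  verifyNew: VerifyFn;
  hashNew: HashFn;
}

class LazyPasswordMigrator {
  private readonly legacyHashes: Map<string, string>;
  private readonly newHashes = new Map<string, string>();
  private readonly hashers: Hashers;

  constructor(legacyHashes: Map<string, string>, hashers: Hashers) {
    this.legacyHashes = legacyHashes;
    this.hashers = hashers;
  }

  login(userId: string, password: string): boolean {
    // Already migrated: only the new store is consulted.
    const migrated = this.newHashes.get(userId);
    if (migrated !== undefined) return this.hashers.verifyNew(password, migrated);

    // Not migrated: the old hash must verify before anything is written.
    const legacy = this.legacyHashes.get(userId);
    if (legacy === undefined || !this.hashers.verifyLegacy(password, legacy)) return false;

    // The plaintext is only available at login, so this is the rehash point.
    this.newHashes.set(userId, this.hashers.hashNew(password));
    return true;
  }

  isMigrated(userId: string): boolean {
    return this.newHashes.has(userId);
  }
}
```

A failed login never writes to the new store, so a wrong guess cannot overwrite or pre-empt a user's credential.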
Phase 3 — Dual-run sessions (backward compatibility)
- SessionMiddleware's adapter dual-reads: accept a valid old Better Auth session OR a new IDP session. Both resolve to the same Principal.
- New logins issue new sessions; old sessions remain valid until natural expiry (≤7 days). Expire, don't revoke, so there is no mass re-login event.
- Tenant context + RLS are unchanged — they live at the DB/ALS layer, decoupled from the IDP. This is the key reason the migration is low-risk: the money-isolation control never moves.
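A sketch of the dual-read adapter, assuming a synchronous `IdentityProvider` port and a hypothetical `acceptLegacy` flag that a per-realm cutover would switch off. The class name and the choice to try the new IDP first are illustrative.

```typescript
interface Principal {
  userId: string;
  tenantId: string | null;
}

type RequestHeaders = Record<string, string | undefined>;

interface IdentityProvider {
  resolveSession(headers: RequestHeaders): Principal | null;
}

// Composite adapter: tries the new IDP first, then falls back to the old one
// while legacy sessions are still being honoured.
class DualReadIdentityProvider implements IdentityProvider {
  private readonly next: IdentityProvider;
  private readonly legacy: IdentityProvider;
  private readonly acceptLegacy: () => boolean;

  constructor(next: IdentityProvider, legacy: IdentityProvider, acceptLegacy: () => boolean) {
    this.next = next;
    this.legacy = legacy;
    this.acceptLegacy = acceptLegacy;
  }

  resolveSession(headers: RequestHeaders): Principal | null {
    const fromNew = this.next.resolveSession(headers);
    if (fromNew !== null) return fromNew;
    // Fail closed once the legacy path is switched off.
    return this.acceptLegacy() ? this.legacy.resolveSession(headers) : null;
  }
}
```

Callers receive the same `Principal` shape from either path, so guards and tenant context need no changes during the dual-run window.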
Phase 4 — Cut over by realm, smallest blast radius first
Order: operators (Realm 3, smallest, highest control) → employer/FI admins (Realm 2) → consumers (Realm 1, largest). Per realm: enable new-IDP logins, monitor, hold dual-read for one full session-TTL window, then disable old logins for that realm.
Phase 5 — Decommission
- After all realms cut over + one TTL window with zero old-session reads: remove the Better Auth adapter, drop unused columns/tables behind a migration (identity tables are not ADR-009 financial rows, so deletion is allowed; see schema.prisma:116).
- Keep the AuthEvent history intact across the boundary (it was never IDP-coupled).
Session migration strategy (summary)
| Concern | Strategy |
|---|---|
| Existing sessions | Honoured until expiry (dual-read); never bulk-revoked |
| New sessions | Issued by new IDP from cutover |
| "Log out everywhere" | Works in both during dual-run |
| Rollback | Flip the feature flag back to old adapter; old sessions still valid |
Tenant migration strategy
There is none — by design. Tenancy = activeOrganizationId → ALS → RLS at the DB. It is IDP-independent. The new IDP only needs to populate Principal.tenantId. Do not couple the RLS layer to the IDP migration; keeping them orthogonal is what makes this safe.
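The ALS-to-RLS hand-off can be sketched with Node's `AsyncLocalStorage`. The setting name `app.tenant_id`, the helper names, and the use of PostgreSQL's `set_config` are assumptions for illustration; the project's actual RLS wiring may differ.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Principal {
  userId: string;
  tenantId: string | null;
}

interface TenantContext {
  tenantId: string;
}

const tenantStorage = new AsyncLocalStorage<TenantContext>();

// The only IDP-facing step: copy Principal.tenantId into the request context.
function runAsTenant<T>(principal: Principal, work: () => T): T {
  if (principal.tenantId === null) throw new Error("No active organization on principal");
  return tenantStorage.run({ tenantId: principal.tenantId }, work);
}

// Fails closed: code running outside a tenant context cannot obtain a tenant id.
function currentTenantId(): string {
  const ctx = tenantStorage.getStore();
  if (ctx === undefined) throw new Error("Tenant context missing: refusing to query");
  return ctx.tenantId;
}

// Parameterized statement a DB layer could issue at the start of each
// transaction so RLS policies can read the tenant via current_setting().
// The third argument (true) scopes the setting to the current transaction.
function tenantSettingStatement(): { text: string; values: string[] } {
  return {
    text: "SELECT set_config('app.tenant_id', $1, true)",
    values: [currentTenantId()],
  };
}
```

Nothing below `runAsTenant` knows which IDP produced the `Principal`, which is the orthogonality this section relies on.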
Migration risk matrix
See AUTH_RISK_MATRIX.md §D.
FINAL CTO RECOMMENDATION
Keep Better Auth. Isolate it behind a vendor-neutral identity port now. Spend your auth budget closing the five real gaps — not on a rewrite.
Better Auth is an acceptable foundation for pilot and early production, and the integration here is above-average: session-derived identity, headers rejected as identity, RLS-backed tenancy, single role source, fail-closed guards. The library is not the risk. The risk is the absence of controls that every auth approach needs and that no library hands you for free:
- Service-to-service auth (gRPC is createInsecure(), so the money tier is unauthenticated). Fix first.
- Admin MFA + the elimination of standing cross-tenant superuser.
- Rate limiting / brute-force protection on auth endpoints.
- A production SMS provider for the primary (phone-OTP) market path.
- An immutable, hash-chained auth-event audit log.
Rewriting auth (Option C) is not justified — six-to-nine months to reach parity while re-opening solved account-takeover bugs, and "we rolled our own auth" weakens, not strengthens, a bank/regulator review. Forking (Option B) is a contingency, not a plan. A dedicated IAM (Option D, likely Ory + WorkOS/Auth0 for enterprise SSO) is the eventual destination, adopted at the scale-up gate on a concrete trigger — and the isolation port built today makes that move a new adapter, not a rewrite.
"What I would do if this were my fintech"
- Treat the gRPC createInsecure() finding as a sev-1 architectural defect and fix it before the next employer goes live. Unauthenticated cleartext RPC to the ledger is the thing that ends a partner-bank relationship.
- Build the IdentityProvider / AuthEventSink ports in the next sprint. Cheap now, priceless later. It's the difference between "swap an adapter" and "rewrite auth under regulatory deadline pressure."
- Make MFA non-negotiable for my own staff on day one, and design impersonation + dual-control before anyone builds an ad-hoc admin backdoor. Insider/operator compromise is the most likely way a fintech like this actually loses money.
- Ship phone-OTP with a real provider + rate limiting, or not at all. A primary auth path that silently drops codes and has no brute-force protection is worse than a known-disabled feature.
- Keep tenancy at the database (RLS) forever, decoupled from whatever IDP I run. That single decision is why an auth migration here can be low-risk. Never regress it.
- Pick Ory for the consumer/operator realms and WorkOS/Auth0 only for enterprise-employer SSO when SSO/SCIM is demanded — never put millions of low-ARPU salary workers on per-MAU vendor pricing.
Recommended next 30 / 90 / 180-day auth roadmap
Next 30 days — stop the bleeding (the 🔴s)
- R1: mTLS + service token on all API↔ledger / API↔gateway gRPC; remove every createInsecure(); ledger authorizes callers.
- R3/R4: rate limiting + lockout on /api/auth/* at the Express/edge layer; basic WAF + observability on the auth subtree.
- R2 (start): wire the TOTP plugin; make MFA mandatory for platform admins.
- Abstraction (start): create packages/shared/identity ports; route SessionMiddleware through the IdentityProvider adapter; add the ESLint boundary (only *.adapter.ts imports better-auth).
- Decision record: write an ADR-014 "Identity foundation: Better Auth, isolated, with scale-up migration triggers."
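For the R3/R4 item, a minimal fixed-window limiter shows the core logic an Express or edge layer would need in front of /api/auth/*. It is an in-memory sketch with hypothetical names; a multi-instance deployment would need a shared store, and the key format (for example IP plus phone number) is an assumption.

```typescript
interface WindowState {
  count: number;
  windowStart: number; // ms timestamp of the first hit in the window
}

class FixedWindowLimiter {
  private readonly hits = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the attempt is allowed. `now` is passed in so the
  // behaviour is deterministic and easy to test.
  allow(key: string, now: number): boolean {
    const state = this.hits.get(key);
    if (state === undefined || now - state.windowStart >= this.windowMs) {
      // First hit, or the previous window has elapsed: start a new window.
      this.hits.set(key, { count: 1, windowStart: now });
      return true;
    }
    state.count += 1;
    return state.count <= this.limit;
  }
}
```

A middleware would call `allow(key, Date.now())` before handing the request to the auth handler and respond with HTTP 429 when it returns false.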
Next 90 days — production-grade controls
- R2 (finish): replace unconditional platform-admin bypass with scoped, time-boxed, dual-controlled grants; design + ship impersonation with explicit start/end audit.
- R5: real SMS provider behind SmsSender + OTP rate limiting; prod alert if LoggingSmsSender is active.
- R6: append-only hash-chained AuthEvent log via AuthEventSink (LOGIN, MFA, ROLE_GRANT, ADMIN_ACTION, IMPERSONATION…).
- R7: webhook nonce store (single-use within skew window).
- R8: step-up MFA for high-risk money actions.
- R10/R12: test/lint guardrail asserting every money route has RBAC + middleware coverage; reconsider default-applied middleware.
- R11/R15/R17: collapse dual role model; strengthen password policy (breach check); delete legacy shared/auth/auth.ts.
- Secrets: move BETTER_AUTH_SECRET, signing keys, partner keys to Vault/KMS with rotation runbooks.
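The R7 nonce store ("single-use within skew window") can be sketched in memory as below. The class and parameter names are hypothetical; in Redis the same effect is commonly achieved with a `SET` using the `NX` and `PX` options, with the TTL covering the skew window.

```typescript
class WebhookNonceStore {
  // nonce -> timestamp (ms) after which the nonce can be forgotten
  private readonly seen = new Map<string, number>();
  private readonly skewMs: number;

  constructor(skewMs: number) {
    this.skewMs = skewMs;
  }

  // Returns true only for the first use of a nonce whose timestamp is
  // within the allowed clock skew of `nowMs`.
  accept(nonce: string, sentAtMs: number, nowMs: number): boolean {
    if (Math.abs(nowMs - sentAtMs) > this.skewMs) return false; // stale or far-future

    // Evict nonces whose messages could no longer pass the skew check anyway.
    this.seen.forEach((expiresAt, n) => {
      if (expiresAt < nowMs) this.seen.delete(n);
    });

    if (this.seen.has(nonce)) return false; // replay
    // Keep the nonce for as long as its timestamp stays acceptable.
    this.seen.set(nonce, sentAtMs + this.skewMs);
    return true;
  }
}
```

Tying the retention time to the message timestamp keeps the store bounded while ensuring an evicted nonce can never be replayed, because its timestamp would already fail the skew check.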
Next 180 days — scale & enterprise readiness
- R9: Redis session cache behind the IdentityProvider adapter.
- R14: evaluate multi-region session strategy; confirm whether a scale-up trigger has fired.
- ABAC: introduce policy-as-code (OPA/Cedar or Ory Keto) behind AuthorizationService; implement maker/checker separation-of-duties for payroll disbursement, loan disbursement, and ledger adjustments.
- Realms: formalize the 4-realm model; prepare enterprise-employer SSO (OIDC/SAML) + SCIM via WorkOS/Auth0 adapter.
- Device trust + passkeys roadmap for admins (phishing-resistant MFA).
- Migration readiness review: if a trigger fired, execute Track 2 Phase 1–2 (shadow IDP + lazy credential migration). If not, document the deferral and re-evaluate quarterly.
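As an illustration of the maker/checker rule behind `AuthorizationService`, here is a hand-written policy sketch. The action names, the `maker`/`checker` roles, and the `Resource` shape are assumptions, the `ctx` argument is omitted for brevity, and a policy engine such as OPA or Cedar would replace the `switch` in practice.

```typescript
interface Principal {
  userId: string;
  tenantId: string | null;
  roles: readonly string[];
}

interface Resource {
  tenantId: string;
  createdBy: string; // userId of the maker
}

type Action = "payroll.disbursement.create" | "payroll.disbursement.approve";

function can(principal: Principal, action: Action, resource: Resource): boolean {
  // Tenant match comes first: no role can reach across tenants.
  if (principal.tenantId !== resource.tenantId) return false;

  switch (action) {
    case "payroll.disbursement.create":
      return principal.roles.includes("maker");
    case "payroll.disbursement.approve":
      // Separation of duties: the approver must not be the creator,
      // even if the same user holds both roles.
      return principal.roles.includes("checker") && principal.userId !== resource.createdBy;
    default:
      return false; // fail closed on anything unrecognized
  }
}
```

The self-approval check lives in the policy rather than in the controller, so it applies uniformly to payroll, loan, and ledger-adjustment flows once their actions are added.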
Cadence note: re-run this review at each gate (post-pilot, pre-scale, pre-enterprise-employer). The triggers in LONG_TERM_IAM §4 are the decision criteria — let events, not calendars, drive the A→D move.