AUTH_MIGRATION_STRATEGY.md — DemozPay

Companion to AUTH_SYSTEM_REVIEW.md, AUTH_RISK_MATRIX.md, LONG_TERM_IAM_ARCHITECTURE.md. Two tracks: Track 1 — Better Auth stays (isolate + harden). Track 2 — Replace later (phased, triggered).

Track 1 is the chosen path now; Track 2 starts only when a trigger fires (triggers are listed in LONG_TERM_IAM_ARCHITECTURE.md §4). Track 1 is also the prerequisite that makes Track 2 cheap, so do it regardless.


Track 1 — Better Auth remains: isolate safely + avoid lock-in

1.1 What to abstract NOW (the lock-in firebreak)

Build packages/shared/identity with three vendor-neutral ports, and confine every better-auth import to the adapter layer (the *.adapter.ts files in one directory):

| Port | Contract | Replaces today's coupling |
| --- | --- | --- |
| IdentityProvider | resolveSession(headers) → Principal \| null, verifyOtp(...), signOut(...) | session.middleware.ts importing better-auth/node; getSession() calls |
| AuthorizationService | can(principal, action, resource, ctx) → boolean | hard-coded role checks in org-role.guard.ts |
| AuthEventSink | record(authEvent) (append-only, hash-chained) | (new: nothing today) |

Rule to enforce in ESLint module boundaries: only apps/api/src/identity/auth/*.adapter.ts may import better-auth. Everything else imports @demoz-pay/shared-identity. This is the same boundary discipline ADR-011 already applies to domains — extend it to the vendor.
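
One way to express this boundary is ESLint's core no-restricted-imports rule in a flat config. This is a sketch, not the project's actual lint setup: it assumes flat config with a TypeScript parser configured elsewhere, and it uses the core rule rather than a dedicated module-boundaries plugin. The variable name and the message text are illustrative.

```typescript
// eslint.config.ts fragment (flat config). The glob comes from the rule
// stated above; the message wording is illustrative.
const betterAuthBoundary = [
  {
    files: ["**/*.ts"],
    // Adapter files are the only place the vendor import is tolerated.
    ignores: ["apps/api/src/identity/auth/*.adapter.ts"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          patterns: [
            {
              // Covers the package root and subpaths such as better-auth/node.
              group: ["better-auth", "better-auth/*"],
              message:
                "Import from @demoz-pay/shared-identity; only *.adapter.ts may import better-auth.",
            },
          ],
        },
      ],
    },
  },
];

export default betterAuthBoundary;
```

Because the config object excludes the adapter glob through ignores, the restriction applies to every other TypeScript file without needing a second override block.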

Result: a future swap is "write ory.identity-provider.ts, flip the DI binding" — not a domain rewrite.
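
A minimal sketch of the three ports listed above, to make the "flip the DI binding" claim concrete. The port and method names come from this document; everything else is an assumption for illustration: the fields on Principal and AuthEvent, the HeaderBag type, the parameters of verifyOtp and signOut, the async return types, and the RoleTableAuthorization class with its role and action names.

```typescript
// packages/shared/identity (sketch). Field names and the verifyOtp/signOut
// parameters are assumptions; only the port and method names are from the doc.

export interface Principal {
  userId: string;
  tenantId: string | null; // from activeOrganizationId; null if no org is active
  roles: readonly string[];
}

export type HeaderBag = Record<string, string | string[] | undefined>;

export interface IdentityProvider {
  resolveSession(headers: HeaderBag): Promise<Principal | null>;
  verifyOtp(phoneNumber: string, code: string): Promise<boolean>;
  signOut(headers: HeaderBag): Promise<void>;
}

export interface AuthorizationService {
  can(principal: Principal, action: string, resource: string, ctx: Record<string, unknown>): boolean;
}

export interface AuthEvent {
  type: string; // e.g. LOGIN, MFA, ROLE_GRANT
  actorId: string | null;
  occurredAt: string; // ISO 8601 timestamp
}

export interface AuthEventSink {
  record(authEvent: AuthEvent): Promise<void>;
}

// Hypothetical implementation: permissions live in data, not in guard code.
export class RoleTableAuthorization implements AuthorizationService {
  private readonly grants: Readonly<Record<string, readonly string[]>>;

  constructor(grants: Readonly<Record<string, readonly string[]>>) {
    this.grants = grants;
  }

  can(principal: Principal, action: string): boolean {
    // Deny unless some role held by the principal lists the action.
    return principal.roles.some((role) => (this.grants[role] ?? []).includes(action));
  }
}

// Callers depend on the port type; the DI container picks the implementation.
const authorization: AuthorizationService = new RoleTableAuthorization({
  owner: ["payroll:approve", "member:invite"],
  member: ["payslip:read"],
});
```

Domain code that only ever sees these interfaces has nothing to change when the adapter behind IdentityProvider is replaced.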

1.2 Hardening backlog (close the 🔴/🟠 risks — see matrix)

Priority order = risk score:

  1. R1 — S2S auth: mTLS + service token on all gRPC channels; remove createInsecure(). Ledger authorizes callers.
  2. R3 — rate limiting + lockout on /api/auth/* (Better Auth rate-limit config + an edge/Express limiter, since the subtree is pre-Nest).
  3. R4 — bring auth endpoints under shared controls: put a rate-limiter + WAF/observability in front of /api/auth/* at the Express/edge layer (it cannot ride Nest guards).
  4. R2 — admin MFA + scope: wire TOTP, make MFA mandatory for platform admins, replace unconditional superuser bypass with scoped time-boxed grants + impersonation audit.
  5. R8 — wire the TOTP plugin (table already exists).
  6. R5 — real SMS provider (AfricasTalking/Ethio aggregator) behind the existing SmsSender port + OTP rate limit; alert on LoggingSmsSender in prod.
  7. R6 — AuthEvent log (append-only, hash-chained) via AuthEventSink.
  8. R7 — webhook nonce store (Redis, TTL = skew window).
  9. R9 — session cache (short-TTL Redis) in the IdentityProvider adapter.
  10. R10/R12 — guardrails: a unit/integration test asserting every non-@Public() money route carries an RBAC decorator and is in the middleware allow-list (or move to a default-applied middleware).
  11. R11 — collapse the dual role model; R15 — password policy; R17 — delete legacy shared/auth/auth.ts.

None of these require leaving Better Auth. Most are additions around it.
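
As an illustration of the R3 item (rate limiting plus lockout), here is a minimal in-memory sketch. It is an assumption-laden stand-in, not Better Auth's own rate-limit feature: the class name, option names, and key format are hypothetical, and per-process memory would need to be replaced by a shared store (e.g. Redis) or an edge limiter in a multi-instance deployment.

```typescript
type LimiterDecision = { allowed: true } | { allowed: false; retryAfterMs: number };

interface LimiterOptions {
  maxAttempts: number; // failures tolerated per window
  windowMs: number; // length of the counting window
  lockoutMs: number; // how long the key stays locked once the budget is spent
}

interface AttemptEntry {
  windowStart: number;
  count: number;
  lockedUntil: number;
}

class AuthAttemptLimiter {
  private readonly entries = new Map<string, AttemptEntry>();
  private readonly options: LimiterOptions;

  constructor(options: LimiterOptions) {
    this.options = options;
  }

  /** Call before handling an attempt for `key` (e.g. IP plus phone number). */
  check(key: string, now: number = Date.now()): LimiterDecision {
    const entry = this.entries.get(key);
    if (entry && entry.lockedUntil > now) {
      return { allowed: false, retryAfterMs: entry.lockedUntil - now };
    }
    return { allowed: true };
  }

  /** Record a failed attempt; locks the key once the window budget is spent. */
  recordFailure(key: string, now: number = Date.now()): void {
    let entry = this.entries.get(key);
    if (!entry || now - entry.windowStart >= this.options.windowMs) {
      // Start a fresh counting window.
      entry = { windowStart: now, count: 0, lockedUntil: 0 };
      this.entries.set(key, entry);
    }
    entry.count += 1;
    if (entry.count >= this.options.maxAttempts) {
      entry.lockedUntil = now + this.options.lockoutMs;
    }
  }

  /** A successful login clears the failure history for the key. */
  recordSuccess(key: string): void {
    this.entries.delete(key);
  }
}
```

The clock is passed in as a parameter so lockout behaviour can be tested deterministically; an Express middleware in front of /api/auth/* would call check first and translate a denial into HTTP 429 with a Retry-After header.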


Track 2 — Replace later: safest phased migration

Trigger-gated (LONG_TERM_IAM §4). When a trigger fires, execute the strangler pattern — never a big-bang cutover of money infrastructure.

Phase 0 — Prerequisite (done in Track 1)

Abstraction ports exist; domain code is vendor-neutral; the AuthEvent log is already independent of the IDP. Migration cannot safely start until all three hold.

Phase 1 — Stand up the new IDP in shadow

  • Deploy the target IDP (e.g. Ory Kratos for consumer/operator realms; WorkOS/Auth0 for enterprise-employer SSO).
  • The new IDP runs read-only / shadow: it takes no production traffic, and its identities are provisioned/synced from the existing User/Member records.
  • Write a second IdentityProvider adapter and keep it behind a feature flag that is off in prod.

Phase 2 — Migrate credentials & identities

  • Passwords: do not force reset. Keep Account.password (bcrypt) readable; on first login against the new IDP, verify against the old hash, then rehash into the new store (lazy migration). Users never notice.
  • Phone/email + verification state: copy emailVerified, phoneNumberVerified.
  • Org/member: preserve Organization.id == Business.id; map Member(user, org, role) 1:1. Run a reconciliation job that diffs old vs new continuously and alerts on drift.
  • MFA: TOTP secrets migrate if format-compatible; otherwise prompt re-enrolment at next login (acceptable, security-positive).
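
The lazy password migration described above can be sketched as follows. This is a hedged illustration, not the migration code: the function and type names are hypothetical, the legacy verifier is injected because the real check for the old Account.password hash lives outside this sketch, and the new store is assumed to use scrypt from Node's crypto module purely as an example. Everything is synchronous for brevity; a real legacy hash check would more likely be async.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

interface LegacyAccount {
  userId: string;
  passwordHash: string; // the old Account.password value, kept readable
}

interface NewCredentialStore {
  get(userId: string): string | undefined;
  set(userId: string, encoded: string): void;
}

type LegacyVerifier = (password: string, legacyHash: string) => boolean;

function hashForNewStore(password: string): string {
  const salt = randomBytes(16);
  const key = scryptSync(password, salt, 32);
  return `scrypt$${salt.toString("hex")}$${key.toString("hex")}`;
}

function verifyNewStore(password: string, encoded: string): boolean {
  const [scheme, saltHex, keyHex] = encoded.split("$");
  if (scheme !== "scrypt" || !saltHex || !keyHex) return false;
  const expected = Buffer.from(keyHex, "hex");
  const actual = scryptSync(password, Buffer.from(saltHex, "hex"), expected.length);
  return timingSafeEqual(actual, expected);
}

function loginWithLazyMigration(
  userId: string,
  password: string,
  legacy: LegacyAccount | undefined,
  store: NewCredentialStore,
  verifyLegacyHash: LegacyVerifier,
): boolean {
  // Already migrated: the new store is authoritative.
  const migrated = store.get(userId);
  if (migrated !== undefined) return verifyNewStore(password, migrated);

  // Not migrated yet: the old hash decides, and only a correct password
  // is ever rehashed into the new store.
  if (!legacy || !verifyLegacyHash(password, legacy.passwordHash)) return false;
  store.set(userId, hashForNewStore(password));
  return true;
}
```

After the first successful login the legacy hash is no longer consulted for that user, which is what lets the old column be dropped once the reconciliation job shows no unmigrated active accounts.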

Phase 3 — Dual-run sessions (backward compatibility)

  • SessionMiddleware's adapter dual-reads: accept a valid old Better Auth session OR a new IDP session. Both resolve to the same Principal.
  • New logins issue new sessions; old sessions remain valid until natural expiry (≤7 days). Expire, don't revoke — no mass re-login event.
  • Tenant context + RLS are unchanged — they live at the DB/ALS layer, decoupled from the IDP. This is the key reason the migration is low-risk: the money-isolation control never moves.
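
The dual-read behaviour can be sketched as a composite adapter that tries the new IDP first and falls back to the old one. The class name, the SessionResolver interface, and the onOldSessionRead hook are hypothetical; the hook is included because Phase 5 needs a count of old-session reads before decommissioning.

```typescript
interface Principal {
  userId: string;
  tenantId: string | null;
}

type HeaderBag = Record<string, string | undefined>;

interface SessionResolver {
  resolveSession(headers: HeaderBag): Promise<Principal | null>;
}

class DualReadIdentityProvider implements SessionResolver {
  private readonly newIdp: SessionResolver;
  private readonly oldIdp: SessionResolver;
  private readonly onOldSessionRead: () => void;

  constructor(
    newIdp: SessionResolver,
    oldIdp: SessionResolver,
    onOldSessionRead: () => void = () => {},
  ) {
    this.newIdp = newIdp;
    this.oldIdp = oldIdp;
    this.onOldSessionRead = onOldSessionRead;
  }

  async resolveSession(headers: HeaderBag): Promise<Principal | null> {
    // New sessions win; both sources resolve to the same Principal shape.
    const fromNew = await this.newIdp.resolveSession(headers);
    if (fromNew) return fromNew;

    // Fall back to the old session until it expires naturally.
    const fromOld = await this.oldIdp.resolveSession(headers);
    if (fromOld) this.onOldSessionRead(); // metric for the decommission gate
    return fromOld;
  }
}
```

Rollback under this design is a binding change: bind the old adapter alone and existing old sessions keep resolving.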

Phase 4 — Cut over by realm, smallest blast radius first

Order: operators (Realm 3, smallest, highest control) → employer/FI admins (Realm 2) → consumers (Realm 1, largest). Per realm: enable new-IDP logins, monitor, hold dual-read for one full session-TTL window, then disable old logins for that realm.

Phase 5 — Decommission

  • After all realms cut over + one TTL window with zero old-session reads: remove the Better Auth adapter, drop unused columns/tables behind a migration (identity tables are not ADR-009 financial rows, so deletion is allowed — see schema.prisma:116).
  • Keep the AuthEvent history intact across the boundary (it was never IDP-coupled).

Session migration strategy (summary)

| Concern | Strategy |
| --- | --- |
| Existing sessions | Honoured until expiry (dual-read); never bulk-revoked |
| New sessions | Issued by new IDP from cutover |
| "Log out everywhere" | Works in both during dual-run |
| Rollback | Flip the feature flag back to old adapter; old sessions still valid |

Tenant migration strategy

There is none — by design. Tenancy = activeOrganizationId → ALS → RLS at the DB. It is IDP-independent. The new IDP only needs to populate Principal.tenantId. Do not couple the RLS layer to the IDP migration; keeping them orthogonal is what makes this safe.
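
To show why the tenancy chain does not care which IDP produced the Principal, here is a minimal sketch of the ALS step using Node's AsyncLocalStorage. The helper names and the app.tenant_id setting name are hypothetical; the actual RLS policies and the setting they read are defined elsewhere in the codebase and are not reproduced here.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Principal {
  userId: string;
  tenantId: string | null;
}

interface TenantContext {
  tenantId: string;
}

const tenantStorage = new AsyncLocalStorage<TenantContext>();

/** Runs `work` with the principal's tenant bound for the whole async call tree. */
function runAsTenant<T>(principal: Principal, work: () => T): T {
  if (principal.tenantId === null) {
    throw new Error("No active organization on principal");
  }
  return tenantStorage.run({ tenantId: principal.tenantId }, work);
}

/** Fails closed: no tenant context means no query. */
function currentTenantId(): string {
  const ctx = tenantStorage.getStore();
  if (!ctx) throw new Error("Tenant context missing: refusing to query");
  return ctx.tenantId;
}

/**
 * Parameterized statement a DB layer could issue at transaction start so that
 * PostgreSQL RLS policies can read the tenant. The setting name is assumed.
 */
function rlsPreamble(): { text: string; values: string[] } {
  return {
    text: "SELECT set_config('app.tenant_id', $1, true)",
    values: [currentTenantId()],
  };
}
```

Nothing in this path imports an IDP: swapping the adapter changes only who fills in Principal.tenantId.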

Migration risk matrix

See AUTH_RISK_MATRIX.md §D.


FINAL CTO RECOMMENDATION

Keep Better Auth. Isolate it behind a vendor-neutral identity port now. Spend your auth budget closing the five real gaps — not on a rewrite.

Better Auth is an acceptable foundation for pilot and early production, and the integration here is above-average: session-derived identity, headers rejected as identity, RLS-backed tenancy, single role source, fail-closed guards. The library is not the risk. The risk is the absence of controls that every auth approach needs and that no library hands you for free:

  1. Service-to-service auth (gRPC is createInsecure() — the money tier is unauthenticated). Fix first.
  2. Admin MFA + the elimination of standing cross-tenant superuser.
  3. Rate limiting / brute-force protection on auth endpoints.
  4. A production SMS provider for the primary (phone-OTP) market path.
  5. An immutable, hash-chained auth-event audit log.

Rewriting auth (Option C) is not justified — six-to-nine months to reach parity while re-opening solved account-takeover bugs, and "we rolled our own auth" weakens, not strengthens, a bank/regulator review. Forking (Option B) is a contingency, not a plan. A dedicated IAM (Option D, likely Ory + WorkOS/Auth0 for enterprise SSO) is the eventual destination, adopted at the scale-up gate on a concrete trigger — and the isolation port built today makes that move a new adapter, not a rewrite.

"What I would do if this were my fintech"

  • Treat the gRPC createInsecure() finding as a sev-1 architectural defect and fix it before the next employer goes live. Unauthenticated cleartext RPC to the ledger is the thing that ends a partner-bank relationship.
  • Build the IdentityProvider / AuthEventSink ports in the next sprint. Cheap now, priceless later. It's the difference between "swap an adapter" and "rewrite auth under regulatory deadline pressure."
  • Make MFA non-negotiable for my own staff on day one, and design impersonation + dual-control before anyone builds an ad-hoc admin backdoor. Insider/operator compromise is the most likely way a fintech like this actually loses money.
  • Ship phone-OTP with a real provider + rate limiting, or not at all. A primary auth path that silently drops codes and has no brute-force protection is worse than a known-disabled feature.
  • Keep tenancy at the database (RLS) forever, decoupled from whatever IDP I run. That single decision is why an auth migration here can be low-risk. Never regress it.
  • Pick Ory for the consumer/operator realms and WorkOS/Auth0 only for enterprise-employer SSO when SSO/SCIM is demanded — never put millions of low-ARPU salary workers on per-MAU vendor pricing.

Next 30 days — stop the bleeding (the 🔴s)

  • R1: mTLS + service token on all API↔ledger / API↔gateway gRPC; remove every createInsecure(); ledger authorizes callers.
  • R3/R4: rate limiting + lockout on /api/auth/* at the Express/edge layer; basic WAF + observability on the auth subtree.
  • R2 (start): wire the TOTP plugin; make MFA mandatory for platform admins.
  • Abstraction (start): create packages/shared/identity ports; route SessionMiddleware through the IdentityProvider adapter; add the ESLint boundary (only *.adapter.ts imports better-auth).
  • Decision record: write ADR-014, "Identity foundation: Better Auth, isolated, with scale-up migration triggers."
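
For the R1 item, mTLS secures the channel, while the service token lets the ledger decide which caller it is talking to. The following is a minimal sketch of the token half only, under stated assumptions: an HMAC-signed token with a caller name and expiry, a shared secret, and an allow-list checked by the receiving service. The token format and all function names are hypothetical, and caller names are assumed not to contain a dot. A production design might prefer asymmetric signatures or identities taken from the mTLS certificate.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function signPayload(secret: string, payload: string): Buffer {
  return createHmac("sha256", secret).update(payload).digest();
}

/** Issued by the calling service; format: caller.expiresAtMs.signatureHex */
function issueServiceToken(secret: string, caller: string, expiresAtMs: number): string {
  const payload = `${caller}.${expiresAtMs}`;
  return `${payload}.${signPayload(secret, payload).toString("hex")}`;
}

/**
 * Run by the receiving service (e.g. the ledger) on every call.
 * Returns the authenticated caller name, or null if the call must be rejected.
 */
function authorizeCaller(
  secret: string,
  token: string,
  allowedCallers: ReadonlySet<string>,
  nowMs: number = Date.now(),
): string | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [caller, expiresRaw, signatureHex] = parts;

  // Verify the signature in constant time before trusting any field.
  const expected = signPayload(secret, `${caller}.${expiresRaw}`);
  const provided = Buffer.from(signatureHex, "hex");
  if (provided.length !== expected.length || !timingSafeEqual(provided, expected)) return null;

  const expiresAt = Number(expiresRaw);
  if (!Number.isFinite(expiresAt) || expiresAt <= nowMs) return null;

  // Authentication is not authorization: the caller must also be allow-listed.
  return allowedCallers.has(caller) ? caller : null;
}
```

On gRPC the token would typically travel in call metadata and be checked in a server interceptor before any handler runs.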

Next 90 days — production-grade controls

  • R2 (finish): replace unconditional platform-admin bypass with scoped, time-boxed, dual-controlled grants; design + ship impersonation with explicit start/end audit.
  • R5: real SMS provider behind SmsSender + OTP rate limiting; prod alert if LoggingSmsSender is active.
  • R6: append-only hash-chained AuthEvent log via AuthEventSink (LOGIN, MFA, ROLE_GRANT, ADMIN_ACTION, IMPERSONATION…).
  • R7: webhook nonce store (single-use within skew window).
  • R8: step-up MFA for high-risk money actions.
  • R10/R12: test/lint guardrail asserting every money route has RBAC + middleware coverage; reconsider default-applied middleware.
  • R11/R15/R17: collapse dual role model; strengthen password policy (breach check); delete legacy shared/auth/auth.ts.
  • Secrets: move BETTER_AUTH_SECRET, signing keys, partner keys to Vault/KMS with rotation runbooks.
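
The R6 item asks for an append-only, hash-chained log. A minimal in-memory sketch of the chaining idea follows; the AuthEvent fields, the class name, and the serialization format are assumptions, and a real sink would persist entries to append-only storage rather than an array.

```typescript
import { createHash } from "node:crypto";

interface AuthEvent {
  type: string; // e.g. LOGIN, MFA, ROLE_GRANT, ADMIN_ACTION, IMPERSONATION
  actorId: string;
  at: string; // ISO 8601 timestamp
}

interface ChainedEntry {
  event: AuthEvent;
  prevHash: string;
  hash: string;
}

const GENESIS_HASH = "0".repeat(64);

function entryHash(prevHash: string, event: AuthEvent): string {
  // A fixed field order keeps the serialization deterministic.
  const canonical = JSON.stringify([prevHash, event.type, event.actorId, event.at]);
  return createHash("sha256").update(canonical).digest("hex");
}

class InMemoryAuthEventSink {
  private readonly entries: ChainedEntry[] = [];

  /** Appends an event, linking it to the hash of the previous entry. */
  record(event: AuthEvent): ChainedEntry {
    const last = this.entries[this.entries.length - 1];
    const prevHash = last ? last.hash : GENESIS_HASH;
    const entry: ChainedEntry = { event: { ...event }, prevHash, hash: entryHash(prevHash, event) };
    this.entries.push(entry);
    return entry;
  }

  /** Returns copies so callers cannot mutate the stored chain. */
  snapshot(): ChainedEntry[] {
    return this.entries.map((e) => ({ ...e, event: { ...e.event } }));
  }
}

/** Recomputes every link; any edited, removed, or reordered entry breaks it. */
function verifyChain(entries: readonly ChainedEntry[]): boolean {
  let prev = GENESIS_HASH;
  for (const entry of entries) {
    if (entry.prevHash !== prev || entry.hash !== entryHash(prev, entry.event)) return false;
    prev = entry.hash;
  }
  return true;
}
```

Chaining makes tampering detectable, not impossible: truncating the tail still verifies, so the latest hash should also be anchored somewhere the log writer cannot rewrite.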

Next 180 days — scale & enterprise readiness

  • R9: Redis session cache behind the IdentityProvider adapter.
  • R14: evaluate multi-region session strategy; confirm whether a scale-up trigger has fired.
  • ABAC: introduce policy-as-code (OPA/Cedar or Ory Keto) behind AuthorizationService; implement maker/checker separation-of-duties for payroll disbursement, loan disbursement, and ledger adjustments.
  • Realms: formalize the 4-realm model; prepare enterprise-employer SSO (OIDC/SAML) + SCIM via WorkOS/Auth0 adapter.
  • Device trust + passkeys roadmap for admins (phishing-resistant MFA).
  • Migration readiness review: if a trigger fired, execute Track 2 Phase 1–2 (shadow IDP + lazy credential migration). If not, document the deferral and re-evaluate quarterly.
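
The maker/checker item can be illustrated with a small policy function of the kind that would sit behind AuthorizationService. This is a sketch under assumptions: the action names, role names, and the makerId context field are hypothetical, and a policy engine such as OPA or Cedar would express the same rule declaratively.

```typescript
interface Principal {
  userId: string;
  roles: readonly string[];
}

interface ApprovalContext {
  makerId: string; // the user who created the request being approved
}

// Hypothetical mapping from a disbursement action to the role allowed to approve it.
const CHECKER_ROLE_BY_ACTION: Readonly<Record<string, string>> = {
  "payroll:disburse": "payroll_approver",
  "loan:disburse": "loan_approver",
  "ledger:adjust": "ledger_approver",
};

function canApprove(principal: Principal, action: string, ctx: ApprovalContext): boolean {
  const requiredRole = CHECKER_ROLE_BY_ACTION[action];
  if (requiredRole === undefined) return false; // unknown action: fail closed
  if (principal.userId === ctx.makerId) return false; // the maker can never be the checker
  return principal.roles.includes(requiredRole);
}
```

The separation-of-duties check comes before the role check on purpose: holding the approver role must never be enough to approve one's own request.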

Cadence note: re-run this review at each gate (post-pilot, pre-scale, pre-enterprise-employer). The triggers in LONG_TERM_IAM §4 are the decision criteria — let events, not calendars, drive the A→D move.