AUTH_SYSTEM_REVIEW.md — DemozPay Identity & Access Architecture Review

⚠️ STALE (2026-05-29 snapshot). Several of this review's scariest findings have since been built and are no longer accurate: MFA/TOTP (better-auth.factory.ts + AdminMfaGuard), rate limiting on auth endpoints (auth-rate-limit.ts + spec), and the append-only auth event log (prisma-auth-event-sink.ts) are all Live; cross-kind SoD and platform-scoped roles also shipped. The current real gaps are elsewhere — thin test coverage of the RBAC enforcement core (plan E1) and the SessionMiddleware.forRoutes drift (plan A5). Verify any "not implemented" claim below against code before acting.

Status: Review — architecture-grade analysis (2026-05-29; partially overtaken — see banner). Reviewer roles: Principal Security Architect · Fintech CTO · IAM reviewer · Multi-tenant platform architect. Scope: the auth stack as it existed on restructure/modular-monolith-scaffold (2026-05-29). Companion docs: AUTH_RISK_MATRIX.md · LONG_TERM_IAM_ARCHITECTURE.md · AUTH_MIGRATION_STRATEGY.md

0. TL;DR for the busy CTO

Better Auth is not your problem. Your problem is everything around it.

The library choice (Better Auth) is acceptable for pilot and early production, and the integration is, frankly, better than most fintech codebases at this stage — session-derived identity, headers rejected as identity sources, RLS at the financial tier, single source of role truth on Member, fail-closed guards. Whoever did the A1/A2/A4 hardening sweep understood the threat model.

But there are five real risks that have nothing to do with which auth library you picked, and four of them will hurt you regardless of whether you keep, fork, or replace Better Auth:

Service-to-service auth is absent. API → Go ledger and API → integration-gateway use credentials.createInsecure(). No TLS, no mTLS, no service identity. This is the single most serious finding. A regulator or partner-bank security review fails here, not at Better Auth.
Admin / platform-operator auth is a single point of total compromise. AdminProfile.checkType === 'PLATFORM' grants cross-tenant superuser with no MFA requirement, no step-up auth, no separate realm. One phished operator password = full platform breach across every employer's payroll.
The primary auth path for the target market (phone OTP) is not production-ready. LoggingSmsSender drops OTPs in prod. There is no rate limiting / brute-force lockout wired on OTP or password endpoints.
The auth subtree bypasses the Nest pipeline. /api/auth/* is mounted on raw Express before setGlobalPrefix, so it sits outside Nest guards, validation pipes, rate limiters, and observability interceptors. Whatever protection you add at the Nest layer does not protect login.
Webhook HMAC has a replay window. The nonce is signed but never stored/checked, so a captured signed webhook is replayable for 5 minutes.

Recommendation in one line: Keep Better Auth, isolate it behind a thin internal IAM abstraction now (so it is replaceable later without a domain rewrite), and spend your auth budget on the five gaps above — not on a rewrite. Re-evaluate a move to a dedicated IAM (Ory/Keycloak) at the scale-up gate, triggered by concrete events (first partner-bank security questionnaire, first multi-region requirement, OIDC-federation demand), not by speculation.

1. Is Better Auth acceptable for…?

Context	Verdict	Why
Pilot	✅ Yes	Session model, org/member, RLS-backed tenancy are sound. The gaps below are pilot-tolerable if explicitly risk-accepted and the pilot is low-volume / known employers.
Production (early, single region, <50 employers)	🟡 Conditional	Acceptable only after closing: S2S auth, admin MFA, OTP provider + rate limiting. The library is fine; the surrounding controls are not yet.
Enterprise fintech scale	🟡 Conditional, with caveats	Per-request `getSession()` DB hit and per-RBAC-request `AdminProfile`+`Member` lookups will need caching. The org/member model is coarse (3 roles). Workable but you will be hardening Better Auth, not leaning on it.
Regulated banking integrations	🟠 Not yet	Not a Better Auth deficiency — it's the absence of M2M/service identity, admin MFA, and immutable auth audit trails. A bank's TPRM/security review fails on S2S `createInsecure()` and admin-without-MFA today.
Multi-bank future	🟡 Neutral	Better Auth is irrelevant here. Multi-bank is an integration-gateway + partner-credential-vault problem (see LONG_TERM_IAM §M2M / partner auth). Better Auth neither helps nor hurts.
Multi-region future	🟠 Becomes a constraint	DB-backed opaque sessions mean session lookups must hit a region-local replica or a shared session store; Better Auth's session model is single-writer-DB oriented. Solvable, but this is the most likely forcing function for a future IAM migration.

Bottom line: Better Auth clears pilot and early-production bars. The bars it does not clear are bars that custom auth and Better Auth fail equally — they are control gaps, not library gaps.

2. Real risks of CONTINUING with Better Auth

Class	Risk	Severity	Notes
Operational	Per-request `auth.api.getSession()` hits Postgres on every protected route (`session.middleware.ts:44`). No session cache.	Med	Becomes a latency + DB-connection bottleneck at scale. Mitigate with a short-TTL session cache (Redis) behind the abstraction.
Operational	Auth subtree mounted on raw Express (`main.ts:41-42`) — no shared rate limiting, validation, or tracing on login/OTP/reset.	High	You cannot add a global Nest control and assume login is covered. Every protection must be added at the Better Auth config or Express layer explicitly.
Security	No rate limiting / lockout on password + OTP endpoints observed. OTP brute-force + credential stuffing are open.	High	Better Auth supports rate limiting; it is not configured here. This is the highest-likelihood real-world attack.
Security	Phone OTP is the stated primary path but `LoggingSmsSender` drops codes in prod (`sms-sender.ts:30`). Users either can't log in or you ship with a placeholder.	High	Functional + security gap: shipping phone-OTP without a real provider and without rate limiting is worse than not shipping it.
Security	`requireEmailVerification` only `true` in production (`better-auth.factory.ts:46`) — fine — but no breach-password check, no complexity policy beyond min-length 8.	Low–Med	Acceptable for pilot; tighten before scale.
Maintenance	Dual role model: `Member.role` (owner/admin/member) and legacy `Role`+`permissions Json` table (`schema.prisma:83-103`) that nothing enforces. Two sources of "what is a role".	Med	Tech-debt and future confusion. Pick one. ABAC needs more than 3 coarse roles eventually.
Maintenance	Middleware coverage is an explicit allow-list of 4 controllers (`app.module.ts:72-79`). New controllers get no session/tenant context unless manually added.	Med	Fail-closed (AuthGuard 401s without `req.user`), so it's safe-by-accident, but it's a footgun: a new money-moving controller silently 401s until someone notices, and a `@Public()` mistake on it bypasses tenancy entirely.
Ecosystem	Better Auth is young and fast-moving; breaking changes between minor versions are common. Plugin API (organization, phoneNumber, twoFactor) is where churn concentrates.	Med	You already pin behaviour assumptions (e.g. `Organization.id == Business.id` invariant relies on not using the plugin's create-org endpoint). Upgrades need a regression suite.
Vendor	Open-source, self-hosted — low lock-in on hosting, but schema + session-shape lock-in is real: your `Session`, `Account`, `Member`, `Verification`, `TwoFactor` tables follow Better Auth's expected shapes.	Med	This is the dimension the abstraction layer (§6) directly addresses.
Scaling	Opaque DB sessions don't federate across regions without a shared/replicated session store.	Med–High (future)	The most likely migration trigger.
Compliance	No dedicated, immutable, queryable auth event log (logins, MFA changes, role grants, admin actions, impersonation). Session table is mutable and cascades.	High	Regulators expect "who authenticated, when, from where, and who granted what role" as tamper-evident history. `Session.ipAddress`/`userAgent` exist but are mutable and not an audit trail.

3. Real risks of REWRITING auth ourselves (Option C)

This is where fintechs bleed out. Brutally honest accounting:

Class	Risk	Reality
Security	You will reintroduce bugs the ecosystem solved years ago: timing-unsafe token compare, weak password hashing params, CSRF on auth endpoints, session fixation, OTP enumeration.	The current code already gets several of these right by inheriting Better Auth (CSRF via `trustedOrigins`, `timingSafeEqual` in HMAC). A rewrite re-opens all of them.
Implementation complexity	Sign-up, sign-in, email verify, password reset, OTP, session lifecycle, refresh/sliding, multi-org membership, invitations — that's 6–9 months of one senior engineer to reach parity, before MFA/OIDC.	Parity, not advantage. You'd spend a half-year to arrive where you already are.
Hidden edge cases	Concurrent session refresh, clock skew on tokens, partial email-verify states, OTP race on resend, invitation/accept races, org-switch mid-session, password-reset token reuse.	These are exactly the bugs that cause account takeover. Each is a CVE waiting in custom code.
Long-term maintenance	You now own a security-critical product forever. Every dependency CVE, every new MFA method, every OIDC quirk is your on-call.	This is a permanent tax on a team that should be building payroll/EWA/lending.
Audit/compliance burden	You must prove your hand-rolled crypto + session handling to auditors with no third-party attestation to lean on.	"We built our own auth" is a red flag to bank TPRM teams, not a selling point.
Session management	Revocation, device listing, idle vs absolute timeout, "log out everywhere", step-up — all bespoke.	Better Auth gives you opaque-token revocation for free.
MFA	TOTP enrolment, backup codes, recovery, SMS fallback, WebAuthn later.	The single hardest thing to get right. Do not hand-roll.
Password reset	The #1 account-takeover vector industry-wide. Token entropy, single-use, expiry, enumeration resistance, rate limiting.	Each detail is an incident if wrong.
OAuth/OIDC	If you ever federate (employer SSO, gov ID), you become an OIDC RP/OP. The spec is a minefield (nonce, state, PKCE, token validation, JWKS rotation).	This alone justifies never building custom — buy/adopt an OP.
Token revocation	If you move to JWTs for performance, revocation needs a denylist or short TTL + refresh. Custom = subtle bugs.	Better Auth's opaque sessions sidestep this entirely.
Insider attack surface	Custom admin tooling around auth tends to grow god-mode endpoints.	Larger bespoke surface = larger insider risk.
Tenancy leakage	One missing `tenant_id` filter in a hand-rolled query = cross-employer data breach.	You currently get defense-in-depth from Postgres RLS. A rewrite must not regress that.

Verdict: rewriting auth from scratch is not justified. There is no requirement here that a rewrite satisfies and a managed/adopted IAM does not. Reserve "custom" for the thin glue (the abstraction layer), never the primitives.

5. Current implementation review — module classification

Classification key: SAFE (sound, keep) · PARTIAL (works, has gaps) · DANGEROUS (active risk) · STUB (placeholder) · TECH-DEBT (works, should be removed/consolidated).

Module / concern	File	Class	Finding
AuthGuard (global authn gate)	`auth/auth.guard.ts`	✅ SAFE	Fail-closed, default-deny, `@Public()` opt-out, cheap (`req.user` check only). Textbook.
SessionMiddleware	`auth/session.middleware.ts`	🟡 PARTIAL	Correctly non-enforcing + fail-open-on-error-but-guard-fails-closed. Gap: per-request DB `getSession()`, no cache. `role` correctly removed from `req.user`.
OrgRoleGuard (RBAC)	`auth/org-role.guard.ts`	🟡 PARTIAL	Single source of role truth (`Member.role`), headers never consulted, well-logged with masked IDs. Gaps: (a) 2 DB queries/request, no cache; (b) platform-admin bypasses all org checks → very broad; (c) only 3 coarse roles.
TenantContextMiddleware	`tenant/tenant-context.middleware.ts`	✅ SAFE	Rejects `x-actor-id`/`x-tenant-id` as identity (A1/A4), tenant+actor from session only, fail-closed via ALS. Strong.
Tenant isolation (RLS)	`prisma/schema.prisma:128-145`, `RLS.md`	✅ SAFE	Financial tables RLS-forced; identity tables intentionally cross-tenant. Defense-in-depth done right.
Role model	`schema.prisma` + `shared-auth/rbac.ts`	🟠 TECH-DEBT	Dual model: active `Member.role` (3 strings) vs legacy `Role`+`permissions Json` (defined, unenforced). Consolidate. Coarse roles won't survive payroll/lending separation-of-duties needs.
Org model	`schema.prisma:241-279`	✅ SAFE	`Organization.id == Business.id` shared-PK invariant is elegant and makes session→tenant a direct map. Documented well. Keep.
`req.user` usage	`session.middleware.ts:19-24`	✅ SAFE	Minimal stable contract (`id`, `email`, `emailVerified`, `businessId`). No role smuggled in. Good.
Actor attribution	ALS via `runWithTenant`	✅ SAFE	Actor = session `User.id`, never a header. Correct for audit. Gap: see auth-event-log compliance hole (no immutable login/role-grant trail).
Auth guards registration	`auth/auth.module.ts`	✅ SAFE	AuthGuard then OrgRoleGuard ordering via two `APP_GUARD`s is correct and documented.
Permission enforcement	controllers + `rbac.ts`	🟡 PARTIAL	Declarative `@RequireOrgRole`/`@RequirePlatformAdmin` is clean. Gap: enforcement is per-route opt-in; an undecorated protected route is "any authenticated user" — easy to forget on a money route.
Better Auth integration / mount	`main.ts:39-42`, `better-auth.factory.ts`	🟠 DANGEROUS	Auth subtree on raw Express outside Nest pipeline → no shared rate limiting / validation / tracing on login, OTP, reset. Highest-traffic-attack endpoints are the least protected.
Phone OTP sender	`auth/sms-sender.ts`	🔴 STUB	`LoggingSmsSender` only. Prod drops OTPs. No real provider. Primary market auth path is non-functional + unmonitored in prod.
2FA	`schema.prisma:224-234` (`TwoFactor`), factory	🔴 STUB	Table exists; plugin not wired in `better-auth.factory.ts` (only `organization` + `phoneNumber`). MFA is effectively absent platform-wide, including for admins.
Platform-admin auth	`org-role.guard.ts:85-129`, `AdminProfile`	🟠 DANGEROUS	Cross-tenant superuser via a single `checkType==='PLATFORM'` row, no MFA, no step-up, no separate realm, no impersonation audit. Single phished password = total breach.
Bank webhook auth (HMAC)	`integration/bank-webhook.controller.ts`, `hmac.ts`	🟡 PARTIAL	Fail-closed on missing key, raw-body HMAC, `timingSafeEqual`, 5-min skew, 401-without-reason. Solid. Gap: nonce is signed but never stored/checked → replay within the 5-min window succeeds.
Service-to-service (gRPC) auth	`ewa/ledger.grpc-client.ts:102`, `ewa/integration-gateway.grpc-client.ts:142`	🔴 DANGEROUS	`credentials.createInsecure()` — no TLS, no mTLS, no service token between API ↔ ledger ↔ gateway. The money-truth tier has no authenticated channel. Top finding.
Legacy JWT primitives	`shared/auth` package	🟠 TECH-DEBT	`FakeJwtVerifier`/`AuthClaims` marked deprecated; `shared/auth` flagged `Deprecated` in CLAUDE.md but still exports live RBAC decorators. Split: keep `rbac.ts`, delete `auth.ts`.
Webhook controller as `@Public()`	`bank-webhook.controller.ts:47`	🟡 PARTIAL	Correct (signature is the auth) but it's outside the tenant middleware allow-list and self-asserts tenant from payload — acceptable given HMAC, but the trust boundary should be documented as "partner-asserted tenant, verified by signature + reference lookup".

Dangerous anti-patterns called out explicitly

createInsecure() on the ledger/gateway gRPC channels — unauthenticated, unencrypted money-tier RPC.
Platform-admin = unconditional cross-tenant bypass with no MFA — org-role.guard.ts:94-129.
Auth endpoints outside the application security pipeline — main.ts:41-42.
MFA table without MFA plugin — false sense of "2FA exists".
Replayable signed webhook — nonce signed, never enforced for uniqueness.
Per-route opt-in RBAC on money endpoints — forgetting a decorator silently downgrades to "any authenticated user".

Trust boundaries (current)

                          ┌─────────────────────── UNTRUSTED ───────────────────────┐
  Browser / mobile  ─────▶│  /api/auth/*   (RAW EXPRESS, pre-Nest)                   │
  (employee/employer)     │   └─ Better Auth handler: signup/in/out, OTP, session    │  ← no shared rate-limit / pipe / trace
                          └──────────────────────────────┬──────────────────────────┘
                                                          │ Set-Cookie (opaque session token)
  Browser / mobile  ─────▶ /api/* (Nest)                  ▼
       │ cookie            ┌──────────────── TRUST BOUNDARY: session ────────────────┐
       │                   │ SessionMiddleware  → getSession() → req.user            │  (DB hit/req)
       │                   │ TenantContextMiddleware → ALS{tenantId,actorId}         │  (headers rejected)
       │                   │ AuthGuard (authn) → OrgRoleGuard (authz, Member/Admin)  │
       │                   └──────────────────────────────┬──────────────────────────┘
       │                                                  ▼
       │                                   Domain services (EWA / Lending / Business / Employee)
       │                                                  │
       │                                   Postgres  ◀── RLS (app.tenant_id) ── financial tables
       │                                                  │
       │                     ┌──────────── WEAK BOUNDARY: S2S (NO AUTH/TLS) ─────────┐
       │                     │  gRPC createInsecure()                                 │
       │                     ▼                                                        ▼
   Partner bank ─HMAC▶ integration-gateway (Go)                                  ledger (Go)
        ▲   webhook         │  ← HMAC, 5-min skew, NONCE NOT ENFORCED (replayable)
        └────────────────────┘

Boundaries that hold: browser→Nest session, tenant isolation at DB, header-spoofing rejection, partner→API signature. Boundaries that leak: auth endpoints below the Nest controls; API↔Go services (no identity/TLS); admin realm (no MFA); webhook replay window.

See AUTH_RISK_MATRIX.md for likelihood × impact scoring and remediation ownership, and LONG_TERM_IAM_ARCHITECTURE.md for the target design.

0. TL;DR for the busy CTO​

1. Is Better Auth acceptable for…?​

2. Real risks of CONTINUING with Better Auth​

3. Real risks of REWRITING auth ourselves (Option C)​

5. Current implementation review — module classification​

Dangerous anti-patterns called out explicitly​

Trust boundaries (current)​