Skip to main content

AUTH_SYSTEM_REVIEW.md — DemozPay Identity & Access Architecture Review

⚠️ STALE (2026-05-29 snapshot). Several of this review's scariest findings have since been built and are no longer accurate: MFA/TOTP (better-auth.factory.ts + AdminMfaGuard), rate limiting on auth endpoints (auth-rate-limit.ts + spec), and the append-only auth event log (prisma-auth-event-sink.ts) are all Live; cross-kind SoD and platform-scoped roles also shipped. The current real gaps are elsewhere — thin test coverage of the RBAC enforcement core (plan E1) and the SessionMiddleware.forRoutes drift (plan A5). Verify any "not implemented" claim below against code before acting.

Status: Review — architecture-grade analysis (2026-05-29; partially overtaken — see banner). Reviewer roles: Principal Security Architect · Fintech CTO · IAM reviewer · Multi-tenant platform architect. Scope: the auth stack as it existed on restructure/modular-monolith-scaffold (2026-05-29). Companion docs: AUTH_RISK_MATRIX.md · LONG_TERM_IAM_ARCHITECTURE.md · AUTH_MIGRATION_STRATEGY.md


0. TL;DR for the busy CTO

Better Auth is not your problem. Your problem is everything around it.

The library choice (Better Auth) is acceptable for pilot and early production, and the integration is, frankly, better than most fintech codebases at this stage — session-derived identity, headers rejected as identity sources, RLS at the financial tier, single source of role truth on Member, fail-closed guards. Whoever did the A1/A2/A4 hardening sweep understood the threat model.

But there are five real risks that have nothing to do with which auth library you picked, and four of them will hurt you regardless of whether you keep, fork, or replace Better Auth:

  1. Service-to-service auth is absent. API → Go ledger and API → integration-gateway use credentials.createInsecure(). No TLS, no mTLS, no service identity. This is the single most serious finding. A regulator or partner-bank security review fails here, not at Better Auth.
  2. Admin / platform-operator auth is a single point of total compromise. AdminProfile.checkType === 'PLATFORM' grants cross-tenant superuser with no MFA requirement, no step-up auth, no separate realm. One phished operator password = full platform breach across every employer's payroll.
  3. The primary auth path for the target market (phone OTP) is not production-ready. LoggingSmsSender drops OTPs in prod. There is no rate limiting / brute-force lockout wired on OTP or password endpoints.
  4. The auth subtree bypasses the Nest pipeline. /api/auth/* is mounted on raw Express before setGlobalPrefix, so it sits outside Nest guards, validation pipes, rate limiters, and observability interceptors. Whatever protection you add at the Nest layer does not protect login.
  5. Webhook HMAC has a replay window. The nonce is signed but never stored/checked, so a captured signed webhook is replayable for 5 minutes.

Recommendation in one line: Keep Better Auth, isolate it behind a thin internal IAM abstraction now (so it is replaceable later without a domain rewrite), and spend your auth budget on the five gaps above — not on a rewrite. Re-evaluate a move to a dedicated IAM (Ory/Keycloak) at the scale-up gate, triggered by concrete events (first partner-bank security questionnaire, first multi-region requirement, OIDC-federation demand), not by speculation.


1. Is Better Auth acceptable for…?

ContextVerdictWhy
PilotYesSession model, org/member, RLS-backed tenancy are sound. The gaps below are pilot-tolerable if explicitly risk-accepted and the pilot is low-volume / known employers.
Production (early, single region, <50 employers)🟡 ConditionalAcceptable only after closing: S2S auth, admin MFA, OTP provider + rate limiting. The library is fine; the surrounding controls are not yet.
Enterprise fintech scale🟡 Conditional, with caveatsPer-request getSession() DB hit and per-RBAC-request AdminProfile+Member lookups will need caching. The org/member model is coarse (3 roles). Workable but you will be hardening Better Auth, not leaning on it.
Regulated banking integrations🟠 Not yetNot a Better Auth deficiency — it's the absence of M2M/service identity, admin MFA, and immutable auth audit trails. A bank's TPRM/security review fails on S2S createInsecure() and admin-without-MFA today.
Multi-bank future🟡 NeutralBetter Auth is irrelevant here. Multi-bank is an integration-gateway + partner-credential-vault problem (see LONG_TERM_IAM §M2M / partner auth). Better Auth neither helps nor hurts.
Multi-region future🟠 Becomes a constraintDB-backed opaque sessions mean session lookups must hit a region-local replica or a shared session store; Better Auth's session model is single-writer-DB oriented. Solvable, but this is the most likely forcing function for a future IAM migration.

Bottom line: Better Auth clears pilot and early-production bars. The bars it does not clear are bars that custom auth and Better Auth fail equally — they are control gaps, not library gaps.


2. Real risks of CONTINUING with Better Auth

ClassRiskSeverityNotes
OperationalPer-request auth.api.getSession() hits Postgres on every protected route (session.middleware.ts:44). No session cache.MedBecomes a latency + DB-connection bottleneck at scale. Mitigate with a short-TTL session cache (Redis) behind the abstraction.
OperationalAuth subtree mounted on raw Express (main.ts:41-42) — no shared rate limiting, validation, or tracing on login/OTP/reset.HighYou cannot add a global Nest control and assume login is covered. Every protection must be added at the Better Auth config or Express layer explicitly.
SecurityNo rate limiting / lockout on password + OTP endpoints observed. OTP brute-force + credential stuffing are open.HighBetter Auth supports rate limiting; it is not configured here. This is the highest-likelihood real-world attack.
SecurityPhone OTP is the stated primary path but LoggingSmsSender drops codes in prod (sms-sender.ts:30). Users either can't log in or you ship with a placeholder.HighFunctional + security gap: shipping phone-OTP without a real provider and without rate limiting is worse than not shipping it.
SecurityrequireEmailVerification only true in production (better-auth.factory.ts:46) — fine — but no breach-password check, no complexity policy beyond min-length 8.Low–MedAcceptable for pilot; tighten before scale.
MaintenanceDual role model: Member.role (owner/admin/member) and legacy Role+permissions Json table (schema.prisma:83-103) that nothing enforces. Two sources of "what is a role".MedTech-debt and future confusion. Pick one. ABAC needs more than 3 coarse roles eventually.
MaintenanceMiddleware coverage is an explicit allow-list of 4 controllers (app.module.ts:72-79). New controllers get no session/tenant context unless manually added.MedFail-closed (AuthGuard 401s without req.user), so it's safe-by-accident, but it's a footgun: a new money-moving controller silently 401s until someone notices, and a @Public() mistake on it bypasses tenancy entirely.
EcosystemBetter Auth is young and fast-moving; breaking changes between minor versions are common. Plugin API (organization, phoneNumber, twoFactor) is where churn concentrates.MedYou already pin behaviour assumptions (e.g. Organization.id == Business.id invariant relies on not using the plugin's create-org endpoint). Upgrades need a regression suite.
VendorOpen-source, self-hosted — low lock-in on hosting, but schema + session-shape lock-in is real: your Session, Account, Member, Verification, TwoFactor tables follow Better Auth's expected shapes.MedThis is the dimension the abstraction layer (§6) directly addresses.
ScalingOpaque DB sessions don't federate across regions without a shared/replicated session store.Med–High (future)The most likely migration trigger.
ComplianceNo dedicated, immutable, queryable auth event log (logins, MFA changes, role grants, admin actions, impersonation). Session table is mutable and cascades.HighRegulators expect "who authenticated, when, from where, and who granted what role" as tamper-evident history. Session.ipAddress/userAgent exist but are mutable and not an audit trail.

3. Real risks of REWRITING auth ourselves (Option C)

This is where fintechs bleed out. Brutally honest accounting:

ClassRiskReality
SecurityYou will reintroduce bugs the ecosystem solved years ago: timing-unsafe token compare, weak password hashing params, CSRF on auth endpoints, session fixation, OTP enumeration.The current code already gets several of these right by inheriting Better Auth (CSRF via trustedOrigins, timingSafeEqual in HMAC). A rewrite re-opens all of them.
Implementation complexitySign-up, sign-in, email verify, password reset, OTP, session lifecycle, refresh/sliding, multi-org membership, invitations — that's 6–9 months of one senior engineer to reach parity, before MFA/OIDC.Parity, not advantage. You'd spend a half-year to arrive where you already are.
Hidden edge casesConcurrent session refresh, clock skew on tokens, partial email-verify states, OTP race on resend, invitation/accept races, org-switch mid-session, password-reset token reuse.These are exactly the bugs that cause account takeover. Each is a CVE waiting in custom code.
Long-term maintenanceYou now own a security-critical product forever. Every dependency CVE, every new MFA method, every OIDC quirk is your on-call.This is a permanent tax on a team that should be building payroll/EWA/lending.
Audit/compliance burdenYou must prove your hand-rolled crypto + session handling to auditors with no third-party attestation to lean on."We built our own auth" is a red flag to bank TPRM teams, not a selling point.
Session managementRevocation, device listing, idle vs absolute timeout, "log out everywhere", step-up — all bespoke.Better Auth gives you opaque-token revocation for free.
MFATOTP enrolment, backup codes, recovery, SMS fallback, WebAuthn later.The single hardest thing to get right. Do not hand-roll.
Password resetThe #1 account-takeover vector industry-wide. Token entropy, single-use, expiry, enumeration resistance, rate limiting.Each detail is an incident if wrong.
OAuth/OIDCIf you ever federate (employer SSO, gov ID), you become an OIDC RP/OP. The spec is a minefield (nonce, state, PKCE, token validation, JWKS rotation).This alone justifies never building custom — buy/adopt an OP.
Token revocationIf you move to JWTs for performance, revocation needs a denylist or short TTL + refresh. Custom = subtle bugs.Better Auth's opaque sessions sidestep this entirely.
Insider attack surfaceCustom admin tooling around auth tends to grow god-mode endpoints.Larger bespoke surface = larger insider risk.
Tenancy leakageOne missing tenant_id filter in a hand-rolled query = cross-employer data breach.You currently get defense-in-depth from Postgres RLS. A rewrite must not regress that.

Verdict: rewriting auth from scratch is not justified. There is no requirement here that a rewrite satisfies and a managed/adopted IAM does not. Reserve "custom" for the thin glue (the abstraction layer), never the primitives.


5. Current implementation review — module classification

Classification key: SAFE (sound, keep) · PARTIAL (works, has gaps) · DANGEROUS (active risk) · STUB (placeholder) · TECH-DEBT (works, should be removed/consolidated).

Module / concernFileClassFinding
AuthGuard (global authn gate)auth/auth.guard.tsSAFEFail-closed, default-deny, @Public() opt-out, cheap (req.user check only). Textbook.
SessionMiddlewareauth/session.middleware.ts🟡 PARTIALCorrectly non-enforcing + fail-open-on-error-but-guard-fails-closed. Gap: per-request DB getSession(), no cache. role correctly removed from req.user.
OrgRoleGuard (RBAC)auth/org-role.guard.ts🟡 PARTIALSingle source of role truth (Member.role), headers never consulted, well-logged with masked IDs. Gaps: (a) 2 DB queries/request, no cache; (b) platform-admin bypasses all org checks → very broad; (c) only 3 coarse roles.
TenantContextMiddlewaretenant/tenant-context.middleware.tsSAFERejects x-actor-id/x-tenant-id as identity (A1/A4), tenant+actor from session only, fail-closed via ALS. Strong.
Tenant isolation (RLS)prisma/schema.prisma:128-145, RLS.mdSAFEFinancial tables RLS-forced; identity tables intentionally cross-tenant. Defense-in-depth done right.
Role modelschema.prisma + shared-auth/rbac.ts🟠 TECH-DEBTDual model: active Member.role (3 strings) vs legacy Role+permissions Json (defined, unenforced). Consolidate. Coarse roles won't survive payroll/lending separation-of-duties needs.
Org modelschema.prisma:241-279SAFEOrganization.id == Business.id shared-PK invariant is elegant and makes session→tenant a direct map. Documented well. Keep.
req.user usagesession.middleware.ts:19-24SAFEMinimal stable contract (id, email, emailVerified, businessId). No role smuggled in. Good.
Actor attributionALS via runWithTenantSAFEActor = session User.id, never a header. Correct for audit. Gap: see auth-event-log compliance hole (no immutable login/role-grant trail).
Auth guards registrationauth/auth.module.tsSAFEAuthGuard then OrgRoleGuard ordering via two APP_GUARDs is correct and documented.
Permission enforcementcontrollers + rbac.ts🟡 PARTIALDeclarative @RequireOrgRole/@RequirePlatformAdmin is clean. Gap: enforcement is per-route opt-in; an undecorated protected route is "any authenticated user" — easy to forget on a money route.
Better Auth integration / mountmain.ts:39-42, better-auth.factory.ts🟠 DANGEROUSAuth subtree on raw Express outside Nest pipeline → no shared rate limiting / validation / tracing on login, OTP, reset. Highest-traffic-attack endpoints are the least protected.
Phone OTP senderauth/sms-sender.ts🔴 STUBLoggingSmsSender only. Prod drops OTPs. No real provider. Primary market auth path is non-functional + unmonitored in prod.
2FAschema.prisma:224-234 (TwoFactor), factory🔴 STUBTable exists; plugin not wired in better-auth.factory.ts (only organization + phoneNumber). MFA is effectively absent platform-wide, including for admins.
Platform-admin authorg-role.guard.ts:85-129, AdminProfile🟠 DANGEROUSCross-tenant superuser via a single checkType==='PLATFORM' row, no MFA, no step-up, no separate realm, no impersonation audit. Single phished password = total breach.
Bank webhook auth (HMAC)integration/bank-webhook.controller.ts, hmac.ts🟡 PARTIALFail-closed on missing key, raw-body HMAC, timingSafeEqual, 5-min skew, 401-without-reason. Solid. Gap: nonce is signed but never stored/checked → replay within the 5-min window succeeds.
Service-to-service (gRPC) authewa/ledger.grpc-client.ts:102, ewa/integration-gateway.grpc-client.ts:142🔴 DANGEROUScredentials.createInsecure() — no TLS, no mTLS, no service token between API ↔ ledger ↔ gateway. The money-truth tier has no authenticated channel. Top finding.
Legacy JWT primitivesshared/auth package🟠 TECH-DEBTFakeJwtVerifier/AuthClaims marked deprecated; shared/auth flagged Deprecated in CLAUDE.md but still exports live RBAC decorators. Split: keep rbac.ts, delete auth.ts.
Webhook controller as @Public()bank-webhook.controller.ts:47🟡 PARTIALCorrect (signature is the auth) but it's outside the tenant middleware allow-list and self-asserts tenant from payload — acceptable given HMAC, but the trust boundary should be documented as "partner-asserted tenant, verified by signature + reference lookup".

Dangerous anti-patterns called out explicitly

  1. createInsecure() on the ledger/gateway gRPC channels — unauthenticated, unencrypted money-tier RPC.
  2. Platform-admin = unconditional cross-tenant bypass with no MFAorg-role.guard.ts:94-129.
  3. Auth endpoints outside the application security pipelinemain.ts:41-42.
  4. MFA table without MFA plugin — false sense of "2FA exists".
  5. Replayable signed webhook — nonce signed, never enforced for uniqueness.
  6. Per-route opt-in RBAC on money endpoints — forgetting a decorator silently downgrades to "any authenticated user".

Trust boundaries (current)

┌─────────────────────── UNTRUSTED ───────────────────────┐
Browser / mobile ─────▶│ /api/auth/* (RAW EXPRESS, pre-Nest) │
(employee/employer) │ └─ Better Auth handler: signup/in/out, OTP, session │ ← no shared rate-limit / pipe / trace
└──────────────────────────────┬──────────────────────────┘
│ Set-Cookie (opaque session token)
Browser / mobile ─────▶ /api/* (Nest) ▼
│ cookie ┌──────────────── TRUST BOUNDARY: session ────────────────┐
│ │ SessionMiddleware → getSession() → req.user │ (DB hit/req)
│ │ TenantContextMiddleware → ALS{tenantId,actorId} │ (headers rejected)
│ │ AuthGuard (authn) → OrgRoleGuard (authz, Member/Admin) │
│ └──────────────────────────────┬──────────────────────────┘
│ ▼
│ Domain services (EWA / Lending / Business / Employee)
│ │
│ Postgres ◀── RLS (app.tenant_id) ── financial tables
│ │
│ ┌──────────── WEAK BOUNDARY: S2S (NO AUTH/TLS) ─────────┐
│ │ gRPC createInsecure() │
│ ▼ ▼
Partner bank ─HMAC▶ integration-gateway (Go) ledger (Go)
▲ webhook │ ← HMAC, 5-min skew, NONCE NOT ENFORCED (replayable)
└────────────────────┘

Boundaries that hold: browser→Nest session, tenant isolation at DB, header-spoofing rejection, partner→API signature. Boundaries that leak: auth endpoints below the Nest controls; API↔Go services (no identity/TLS); admin realm (no MFA); webhook replay window.

See AUTH_RISK_MATRIX.md for likelihood × impact scoring and remediation ownership, and LONG_TERM_IAM_ARCHITECTURE.md for the target design.