AUTH_SYSTEM_REVIEW.md — DemozPay Identity & Access Architecture Review
⚠️ STALE (2026-05-29 snapshot). Several of this review's scariest findings have since been built and are no longer accurate: MFA/TOTP (
better-auth.factory.ts+AdminMfaGuard), rate limiting on auth endpoints (auth-rate-limit.ts+ spec), and the append-only auth event log (prisma-auth-event-sink.ts) are all Live; cross-kind SoD and platform-scoped roles also shipped. The current real gaps are elsewhere — thin test coverage of the RBAC enforcement core (plan E1) and theSessionMiddleware.forRoutesdrift (plan A5). Verify any "not implemented" claim below against code before acting.Status:
Review— architecture-grade analysis (2026-05-29; partially overtaken — see banner). Reviewer roles: Principal Security Architect · Fintech CTO · IAM reviewer · Multi-tenant platform architect. Scope: the auth stack as it existed onrestructure/modular-monolith-scaffold(2026-05-29). Companion docs:AUTH_RISK_MATRIX.md·LONG_TERM_IAM_ARCHITECTURE.md·AUTH_MIGRATION_STRATEGY.md
0. TL;DR for the busy CTO
Better Auth is not your problem. Your problem is everything around it.
The library choice (Better Auth) is acceptable for pilot and early production, and the integration is, frankly, better than most fintech codebases at this stage — session-derived identity, headers rejected as identity sources, RLS at the financial tier, single source of role truth on Member, fail-closed guards. Whoever did the A1/A2/A4 hardening sweep understood the threat model.
But there are five real risks that have nothing to do with which auth library you picked, and four of them will hurt you regardless of whether you keep, fork, or replace Better Auth:
- Service-to-service auth is absent. API → Go ledger and API → integration-gateway use
credentials.createInsecure(). No TLS, no mTLS, no service identity. This is the single most serious finding. A regulator or partner-bank security review fails here, not at Better Auth. - Admin / platform-operator auth is a single point of total compromise.
AdminProfile.checkType === 'PLATFORM'grants cross-tenant superuser with no MFA requirement, no step-up auth, no separate realm. One phished operator password = full platform breach across every employer's payroll. - The primary auth path for the target market (phone OTP) is not production-ready.
LoggingSmsSenderdrops OTPs in prod. There is no rate limiting / brute-force lockout wired on OTP or password endpoints. - The auth subtree bypasses the Nest pipeline.
/api/auth/*is mounted on raw Express beforesetGlobalPrefix, so it sits outside Nest guards, validation pipes, rate limiters, and observability interceptors. Whatever protection you add at the Nest layer does not protect login. - Webhook HMAC has a replay window. The nonce is signed but never stored/checked, so a captured signed webhook is replayable for 5 minutes.
Recommendation in one line: Keep Better Auth, isolate it behind a thin internal IAM abstraction now (so it is replaceable later without a domain rewrite), and spend your auth budget on the five gaps above — not on a rewrite. Re-evaluate a move to a dedicated IAM (Ory/Keycloak) at the scale-up gate, triggered by concrete events (first partner-bank security questionnaire, first multi-region requirement, OIDC-federation demand), not by speculation.
1. Is Better Auth acceptable for…?
| Context | Verdict | Why |
|---|---|---|
| Pilot | ✅ Yes | Session model, org/member, RLS-backed tenancy are sound. The gaps below are pilot-tolerable if explicitly risk-accepted and the pilot is low-volume / known employers. |
| Production (early, single region, <50 employers) | 🟡 Conditional | Acceptable only after closing: S2S auth, admin MFA, OTP provider + rate limiting. The library is fine; the surrounding controls are not yet. |
| Enterprise fintech scale | 🟡 Conditional, with caveats | Per-request getSession() DB hit and per-RBAC-request AdminProfile+Member lookups will need caching. The org/member model is coarse (3 roles). Workable but you will be hardening Better Auth, not leaning on it. |
| Regulated banking integrations | 🟠 Not yet | Not a Better Auth deficiency — it's the absence of M2M/service identity, admin MFA, and immutable auth audit trails. A bank's TPRM/security review fails on S2S createInsecure() and admin-without-MFA today. |
| Multi-bank future | 🟡 Neutral | Better Auth is irrelevant here. Multi-bank is an integration-gateway + partner-credential-vault problem (see LONG_TERM_IAM §M2M / partner auth). Better Auth neither helps nor hurts. |
| Multi-region future | 🟠 Becomes a constraint | DB-backed opaque sessions mean session lookups must hit a region-local replica or a shared session store; Better Auth's session model is single-writer-DB oriented. Solvable, but this is the most likely forcing function for a future IAM migration. |
Bottom line: Better Auth clears pilot and early-production bars. The bars it does not clear are bars that custom auth and Better Auth fail equally — they are control gaps, not library gaps.
2. Real risks of CONTINUING with Better Auth
| Class | Risk | Severity | Notes |
|---|---|---|---|
| Operational | Per-request auth.api.getSession() hits Postgres on every protected route (session.middleware.ts:44). No session cache. | Med | Becomes a latency + DB-connection bottleneck at scale. Mitigate with a short-TTL session cache (Redis) behind the abstraction. |
| Operational | Auth subtree mounted on raw Express (main.ts:41-42) — no shared rate limiting, validation, or tracing on login/OTP/reset. | High | You cannot add a global Nest control and assume login is covered. Every protection must be added at the Better Auth config or Express layer explicitly. |
| Security | No rate limiting / lockout on password + OTP endpoints observed. OTP brute-force + credential stuffing are open. | High | Better Auth supports rate limiting; it is not configured here. This is the highest-likelihood real-world attack. |
| Security | Phone OTP is the stated primary path but LoggingSmsSender drops codes in prod (sms-sender.ts:30). Users either can't log in or you ship with a placeholder. | High | Functional + security gap: shipping phone-OTP without a real provider and without rate limiting is worse than not shipping it. |
| Security | requireEmailVerification only true in production (better-auth.factory.ts:46) — fine — but no breach-password check, no complexity policy beyond min-length 8. | Low–Med | Acceptable for pilot; tighten before scale. |
| Maintenance | Dual role model: Member.role (owner/admin/member) and legacy Role+permissions Json table (schema.prisma:83-103) that nothing enforces. Two sources of "what is a role". | Med | Tech-debt and future confusion. Pick one. ABAC needs more than 3 coarse roles eventually. |
| Maintenance | Middleware coverage is an explicit allow-list of 4 controllers (app.module.ts:72-79). New controllers get no session/tenant context unless manually added. | Med | Fail-closed (AuthGuard 401s without req.user), so it's safe-by-accident, but it's a footgun: a new money-moving controller silently 401s until someone notices, and a @Public() mistake on it bypasses tenancy entirely. |
| Ecosystem | Better Auth is young and fast-moving; breaking changes between minor versions are common. Plugin API (organization, phoneNumber, twoFactor) is where churn concentrates. | Med | You already pin behaviour assumptions (e.g. Organization.id == Business.id invariant relies on not using the plugin's create-org endpoint). Upgrades need a regression suite. |
| Vendor | Open-source, self-hosted — low lock-in on hosting, but schema + session-shape lock-in is real: your Session, Account, Member, Verification, TwoFactor tables follow Better Auth's expected shapes. | Med | This is the dimension the abstraction layer (§6) directly addresses. |
| Scaling | Opaque DB sessions don't federate across regions without a shared/replicated session store. | Med–High (future) | The most likely migration trigger. |
| Compliance | No dedicated, immutable, queryable auth event log (logins, MFA changes, role grants, admin actions, impersonation). Session table is mutable and cascades. | High | Regulators expect "who authenticated, when, from where, and who granted what role" as tamper-evident history. Session.ipAddress/userAgent exist but are mutable and not an audit trail. |
3. Real risks of REWRITING auth ourselves (Option C)
This is where fintechs bleed out. Brutally honest accounting:
| Class | Risk | Reality |
|---|---|---|
| Security | You will reintroduce bugs the ecosystem solved years ago: timing-unsafe token compare, weak password hashing params, CSRF on auth endpoints, session fixation, OTP enumeration. | The current code already gets several of these right by inheriting Better Auth (CSRF via trustedOrigins, timingSafeEqual in HMAC). A rewrite re-opens all of them. |
| Implementation complexity | Sign-up, sign-in, email verify, password reset, OTP, session lifecycle, refresh/sliding, multi-org membership, invitations — that's 6–9 months of one senior engineer to reach parity, before MFA/OIDC. | Parity, not advantage. You'd spend a half-year to arrive where you already are. |
| Hidden edge cases | Concurrent session refresh, clock skew on tokens, partial email-verify states, OTP race on resend, invitation/accept races, org-switch mid-session, password-reset token reuse. | These are exactly the bugs that cause account takeover. Each is a CVE waiting in custom code. |
| Long-term maintenance | You now own a security-critical product forever. Every dependency CVE, every new MFA method, every OIDC quirk is your on-call. | This is a permanent tax on a team that should be building payroll/EWA/lending. |
| Audit/compliance burden | You must prove your hand-rolled crypto + session handling to auditors with no third-party attestation to lean on. | "We built our own auth" is a red flag to bank TPRM teams, not a selling point. |
| Session management | Revocation, device listing, idle vs absolute timeout, "log out everywhere", step-up — all bespoke. | Better Auth gives you opaque-token revocation for free. |
| MFA | TOTP enrolment, backup codes, recovery, SMS fallback, WebAuthn later. | The single hardest thing to get right. Do not hand-roll. |
| Password reset | The #1 account-takeover vector industry-wide. Token entropy, single-use, expiry, enumeration resistance, rate limiting. | Each detail is an incident if wrong. |
| OAuth/OIDC | If you ever federate (employer SSO, gov ID), you become an OIDC RP/OP. The spec is a minefield (nonce, state, PKCE, token validation, JWKS rotation). | This alone justifies never building custom — buy/adopt an OP. |
| Token revocation | If you move to JWTs for performance, revocation needs a denylist or short TTL + refresh. Custom = subtle bugs. | Better Auth's opaque sessions sidestep this entirely. |
| Insider attack surface | Custom admin tooling around auth tends to grow god-mode endpoints. | Larger bespoke surface = larger insider risk. |
| Tenancy leakage | One missing tenant_id filter in a hand-rolled query = cross-employer data breach. | You currently get defense-in-depth from Postgres RLS. A rewrite must not regress that. |
Verdict: rewriting auth from scratch is not justified. There is no requirement here that a rewrite satisfies and a managed/adopted IAM does not. Reserve "custom" for the thin glue (the abstraction layer), never the primitives.
5. Current implementation review — module classification
Classification key: SAFE (sound, keep) · PARTIAL (works, has gaps) · DANGEROUS (active risk) · STUB (placeholder) · TECH-DEBT (works, should be removed/consolidated).
| Module / concern | File | Class | Finding |
|---|---|---|---|
| AuthGuard (global authn gate) | auth/auth.guard.ts | ✅ SAFE | Fail-closed, default-deny, @Public() opt-out, cheap (req.user check only). Textbook. |
| SessionMiddleware | auth/session.middleware.ts | 🟡 PARTIAL | Correctly non-enforcing + fail-open-on-error-but-guard-fails-closed. Gap: per-request DB getSession(), no cache. role correctly removed from req.user. |
| OrgRoleGuard (RBAC) | auth/org-role.guard.ts | 🟡 PARTIAL | Single source of role truth (Member.role), headers never consulted, well-logged with masked IDs. Gaps: (a) 2 DB queries/request, no cache; (b) platform-admin bypasses all org checks → very broad; (c) only 3 coarse roles. |
| TenantContextMiddleware | tenant/tenant-context.middleware.ts | ✅ SAFE | Rejects x-actor-id/x-tenant-id as identity (A1/A4), tenant+actor from session only, fail-closed via ALS. Strong. |
| Tenant isolation (RLS) | prisma/schema.prisma:128-145, RLS.md | ✅ SAFE | Financial tables RLS-forced; identity tables intentionally cross-tenant. Defense-in-depth done right. |
| Role model | schema.prisma + shared-auth/rbac.ts | 🟠 TECH-DEBT | Dual model: active Member.role (3 strings) vs legacy Role+permissions Json (defined, unenforced). Consolidate. Coarse roles won't survive payroll/lending separation-of-duties needs. |
| Org model | schema.prisma:241-279 | ✅ SAFE | Organization.id == Business.id shared-PK invariant is elegant and makes session→tenant a direct map. Documented well. Keep. |
req.user usage | session.middleware.ts:19-24 | ✅ SAFE | Minimal stable contract (id, email, emailVerified, businessId). No role smuggled in. Good. |
| Actor attribution | ALS via runWithTenant | ✅ SAFE | Actor = session User.id, never a header. Correct for audit. Gap: see auth-event-log compliance hole (no immutable login/role-grant trail). |
| Auth guards registration | auth/auth.module.ts | ✅ SAFE | AuthGuard then OrgRoleGuard ordering via two APP_GUARDs is correct and documented. |
| Permission enforcement | controllers + rbac.ts | 🟡 PARTIAL | Declarative @RequireOrgRole/@RequirePlatformAdmin is clean. Gap: enforcement is per-route opt-in; an undecorated protected route is "any authenticated user" — easy to forget on a money route. |
| Better Auth integration / mount | main.ts:39-42, better-auth.factory.ts | 🟠 DANGEROUS | Auth subtree on raw Express outside Nest pipeline → no shared rate limiting / validation / tracing on login, OTP, reset. Highest-traffic-attack endpoints are the least protected. |
| Phone OTP sender | auth/sms-sender.ts | 🔴 STUB | LoggingSmsSender only. Prod drops OTPs. No real provider. Primary market auth path is non-functional + unmonitored in prod. |
| 2FA | schema.prisma:224-234 (TwoFactor), factory | 🔴 STUB | Table exists; plugin not wired in better-auth.factory.ts (only organization + phoneNumber). MFA is effectively absent platform-wide, including for admins. |
| Platform-admin auth | org-role.guard.ts:85-129, AdminProfile | 🟠 DANGEROUS | Cross-tenant superuser via a single checkType==='PLATFORM' row, no MFA, no step-up, no separate realm, no impersonation audit. Single phished password = total breach. |
| Bank webhook auth (HMAC) | integration/bank-webhook.controller.ts, hmac.ts | 🟡 PARTIAL | Fail-closed on missing key, raw-body HMAC, timingSafeEqual, 5-min skew, 401-without-reason. Solid. Gap: nonce is signed but never stored/checked → replay within the 5-min window succeeds. |
| Service-to-service (gRPC) auth | ewa/ledger.grpc-client.ts:102, ewa/integration-gateway.grpc-client.ts:142 | 🔴 DANGEROUS | credentials.createInsecure() — no TLS, no mTLS, no service token between API ↔ ledger ↔ gateway. The money-truth tier has no authenticated channel. Top finding. |
| Legacy JWT primitives | shared/auth package | 🟠 TECH-DEBT | FakeJwtVerifier/AuthClaims marked deprecated; shared/auth flagged Deprecated in CLAUDE.md but still exports live RBAC decorators. Split: keep rbac.ts, delete auth.ts. |
Webhook controller as @Public() | bank-webhook.controller.ts:47 | 🟡 PARTIAL | Correct (signature is the auth) but it's outside the tenant middleware allow-list and self-asserts tenant from payload — acceptable given HMAC, but the trust boundary should be documented as "partner-asserted tenant, verified by signature + reference lookup". |
Dangerous anti-patterns called out explicitly
createInsecure()on the ledger/gateway gRPC channels — unauthenticated, unencrypted money-tier RPC.- Platform-admin = unconditional cross-tenant bypass with no MFA —
org-role.guard.ts:94-129. - Auth endpoints outside the application security pipeline —
main.ts:41-42. - MFA table without MFA plugin — false sense of "2FA exists".
- Replayable signed webhook — nonce signed, never enforced for uniqueness.
- Per-route opt-in RBAC on money endpoints — forgetting a decorator silently downgrades to "any authenticated user".
Trust boundaries (current)
┌─────────────────────── UNTRUSTED ───────────────────────┐
Browser / mobile ─────▶│ /api/auth/* (RAW EXPRESS, pre-Nest) │
(employee/employer) │ └─ Better Auth handler: signup/in/out, OTP, session │ ← no shared rate-limit / pipe / trace
└──────────────────────────────┬──────────────────────────┘
│ Set-Cookie (opaque session token)
Browser / mobile ─────▶ /api/* (Nest) ▼
│ cookie ┌──────────────── TRUST BOUNDARY: session ────────────────┐
│ │ SessionMiddleware → getSession() → req.user │ (DB hit/req)
│ │ TenantContextMiddleware → ALS{tenantId,actorId} │ (headers rejected)
│ │ AuthGuard (authn) → OrgRoleGuard (authz, Member/Admin) │
│ └──────────────────────────────┬──────────────────────────┘
│ ▼
│ Domain services (EWA / Lending / Business / Employee)
│ │
│ Postgres ◀── RLS (app.tenant_id) ── financial tables
│ │
│ ┌──────────── WEAK BOUNDARY: S2S (NO AUTH/TLS) ─────────┐
│ │ gRPC createInsecure() │
│ ▼ ▼
Partner bank ─HMAC▶ integration-gateway (Go) ledger (Go)
▲ webhook │ ← HMAC, 5-min skew, NONCE NOT ENFORCED (replayable)
└────────────────────┘
Boundaries that hold: browser→Nest session, tenant isolation at DB, header-spoofing rejection, partner→API signature. Boundaries that leak: auth endpoints below the Nest controls; API↔Go services (no identity/TLS); admin realm (no MFA); webhook replay window.
See AUTH_RISK_MATRIX.md for likelihood × impact scoring and remediation ownership, and LONG_TERM_IAM_ARCHITECTURE.md for the target design.