ADR-027: Redis usage posture — cache & ephemeral state only
- Status: Accepted
- Date: 2026-06-20
- Deciders: Principal Architect, Engineering Lead
- Relates to: ADR-006, ADR-017, ADR-023, ADR-026
Context
Redis is in the stack — redis:7-alpine in docker-compose.yml, an optional REDIS_URL in config, a health probe, and OpenTelemetry instrumentation — but its role has never been written down. Two risks follow from that silence:
- Misread as a ban. ADR-026 §3a says event dedup/DLQ live in Postgres, "not Redis." Read in isolation that sounds like a platform-wide prohibition. It is not — it is scoped to the event path.
- Misread as a spine. Equally, someone could reach for Redis as a store of record (sessions, locks, counters that gate money) and quietly introduce a non-durable dependency into a correctness path.
The actual footprint today (verified): exactly one functional consumer — the EWA eligibility cache (apps/api/src/products/ewa/eligibility-cache.ts, bound in ewa-api.module.ts), which is a non-authoritative UI estimate with an in-memory fallback. Everything else is plumbing around that: the health probe (_infra/health/indicators/redis.indicator.ts), the dependency-state metric, and OTel auto-instrumentation. Auth rate-limiting is deliberately in-process, not Redis (identity/auth/auth-rate-limit.ts). The app boots and is correct with REDIS_URL unset.
We need one rule that says what Redis is for, so both misreads stop.
Decision
Redis is the platform's optional cache + ephemeral-state tier. It is never a source of truth.
1. The litmus test. If losing a value on a Redis eviction, failover, or restart would cause incorrectness, double-money, or an unreconstructable state, it does not belong in Redis. It goes to Postgres (the ledger for money — ADR-006; the outbox/audit/processed_event for events — ADR-017/026).
2. Allowed uses (Redis is the right tool):
- Read-through caches of values that are always recomputable from a system of record — e.g. the EWA eligibility estimate. A stale or missing entry must only cost a recompute, never correctness.
  Implemented as a platform-wide tier: _infra/cache/ exposes one CachePort (CACHE token, @Global CacheModule) — RedisCache when REDIS_URL is set (shared across all api instances), InMemoryCache otherwise; fail-open (a down Redis becomes cache misses, never an error). Domains inject CACHE and namespace their keys (<context>:<purpose>:) rather than building their own Redis client. EWA's eligibility cache is the first consumer (CacheEligibilityCache delegates to it). Verified against real Redis incl. the cross-instance shared property (_infra/cache/cache.integration.spec.ts, in CI).
- Rate-limiting at multi-instance scale — a shared INCR+EXPIRE counter, if and only if we do not push rate-limiting to the edge first (see §4).
- Short-lived ephemeral throttles / nonces / advisory locks where loss is tolerable (degrades to "best-effort," never to "wrong").
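The read-through, fail-open behaviour described in §2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of _infra/cache/: the method signatures on CachePort, the FailOpenCache wrapper, the readThrough helper, and the eligibilityKey function are all hypothetical names introduced here. A real RedisCache would implement the same interface; it is left out so the sketch runs without a Redis server.

```typescript
// Hypothetical sketch: these names and signatures are assumptions,
// not the real _infra/cache/ API.
interface CachePort {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
}

/** In-process fallback, used when no shared backend is configured. */
class InMemoryCache implements CachePort {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | undefined> {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // expired entries behave like misses
      return undefined;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlMs: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
}

/** Wraps any backend so that its failures surface as misses, never as errors. */
class FailOpenCache implements CachePort {
  private readonly backend: CachePort;

  constructor(backend: CachePort) {
    this.backend = backend;
  }

  async get(key: string): Promise<string | undefined> {
    try {
      return await this.backend.get(key);
    } catch {
      return undefined; // a down backend looks like a cache miss
    }
  }

  async set(key: string, value: string, ttlMs: number): Promise<void> {
    try {
      await this.backend.set(key, value, ttlMs);
    } catch {
      // a dropped write only costs a later recompute
    }
  }
}

/** Read-through: on a miss, recompute from the system of record, then store. */
async function readThrough(
  cache: CachePort,
  key: string,
  ttlMs: number,
  recompute: () => Promise<string>,
): Promise<string> {
  const cached = await cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await recompute();
  await cache.set(key, fresh, ttlMs);
  return fresh;
}

/** Namespaced key in the <context>:<purpose>: form (the id suffix is an assumption). */
const eligibilityKey = (employeeId: string): string => `ewa:eligibility:${employeeId}`;
```

The point of the wrapper is that callers never see a backend error: the worst outcome of a Redis outage is that recompute runs on every call, which is exactly the "costs a recompute, never correctness" property the rule requires.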
3. Forbidden uses (Postgres instead):
- Money truth / balances (ADR-006), idempotency keys for money-moving POSTs (ADR-007), audit + outbox (ADR-008), Kafka consumer dedup / DLQ / effective-once (processed_event, ADR-026 §3a). These are durable, transactional, and reconstructable — Redis is none of those.
4. Rate-limiting specifically. The production answer is the edge (WAF / API Gateway, consistent with ADR-023 "REST only at the edge"). In-process limiting is the pre-pilot single-instance floor and becomes best-effort-per-pod when we scale out. Redis-backed limiting is the fallback if the edge is not yet in place when we run multiple api instances — not the preferred design.
Implemented: identity/auth/redis-auth-rate-limit.ts — a shared-counter limiter (atomic Lua: INCR+PEXPIRE over the same burst+sustained windows as the in-process limiter). createAuthRateLimiter(REDIS_URL) returns it when REDIS_URL is set, else the in-process limiter; main.ts builds one per process. It fails open to the in-process floor if Redis errors (never 500s the auth path, never lets brute force through unbounded). Verified against real Redis incl. the cross-instance shared-counter property (redis-auth-rate-limit.integration.spec.ts, in CI).
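The shared-counter limiter with an in-process floor can be pictured with a sketch like the one below. It is deliberately simplified and does not reproduce identity/auth/redis-auth-rate-limit.ts: it uses a single fixed window rather than the burst+sustained pair, and SharedCounter, InProcessLimiter, SharedCounterLimiter, createLimiterSketch, and the auth:ratelimit: key prefix are hypothetical names. The Redis call is hidden behind an interface so the example runs without a server; the Lua in the comment is the usual increment-then-expire-on-first-hit pattern and is shown only as an assumption about what the atomic step could look like.

```typescript
// Hypothetical sketch: simplified to one fixed window, names are assumptions.

interface SharedCounter {
  /**
   * Atomically increments the counter for `key` and starts a window of
   * `windowMs` on the first hit; resolves to the new count. Against Redis
   * this could be one EVAL of a script along these lines (assumption):
   *
   *   local n = redis.call('INCR', KEYS[1])
   *   if n == 1 then redis.call('PEXPIRE', KEYS[1], ARGV[1]) end
   *   return n
   */
  incrWithTtl(key: string, windowMs: number): Promise<number>;
}

interface RateLimiter {
  allow(key: string): Promise<boolean>;
}

/** Per-process fixed-window limiter: the single-instance floor. */
class InProcessLimiter implements RateLimiter {
  private readonly windows = new Map<string, { count: number; resetAt: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  async allow(key: string): Promise<boolean> {
    const now = Date.now();
    const current = this.windows.get(key);
    if (current === undefined || current.resetAt <= now) {
      // first hit in a fresh window
      this.windows.set(key, { count: 1, resetAt: now + this.windowMs });
      return 1 <= this.limit;
    }
    current.count += 1;
    return current.count <= this.limit;
  }
}

/** Shared-counter limiter that degrades to the in-process floor on errors. */
class SharedCounterLimiter implements RateLimiter {
  private readonly counter: SharedCounter;
  private readonly floor: RateLimiter;
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(counter: SharedCounter, floor: RateLimiter, limit: number, windowMs: number) {
    this.counter = counter;
    this.floor = floor;
    this.limit = limit;
    this.windowMs = windowMs;
  }

  async allow(key: string): Promise<boolean> {
    try {
      const count = await this.counter.incrWithTtl(`auth:ratelimit:${key}`, this.windowMs);
      return count <= this.limit;
    } catch {
      // Counter unavailable: neither reject the request outright nor wave it
      // through unbounded; fall back to the per-process limit.
      return this.floor.allow(key);
    }
  }
}

/** Picks the shared limiter when a counter is available, else the floor. */
function createLimiterSketch(
  counter: SharedCounter | undefined,
  limit: number,
  windowMs: number,
): RateLimiter {
  const floor = new InProcessLimiter(limit, windowMs);
  return counter === undefined
    ? floor
    : new SharedCounterLimiter(counter, floor, limit, windowMs);
}
```

Keeping the floor inside the shared limiter is what makes the failure mode "best-effort per pod" rather than "unlimited": an attacker who takes Redis offline still hits the in-process limit on every instance.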
5. Redis stays optional until something genuinely needs it. With REDIS_URL unset the app must boot and behave correctly (in-memory fallbacks). With REDIS_URL set, the health probe treats Redis as CRITICAL: a configured-but-unreachable Redis is a readiness failure. Individual requests may still bypass the cache silently (fail-open), but a dependency you declared must be honest about being down.
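The two REDIS_URL cases in §5 reduce to a small decision, sketched below. This does not reproduce _infra/health/indicators/redis.indicator.ts; the RedisReadiness type, checkRedisReadiness, isReady, and the injected ping function are hypothetical, and the ping is passed in so the sketch makes no claim about any particular Redis client API.

```typescript
// Hypothetical sketch: type and function names are assumptions.
type RedisReadiness =
  | { status: "not_configured" }
  | { status: "up" }
  | { status: "down"; reason: string };

/**
 * Redis is only probed when it has been declared via a URL. An unset URL is
 * not a failure; a set URL that cannot be reached is.
 */
async function checkRedisReadiness(
  redisUrl: string | undefined,
  ping: (url: string) => Promise<void>,
): Promise<RedisReadiness> {
  if (redisUrl === undefined || redisUrl === "") {
    return { status: "not_configured" };
  }
  try {
    await ping(redisUrl);
    return { status: "up" };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "down", reason };
  }
}

/** Only a declared-but-unreachable Redis fails readiness. */
function isReady(result: RedisReadiness): boolean {
  return result.status !== "down";
}
```

Separating "not configured" from "up" keeps the optional posture visible in the probe output instead of reporting a dependency that was never declared as healthy.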
Alternatives considered
- Ban Redis entirely, use Postgres for everything — rejected: caches and high-rate counters have a latency/contention profile that would put avoidable load on the primary DB at scale. Redis is the correct tool for lossy, hot, ephemeral data.
- Use Redis as the event dedup / DLQ store — rejected: dedup must be transactional with the consumer's state change and durable; SETNX+eviction is neither (ADR-026 §3a).
- Make Redis a hard dependency now — rejected: pre-pilot we run a single instance; the only functional use is a fallback-capable cache. A hard dependency would be premature operational debt.
Consequences
- Positive: one clear rule; "no Redis in events" and "Redis is fine for caching" coexist without contradiction; nobody accidentally puts correctness state in a lossy store; Redis stays removable until scale justifies it.
- Negative / accepted: multi-instance brute-force protection now has a Redis-backed option (above), but it is only active when REDIS_URL is set; an edge WAF/gateway is still the preferred long-term layer. Until one of the two is enabled in a given environment, that environment's limit is per-pod.
- Follow-ups: when we scale to N api instances, decide rate-limiting (edge preferred, Redis fallback) and consider a shared eligibility cache for hit-rate. Neither is needed at single-instance pilot.
Revisit when
- We run more than one api instance in production (rate-limit + cache-hit-rate decisions become live).
- A new feature wants Redis for something that smells like state-of-record — re-run the §1 litmus test before saying yes.