ADR-027: Redis usage posture — cache & ephemeral state only

Context

Redis is in the stack — redis:7-alpine in docker-compose.yml, an optional REDIS_URL in config, a health probe, and OpenTelemetry instrumentation — but its role has never been written down. Two risks follow from that silence:

  1. Misread as a ban. ADR-026 §3a says event dedup/DLQ live in Postgres, "not Redis." Read in isolation that sounds like a platform-wide prohibition. It is not — it is scoped to the event path.
  2. Misread as a spine. Equally, someone could reach for Redis as a store of record (sessions, locks, counters that gate money) and quietly introduce a non-durable dependency into a correctness path.

The actual footprint today (verified): exactly one functional consumer — the EWA eligibility cache (apps/api/src/products/ewa/eligibility-cache.ts, bound in ewa-api.module.ts), which is a non-authoritative UI estimate with an in-memory fallback. Everything else is plumbing around that: the health probe (_infra/health/indicators/redis.indicator.ts), the dependency-state metric, and OTel auto-instrumentation. Auth rate-limiting is deliberately in-process, not Redis (identity/auth/auth-rate-limit.ts). The app boots and is correct with REDIS_URL unset.

We need one rule that says what Redis is for, so both misreads stop.

Decision

Redis is the platform's optional cache + ephemeral-state tier. It is never a source of truth.

1. The litmus test. If losing a value on a Redis eviction, failover, or restart would cause incorrectness, double-money, or an unreconstructable state, it does not belong in Redis. It goes to Postgres (the ledger for money — ADR-006; the outbox/audit/processed_event for events — ADR-017/026).

2. Allowed uses (Redis is the right tool):

  • Read-through caches of values that are always recomputable from a system of record — e.g. the EWA eligibility estimate. A stale or missing entry must only cost a recompute, never correctness.

    Implemented as a platform-wide tier: _infra/cache/ exposes one CachePort (CACHE token, @Global CacheModule) — RedisCache when REDIS_URL is set (shared across all api instances), InMemoryCache otherwise; fail-open (a down Redis becomes cache misses, never an error). Domains inject CACHE and namespace their keys (<context>:<purpose>:) rather than building their own Redis client. EWA's eligibility cache is the first consumer (CacheEligibilityCache delegates to it). Verified against real Redis incl. the cross-instance shared property (_infra/cache/cache.integration.spec.ts, in CI).

  • Rate-limiting at multi-instance scale — a shared INCR+EXPIRE counter, if and only if we do not push rate-limiting to the edge first (see §4).

  • Short-lived ephemeral throttles / nonces / advisory locks where loss is tolerable (degrades to "best-effort," never to "wrong").
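A minimal sketch of the read-through, fail-open behaviour described in the first bullet above. The interface shape, class names, and helper functions here are hypothetical illustrations, not the actual CachePort in _infra/cache/, and a real RedisCache would sit behind the same seam. It shows only the contract: a miss or a cache error costs a recompute, never an error to the caller.

```typescript
// Hypothetical cache seam; the real CachePort may expose a different API.
interface CachePort {
  get(key: string): Promise<string | undefined>;
  set(key: string, value: string, ttlMs: number): Promise<void>;
}

// In-memory stand-in, used here in place of a Redis-backed implementation.
class InMemoryCache implements CachePort {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();

  async get(key: string): Promise<string | undefined> {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // expired entries behave like misses
      return undefined;
    }
    return entry.value;
  }

  async set(key: string, value: string, ttlMs: number): Promise<void> {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
}

// Simulates an unreachable cache backend for the fail-open case.
class BrokenCache implements CachePort {
  async get(): Promise<string | undefined> {
    throw new Error("cache down");
  }
  async set(): Promise<void> {
    throw new Error("cache down");
  }
}

// Keys are namespaced as <context>:<purpose>:<id>.
function cacheKey(context: string, purpose: string, id: string): string {
  return `${context}:${purpose}:${id}`;
}

// Read-through with fail-open: any cache error is treated as a miss.
async function readThrough(
  cache: CachePort,
  key: string,
  ttlMs: number,
  recompute: () => Promise<string>,
): Promise<string> {
  try {
    const hit = await cache.get(key);
    if (hit !== undefined) return hit;
  } catch {
    // Cache unavailable: fall through and recompute from the system of record.
  }
  const fresh = await recompute();
  try {
    await cache.set(key, fresh, ttlMs);
  } catch {
    // A failed write only costs a future recompute.
  }
  return fresh;
}
```

Because the caller always supplies `recompute`, the cached value can never become the only copy of anything, which is the property the litmus test in §1 asks for.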

3. Forbidden uses (Postgres instead):

  • Money truth / balances (ADR-006), idempotency keys for money-moving POSTs (ADR-007), audit + outbox (ADR-008), Kafka consumer dedup / DLQ / effective-once (processed_event, ADR-026 §3a). These are durable, transactional, and reconstructable — Redis is none of those.

4. Rate-limiting specifically. The production answer is the edge (WAF / API Gateway, consistent with ADR-023 "REST only at the edge"). In-process limiting is the pre-pilot single-instance floor and becomes best-effort-per-pod when we scale out. Redis-backed limiting is the fallback if the edge is not yet in place when we run multiple api instances — not the preferred design.

Implemented in identity/auth/redis-auth-rate-limit.ts as a shared-counter limiter (atomic Lua: INCR+PEXPIRE over the same burst+sustained windows as the in-process limiter). createAuthRateLimiter(REDIS_URL) returns it when REDIS_URL is set, else the in-process limiter; main.ts builds one per process. If Redis errors, it falls back to the in-process floor, so the auth path never returns a 500 and brute force is never left unbounded. Verified against real Redis, including the cross-instance shared-counter property (redis-auth-rate-limit.integration.spec.ts, in CI).
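The shared-counter limiter described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of redis-auth-rate-limit.ts: the SharedCounter seam, the class names, the key layout, and the window values are all hypothetical. The Lua string is the common Redis pattern of incrementing a key and setting its TTL only on first creation; a real implementation would run it through a Redis client's EVAL, which is left out here so the sketch stays self-contained.

```typescript
// Common Redis pattern: increment, and set the TTL only when the key is new.
// Shown for reference; this sketch does not execute it.
const INCR_WITH_TTL_LUA = `
local n = redis.call('INCR', KEYS[1])
if n == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[1])
end
return n
`;

// Hypothetical seam: a Redis-backed version would EVAL the script above.
interface SharedCounter {
  hit(key: string, windowMs: number): Promise<number>;
}

// One limit window, e.g. a short burst window or a longer sustained window.
interface LimitWindow {
  name: string;
  limit: number;
  windowMs: number;
}

// In-process fixed-window counter: the per-pod floor.
class InProcessCounter {
  private readonly buckets = new Map<string, { count: number; resetAt: number }>();

  hit(key: string, windowMs: number, now: number = Date.now()): number {
    const bucket = this.buckets.get(key);
    if (bucket === undefined || bucket.resetAt <= now) {
      this.buckets.set(key, { count: 1, resetAt: now + windowMs });
      return 1;
    }
    bucket.count += 1;
    return bucket.count;
  }
}

class AuthRateLimiter {
  private readonly floor = new InProcessCounter();
  private readonly windows: readonly LimitWindow[];
  private readonly shared: SharedCounter | undefined;

  constructor(windows: readonly LimitWindow[], shared?: SharedCounter) {
    this.windows = windows;
    this.shared = shared;
  }

  // Counts the attempt in every window; allowed only if all are within limit.
  async allow(subject: string): Promise<boolean> {
    let allowed = true;
    for (const w of this.windows) {
      const key = `auth:ratelimit:${w.name}:${subject}`;
      const count = await this.count(key, w.windowMs);
      if (count > w.limit) allowed = false;
    }
    return allowed;
  }

  private async count(key: string, windowMs: number): Promise<number> {
    if (this.shared !== undefined) {
      try {
        return await this.shared.hit(key, windowMs);
      } catch {
        // Shared store failed: fall back to the in-process floor below.
      }
    }
    return this.floor.hit(key, windowMs);
  }
}
```

In this sketch a Redis outage degrades the limit to per-pod rather than removing it, and the caller never sees the store's error.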

5. Redis stays optional until something genuinely needs it. With REDIS_URL unset the app must boot and behave correctly (in-memory fallbacks). With REDIS_URL set, the health probe is CRITICAL (a configured-but-unreachable Redis is a readiness failure — silent cache bypass is acceptable, but a dependency you declared must be honest about being down).
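The readiness rule in §5 can be expressed as a small policy function. The sketch below is an assumption-laden illustration, not the code in redis.indicator.ts: the ProbeResult type, the function names, and the injected ping callback are hypothetical. It shows the two cases side by side: an unset REDIS_URL is reported as disabled and never blocks readiness, while a set but unreachable one is a critical failure.

```typescript
// Hypothetical result shape for a single dependency probe.
type ProbeResult = {
  status: "up" | "down" | "disabled";
  critical: boolean;
};

// `ping` is injected so the policy can be exercised without a real Redis.
async function probeRedis(
  redisUrl: string | undefined,
  ping: (url: string) => Promise<void>,
): Promise<ProbeResult> {
  // Not configured: Redis is optional, so this is not a failure.
  if (redisUrl === undefined || redisUrl === "") {
    return { status: "disabled", critical: false };
  }
  try {
    await ping(redisUrl);
    return { status: "up", critical: true };
  } catch {
    // Configured but unreachable: a declared dependency must report it is down.
    return { status: "down", critical: true };
  }
}

// Ready unless some critical dependency is down.
function isReady(results: readonly ProbeResult[]): boolean {
  return results.every((r) => !(r.critical && r.status === "down"));
}
```

Note that request handling can still bypass the cache silently while the probe reports the outage; the two behaviours are independent.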

Alternatives considered

  • Ban Redis entirely, use Postgres for everything — rejected: caches and high-rate counters have a latency/contention profile that would put avoidable load on the primary DB at scale. Redis is the correct tool for lossy, hot, ephemeral data.
  • Use Redis as the event dedup / DLQ store — rejected: dedup must be transactional with the consumer's state change and durable; SETNX+eviction is neither (ADR-026 §3a).
  • Make Redis a hard dependency now — rejected: pre-pilot we run a single instance; the only functional use is a fallback-capable cache. A hard dependency would be premature operational debt.

Consequences

  • Positive: one clear rule; "no Redis in events" and "Redis is fine for caching" coexist without contradiction; nobody accidentally puts correctness state in a lossy store; Redis stays removable until scale justifies it.
  • Negative / accepted: multi-instance brute-force protection now has a Redis-backed option (above), but it is only active when REDIS_URL is set; an edge WAF/gateway is still the preferred long-term layer. Until one of the two is enabled in a given environment, that environment's limit is per-pod.
  • Follow-ups: when we scale to N api instances, decide which rate-limiting layer to enable (edge preferred, Redis fallback) and whether to set REDIS_URL so the eligibility cache is shared across instances for hit-rate. Neither is needed at single-instance pilot.

Revisit when

  • We run more than one api instance in production (rate-limit + cache-hit-rate decisions become live).
  • A new feature wants Redis for something that smells like state-of-record — re-run the §1 litmus test before saying yes.