ADR-022: Kafka topic naming, ownership, versioning, DLQ & replay (when activated)
- Status: Accepted
- Date: 2026-06-18
- Deciders: Principal Architect, Engineering Lead
- Relates to: ADR-008, ADR-011, ADR-017, ADR-020
Context
ADR-017 establishes that the Postgres outbox + table-poller is the event-transport spine today, and that Kafka stays dormant until a consumer genuinely needs streaming or replay. When Kafka is activated (high-volume/ordered/replayable domains: Wallet, Risk), it must not arrive ungoverned: the current ~116 events are largely stringly-typed with no registry, one publish path is typed "as any", and there is no DLQ or replay story. This ADR fixes the rules in advance so that activation is a config flip plus a compliant consumer, not a free-for-all.
Decision
When Kafka/Redpanda is activated, the following are mandatory:
- Topic naming: demoz.<context>.<aggregate>.<event>.v<major> (e.g. demoz.payroll.run.approved.v1).
- Ownership: the producing context owns the topic and its schema; one producer per topic; consumers never write to another context's topic.
- Schemas: protobuf in a schema registry, BACKWARD compatibility enforced in CI (ADR-020). No event published without a registered schema.
- Producers: transactional-outbox-only. Events are produced by the relay reading the outbox (ADR-008), never directly from business code (no dual-write).
- Delivery: at-least-once + idempotent consumers (dedup on event id) = effective-once. Money consumers MUST be idempotent. Ordering is per-aggregate via partition key = aggregate id; cross-aggregate order is never assumed.
- DLQ: each consumer has a demoz.<context>.<consumer>.dlq topic, bounded retry with backoff, an alert on depth > 0, and a replay tool.
- Versioning: additive = same major; breaking = a new .v2 topic with dual-publish during migration, retire v1 after consumers move.
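The naming rules above lend themselves to a mechanical check, for example in CI or in the relay before it publishes. The following is a minimal sketch, not a prescribed implementation. The helper names (event_topic, dlq_topic, parse_event_topic), the allowed character set per segment, and the rule that major versions start at 1 are assumptions for illustration; the ADR fixes only the segment order.

```python
import re

# Assumption: segments are lowercase and start with a letter, followed by
# letters, digits, "_" or "-". The ADR defines the segment order only.
_SEGMENT = r"[a-z][a-z0-9_-]*"

# demoz.<context>.<aggregate>.<event>.v<major>
_EVENT_TOPIC = re.compile(
    rf"demoz\.(?P<context>{_SEGMENT})\.(?P<aggregate>{_SEGMENT})"
    rf"\.(?P<event>{_SEGMENT})\.v(?P<major>[1-9][0-9]*)"
)

# demoz.<context>.<consumer>.dlq
_DLQ_TOPIC = re.compile(
    rf"demoz\.(?P<context>{_SEGMENT})\.(?P<consumer>{_SEGMENT})\.dlq"
)


def event_topic(context: str, aggregate: str, event: str, major: int) -> str:
    """Build an event topic name, rejecting anything that breaks the scheme."""
    name = f"demoz.{context}.{aggregate}.{event}.v{major}"
    if _EVENT_TOPIC.fullmatch(name) is None:
        raise ValueError(f"non-compliant event topic: {name!r}")
    return name


def dlq_topic(context: str, consumer: str) -> str:
    """Build the per-consumer DLQ topic name."""
    name = f"demoz.{context}.{consumer}.dlq"
    if _DLQ_TOPIC.fullmatch(name) is None:
        raise ValueError(f"non-compliant DLQ topic: {name!r}")
    return name


def parse_event_topic(name: str) -> dict[str, str]:
    """Split a compliant event topic into its named parts (all strings)."""
    match = _EVENT_TOPIC.fullmatch(name)
    if match is None:
        raise ValueError(f"non-compliant event topic: {name!r}")
    return match.groupdict()
```

Because a breaking change means a new major suffix, a v2 topic is simply event_topic("payroll", "run", "approved", 2); the v1 name stays valid while both are dual-published during migration.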
Alternatives considered
- Activate Kafka now with ad-hoc topics — rejected: feeds zero consumers today (ADR-017) and would bake in ungoverned naming/schemas.
- Skip governance until problems appear — rejected: schema drift on a multi-service event backbone is a silent-failure class we must prevent before it carries load.
Consequences
- Positive: activation is safe and mechanical; schema drift is caught in CI; poison messages are contained in DLQs, not stuck head-of-line; replay is a first-class capability.
- Negative / accepted: more upfront rules to follow once Kafka is live; the registry + DLQ topics are extra surface to operate (only paid once Kafka is actually activated).
- Follow-ups: even before Kafka, name the outbox-carried events with this scheme and register their protobuf schemas, so the eventual cutover is a transport change only.
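To make the containment and replay consequences concrete, the sketch below models a consumer that deduplicates on event id, retries a bounded number of times with exponential backoff, and then parks the message in a DLQ so later messages are not blocked. It is an in-memory illustration under stated assumptions, not a Kafka client: the Event and Consumer classes, the default of three attempts, and the 0.1 s base delay are hypothetical, and the set and list stand in for a durable dedup store and the consumer's .dlq topic.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    event_id: str      # dedup key for idempotent consumption
    aggregate_id: str  # would be the partition key (per-aggregate ordering)
    payload: dict


@dataclass
class Consumer:
    handler: Callable[[Event], None]
    max_attempts: int = 3      # assumption: the ADR says "bounded", not how many
    base_delay_s: float = 0.1  # assumption: first backoff delay
    sleep: Callable[[float], None] = time.sleep
    seen: set[str] = field(default_factory=set)     # stand-in for a durable dedup store
    dlq: list[Event] = field(default_factory=list)  # stand-in for the .dlq topic

    def consume(self, event: Event) -> str:
        # At-least-once delivery means redeliveries happen; skip known ids.
        if event.event_id in self.seen:
            return "duplicate"
        for attempt in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                # Back off before the next attempt, doubling the delay each time.
                if attempt + 1 < self.max_attempts:
                    self.sleep(self.base_delay_s * 2 ** attempt)
            else:
                # A real consumer would record the id in the same transaction
                # as the handler's side effects.
                self.seen.add(event.event_id)
                return "processed"
        # Retries exhausted: park the message so the partition keeps moving.
        self.dlq.append(event)
        return "dead-lettered"

    def replay(self) -> list[str]:
        """Re-consume everything currently parked in the DLQ."""
        pending, self.dlq = self.dlq, []
        return [self.consume(event) for event in pending]
```

A dead-lettered event is deliberately not added to the dedup set, so replay can process it once the underlying fault is fixed. In this sketch the alert on depth > 0 would correspond to checking len(consumer.dlq).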
Revisit when
- The trigger in ADR-017 fires (a domain needs ordered/replayable/high-volume streaming) → activate Kafka under these rules.