ADR-022: Kafka topic naming, ownership, versioning, DLQ & replay (when activated)

Context

ADR-017 establishes that the Postgres outbox + table-poller is the event-transport spine today and that Kafka stays dormant until a consumer genuinely needs streaming or replay. When Kafka is activated (high-volume/ordered/replayable domains: Wallet, Risk), it must not arrive ungoverned: the current ~116 events are largely stringly-typed with no registry, one publish path is typed as any, and there is no DLQ/replay story. This ADR fixes the rules in advance so that activation is a config flip plus a compliant consumer, not a free-for-all.

Decision

When Kafka/Redpanda is activated, the following are mandatory:

  • Topic naming: demoz.<context>.<aggregate>.<event>.v<major> (e.g. demoz.payroll.run.approved.v1).
  • Ownership: the producing context owns the topic and its schema; one producer per topic; consumers never write to another context's topic.
  • Schemas: protobuf in a schema registry, BACKWARD compatibility enforced in CI (ADR-020). No event published without a registered schema.
  • Producers: transactional-outbox-only — events are produced by the relay reading the outbox (ADR-008), never directly from business code (no dual-write).
  • Delivery: at-least-once delivery plus idempotent consumers (dedup on event id) gives effectively-once processing. Consumers that handle money MUST be idempotent. Ordering is per-aggregate, via partition key = aggregate id; cross-aggregate order is never assumed.
  • DLQ: each consumer has a demoz.<context>.<consumer>.dlq topic, bounded retry with backoff, an alert on depth > 0, and a replay tool.
  • Versioning: additive changes stay on the same major; a breaking change gets a new topic at the next major (e.g. .v2), dual-published alongside v1 during migration; v1 is retired after all consumers have moved.
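
The naming and versioning rules above are mechanical enough to check in code. The following is a minimal sketch of such a check, not an existing tool: the function names (topic_name, parse_topic, dlq_topic, next_major_topic) are hypothetical, and the allowed character set for each segment (lowercase letters, digits, underscores) is an assumption, since this ADR does not specify one.

```python
import re

# Assumption: each segment is lowercase alphanumeric with underscores.
# The ADR fixes the shape of the name, not the character set.
_SEGMENT = r"[a-z][a-z0-9_]*"

# demoz.<context>.<aggregate>.<event>.v<major>
TOPIC_RE = re.compile(
    rf"^demoz\.(?P<context>{_SEGMENT})\.(?P<aggregate>{_SEGMENT})"
    rf"\.(?P<event>{_SEGMENT})\.v(?P<major>[1-9][0-9]*)$"
)


def parse_topic(name: str) -> dict:
    """Split a topic name into its parts, or raise ValueError if non-compliant."""
    match = TOPIC_RE.match(name)
    if match is None:
        raise ValueError(f"topic name does not follow the scheme: {name!r}")
    parts = match.groupdict()
    parts["major"] = int(parts["major"])
    return parts


def topic_name(context: str, aggregate: str, event: str, major: int) -> str:
    """Build an event topic name and validate it against the scheme."""
    name = f"demoz.{context}.{aggregate}.{event}.v{major}"
    parse_topic(name)  # raises ValueError on an invalid segment or major
    return name


def dlq_topic(context: str, consumer: str) -> str:
    """Build the per-consumer DLQ topic: demoz.<context>.<consumer>.dlq."""
    for segment in (context, consumer):
        if re.fullmatch(_SEGMENT, segment) is None:
            raise ValueError(f"invalid segment: {segment!r}")
    return f"demoz.{context}.{consumer}.dlq"


def next_major_topic(name: str) -> str:
    """Name of the topic a breaking change would move to (v1 -> v2)."""
    p = parse_topic(name)
    return topic_name(p["context"], p["aggregate"], p["event"], p["major"] + 1)
```

For example, topic_name("payroll", "run", "approved", 1) yields demoz.payroll.run.approved.v1, and next_major_topic on that name yields the .v2 topic used for dual-publish during a breaking migration. A check like this could run in CI next to the schema compatibility check, so that a non-compliant name fails the build rather than reaching the broker.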

Alternatives considered

  • Activate Kafka now with ad-hoc topics — rejected: feeds zero consumers today (ADR-017) and would bake in ungoverned naming/schemas.
  • Skip governance until problems appear — rejected: schema drift on a multi-service event backbone is a silent-failure class we must prevent before it carries load.

Consequences

  • Positive: activation is safe and mechanical; schema drift is caught in CI; poison messages are contained in DLQs, not stuck head-of-line; replay is a first-class capability.
  • Negative / accepted: more upfront rules to follow once Kafka is live; the registry and DLQ topics are extra surface to operate (a cost paid only once Kafka is actually activated).
  • Follow-ups: even before Kafka, name the outbox-carried events with this scheme and register their protobuf schemas, so the eventual cutover is a transport change only.
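
The delivery and DLQ rules in the Decision section, and the poison-message containment listed under Consequences, can be illustrated with a small in-memory sketch. It uses no Kafka client; the Event and Consumer classes, their fields, and the retry defaults are all hypothetical and chosen only for illustration. The in-memory seen set stands in for a durable dedup store, which in a real money consumer would presumably be written in the same database transaction as the side effect.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    event_id: str        # dedup key
    aggregate_id: str    # would be the partition key on a real topic
    payload: dict


@dataclass
class Consumer:
    handler: Callable[[Event], None]
    max_attempts: int = 3        # bounded retry (hypothetical default)
    base_delay_s: float = 0.0    # 0 keeps the sketch fast; use > 0 in practice
    seen: set = field(default_factory=set)   # stand-in for a durable dedup store
    dlq: list = field(default_factory=list)  # stand-in for the .dlq topic

    def consume(self, event: Event) -> str:
        # Idempotency: a redelivered event id is acknowledged without re-running.
        if event.event_id in self.seen:
            return "duplicate"
        for attempt in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                # Exponential backoff between attempts, none after the last one.
                if attempt + 1 < self.max_attempts:
                    time.sleep(self.base_delay_s * 2 ** attempt)
                continue
            self.seen.add(event.event_id)
            return "processed"
        # Retries exhausted: park the event so later events are not blocked.
        self.dlq.append(event)
        return "dead-lettered"

    def replay_dlq(self) -> List[str]:
        """Re-run every parked event through the normal consume path."""
        parked, self.dlq = self.dlq, []
        return [self.consume(event) for event in parked]
```

Because replay goes through the same consume path, a replayed event is still deduplicated and can be dead-lettered again if the underlying fault has not been fixed. The "alert on depth > 0" rule would correspond to alerting whenever dlq is non-empty.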

Revisit when

  • The trigger in ADR-017 fires (a domain needs ordered/replayable/high-volume streaming) → activate Kafka under these rules.