ADR-005: Event-Driven Architecture β Kafka Topics Design
Metadataβ
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-02-06 |
| Linear | N/A (Foundational) |
Contextβ
STOA Platform requires asynchronous communication between components:
- Control Plane API publishes events when resources change
- AWX/Ansible consumes deployment requests and reports results
- MCP Gateway synchronizes tool catalogs based on GitOps events
- Audit Service ingests all actions for compliance
- Observability tracks gateway health and drift
The Problemβ
"How do we enable loose coupling between services while maintaining consistency and auditability?"
Traditional synchronous REST calls create tight coupling and cascading failures. STOA needs an event-driven architecture that:
- Decouples producers from consumers
- Enables replay and recovery
- Supports multi-tenant isolation
- Provides audit trail
Decisionβ
Adopt Apache Kafka (Redpanda-compatible) as the event backbone with a standardized topic structure and event envelope format.
Event Envelope Schemaβ
All events follow a consistent envelope:
{
"id": "uuid-v4",
"type": "event-type-name",
"tenant_id": "tenant-identifier",
"timestamp": "2026-02-06T10:30:00.000Z",
"user_id": "user-who-triggered",
"payload": {
// Event-specific data
}
}
Topic Inventoryβ
| Topic | Event Types | Producer | Consumer |
|---|---|---|---|
api-events | api-created, api-updated, api-deleted | Control Plane API | MCP Gateway, Audit |
deploy-requests | deploy-request | Control Plane API | AWX |
deploy-results | deploy-success, deploy-failure | AWX | Control Plane API |
app-events | app-created, app-updated, app-deleted | Control Plane API | Gateway, Audit |
tenant-events | tenant-created, tenant-updated, tenant-deleted | Control Plane API | Keycloak, Audit |
audit-log | audit | All services | Audit Service, OpenSearch |
mcp-server-events | mcp-server-registered, mcp-server-updated | Control Plane API | MCP Gateway |
mcp-sync-requests | sync-request | Console UI | MCP Gateway |
mcp-sync-results | sync-success, sync-failure | MCP Gateway | Control Plane API |
gateway-sync-requests | sync-request | Control Plane API | Gateway Adapter |
gateway-sync-results | sync-success, sync-failure | Gateway Adapter | Control Plane API |
gateway-events | health-changed, drift-detected, reconciled | Gateway Adapter | Observability |
Architectureβ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PRODUCERS β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β βControl Plane β β Console β β AWX β β
β β API β β UI β β (Ansible) β β
β ββββββββ¬ββββββββ ββββββββ¬ββββββββ ββββββββ¬ββββββββ β
βββββββββββΌββββββββββββββββββΌββββββββββββββββββΌββββββββββββββββββββββββ
β β β
βΌ βΌ βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β KAFKA / REDPANDA CLUSTER β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β api-events β βdeploy-req. β β audit-log β βgateway-sync β β
β β (3 partns) β β (3 partns) β β (6 partns) β β (3 partns) β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β app-events β βtenant-event β βmcp-server β βmcp-sync-* β β
β β (3 partns) β β (3 partns) β β events β β req/results β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β β
βΌ βΌ βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CONSUMERS β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β MCP Gateway β β Audit Serviceβ β Observabilityβ β
β β (tool β β (OpenSearch) β β (Grafana) β β
β β registry) β β β β β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Partitioning Strategyβ
Events are partitioned by tenant_id to ensure:
- Ordering β Events for the same tenant arrive in order
- Parallelism β Different tenants processed in parallel
- Isolation β Consumer groups can filter by tenant
async def publish(self, topic: str, event_type: str, tenant_id: str, payload: dict):
event = self._create_event(event_type, tenant_id, payload)
partition_key = tenant_id # Partition by tenant
future = self._producer.send(topic, value=event, key=partition_key)
future.get(timeout=10) # Wait for ack
Consumer Groupsβ
| Consumer Group | Topics | Purpose |
|---|---|---|
mcp-gateway-sync | api-events, mcp-server-events | Tool registry updates |
awx-worker | deploy-requests | Process deployment jobs |
audit-ingester | audit-log, api-events, app-events | OpenSearch indexing |
gateway-orchestrator | gateway-sync-requests | Gateway reconciliation |
observability-collector | gateway-events | Metrics/alerts |
Event Flow Examplesβ
API Creation Flowβ
Deployment Flowβ
Connection Resilienceβ
Kafka connections include retry logic:
async def connect(self):
max_retries = 5
retry_delay = 2
for attempt in range(max_retries):
try:
self._producer = KafkaProducer(
bootstrap_servers=settings.KAFKA_BOOTSTRAP_SERVERS.split(","),
acks="all", # Wait for all replicas
retries=3,
request_timeout_ms=10000,
)
return
except Exception as e:
if attempt < max_retries - 1:
logger.warning(f"Attempt {attempt + 1} failed, retrying...")
time.sleep(retry_delay)
else:
raise
Redpanda Compatibilityβ
STOA uses Redpanda as a Kafka-compatible alternative:
| Feature | Kafka | Redpanda |
|---|---|---|
| Protocol | Kafka 2.x | Compatible |
| JVM | Required | None (C++) |
| ZooKeeper | Required (< 3.x) | None |
| Single-node | Complex | Simple |
| Performance | Good | Better latency |
Configuration:
# docker-compose.yml
services:
redpanda:
image: redpandadata/redpanda:latest
command:
- redpanda start
- --kafka-addr 0.0.0.0:9092
- --advertise-kafka-addr redpanda:9092
Topic Configurationβ
Recommended settings per topic category:
| Category | Partitions | Retention | Replication |
|---|---|---|---|
| Events (api, app, tenant) | 3 | 7 days | 3 |
| Requests (deploy, sync) | 3 | 1 day | 3 |
| Results | 3 | 1 day | 3 |
| Audit | 6 | 90 days | 3 |
| Gateway health | 3 | 1 day | 3 |
Consequencesβ
Positiveβ
- Loose Coupling β Services communicate via events, not direct calls
- Reliability β Kafka persists events; consumers can replay
- Scalability β Partitions enable parallel processing
- Auditability β All events are recorded for compliance
- Observability β Event streams feed monitoring dashboards
Negativeβ
- Eventual Consistency β UI may show stale data briefly
- Operational Overhead β Kafka cluster requires management
- Message Ordering β Only guaranteed within a partition
- Debugging Complexity β Distributed traces needed
Mitigationsβ
| Challenge | Mitigation |
|---|---|
| Eventual consistency | Optimistic UI updates + polling |
| Operations | Managed Redpanda or Confluent Cloud |
| Ordering | Partition by tenant_id |
| Debugging | OpenTelemetry trace propagation |
Referencesβ
- control-plane-api/src/services/kafka_service.py
- ADR-007 β GitOps with ArgoCD
- Apache Kafka Documentation
- Redpanda Documentation
Standard Marchemalo: A 40-year veteran architect understands in 30 seconds