Skip to main content

ADR-005: Event-Driven Architecture β€” Kafka Topics Design

Metadata​

FieldValue
StatusAccepted
Date2026-02-06
LinearN/A (Foundational)

Context​

STOA Platform requires asynchronous communication between components:

  • Control Plane API publishes events when resources change
  • AWX/Ansible consumes deployment requests and reports results
  • MCP Gateway synchronizes tool catalogs based on GitOps events
  • Audit Service ingests all actions for compliance
  • Observability tracks gateway health and drift

The Problem​

"How do we enable loose coupling between services while maintaining consistency and auditability?"

Traditional synchronous REST calls create tight coupling and cascading failures. STOA needs an event-driven architecture that:

  • Decouples producers from consumers
  • Enables replay and recovery
  • Supports multi-tenant isolation
  • Provides audit trail

Decision​

Adopt Apache Kafka (Redpanda-compatible) as the event backbone with a standardized topic structure and event envelope format.

Event Envelope Schema​

All events follow a consistent envelope:

{
"id": "uuid-v4",
"type": "event-type-name",
"tenant_id": "tenant-identifier",
"timestamp": "2026-02-06T10:30:00.000Z",
"user_id": "user-who-triggered",
"payload": {
// Event-specific data
}
}

Topic Inventory​

TopicEvent TypesProducerConsumer
api-eventsapi-created, api-updated, api-deletedControl Plane APIMCP Gateway, Audit
deploy-requestsdeploy-requestControl Plane APIAWX
deploy-resultsdeploy-success, deploy-failureAWXControl Plane API
app-eventsapp-created, app-updated, app-deletedControl Plane APIGateway, Audit
tenant-eventstenant-created, tenant-updated, tenant-deletedControl Plane APIKeycloak, Audit
audit-logauditAll servicesAudit Service, OpenSearch
mcp-server-eventsmcp-server-registered, mcp-server-updatedControl Plane APIMCP Gateway
mcp-sync-requestssync-requestConsole UIMCP Gateway
mcp-sync-resultssync-success, sync-failureMCP GatewayControl Plane API
gateway-sync-requestssync-requestControl Plane APIGateway Adapter
gateway-sync-resultssync-success, sync-failureGateway AdapterControl Plane API
gateway-eventshealth-changed, drift-detected, reconciledGateway AdapterObservability

Architecture​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ PRODUCERS β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚Control Plane β”‚ β”‚ Console β”‚ β”‚ AWX β”‚ β”‚
β”‚ β”‚ API β”‚ β”‚ UI β”‚ β”‚ (Ansible) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ β”‚
β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ KAFKA / REDPANDA CLUSTER β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ api-events β”‚ β”‚deploy-req. β”‚ β”‚ audit-log β”‚ β”‚gateway-sync β”‚ β”‚
β”‚ β”‚ (3 partns) β”‚ β”‚ (3 partns) β”‚ β”‚ (6 partns) β”‚ β”‚ (3 partns) β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ app-events β”‚ β”‚tenant-event β”‚ β”‚mcp-server β”‚ β”‚mcp-sync-* β”‚ β”‚
β”‚ β”‚ (3 partns) β”‚ β”‚ (3 partns) β”‚ β”‚ events β”‚ β”‚ req/results β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ β”‚
β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CONSUMERS β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ MCP Gateway β”‚ β”‚ Audit Serviceβ”‚ β”‚ Observabilityβ”‚ β”‚
β”‚ β”‚ (tool β”‚ β”‚ (OpenSearch) β”‚ β”‚ (Grafana) β”‚ β”‚
β”‚ β”‚ registry) β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Partitioning Strategy​

Events are partitioned by tenant_id to ensure:

  1. Ordering β€” Events for the same tenant arrive in order
  2. Parallelism β€” Different tenants processed in parallel
  3. Isolation β€” Consumer groups can filter by tenant
async def publish(self, topic: str, event_type: str, tenant_id: str, payload: dict):
event = self._create_event(event_type, tenant_id, payload)
partition_key = tenant_id # Partition by tenant

future = self._producer.send(topic, value=event, key=partition_key)
future.get(timeout=10) # Wait for ack

Consumer Groups​

Consumer GroupTopicsPurpose
mcp-gateway-syncapi-events, mcp-server-eventsTool registry updates
awx-workerdeploy-requestsProcess deployment jobs
audit-ingesteraudit-log, api-events, app-eventsOpenSearch indexing
gateway-orchestratorgateway-sync-requestsGateway reconciliation
observability-collectorgateway-eventsMetrics/alerts

Event Flow Examples​

API Creation Flow​

Deployment Flow​

Connection Resilience​

Kafka connections include retry logic:

async def connect(self):
max_retries = 5
retry_delay = 2

for attempt in range(max_retries):
try:
self._producer = KafkaProducer(
bootstrap_servers=settings.KAFKA_BOOTSTRAP_SERVERS.split(","),
acks="all", # Wait for all replicas
retries=3,
request_timeout_ms=10000,
)
return
except Exception as e:
if attempt < max_retries - 1:
logger.warning(f"Attempt {attempt + 1} failed, retrying...")
time.sleep(retry_delay)
else:
raise

Redpanda Compatibility​

STOA uses Redpanda as a Kafka-compatible alternative:

FeatureKafkaRedpanda
ProtocolKafka 2.xCompatible
JVMRequiredNone (C++)
ZooKeeperRequired (< 3.x)None
Single-nodeComplexSimple
PerformanceGoodBetter latency

Configuration:

# docker-compose.yml
services:
redpanda:
image: redpandadata/redpanda:latest
command:
- redpanda start
- --kafka-addr 0.0.0.0:9092
- --advertise-kafka-addr redpanda:9092

Topic Configuration​

Recommended settings per topic category:

CategoryPartitionsRetentionReplication
Events (api, app, tenant)37 days3
Requests (deploy, sync)31 day3
Results31 day3
Audit690 days3
Gateway health31 day3

Consequences​

Positive​

  • Loose Coupling β€” Services communicate via events, not direct calls
  • Reliability β€” Kafka persists events; consumers can replay
  • Scalability β€” Partitions enable parallel processing
  • Auditability β€” All events are recorded for compliance
  • Observability β€” Event streams feed monitoring dashboards

Negative​

  • Eventual Consistency β€” UI may show stale data briefly
  • Operational Overhead β€” Kafka cluster requires management
  • Message Ordering β€” Only guaranteed within a partition
  • Debugging Complexity β€” Distributed traces needed

Mitigations​

ChallengeMitigation
Eventual consistencyOptimistic UI updates + polling
OperationsManaged Redpanda or Confluent Cloud
OrderingPartition by tenant_id
DebuggingOpenTelemetry trace propagation

References​


Standard Marchemalo: A 40-year veteran architect understands in 30 seconds