ADR-029: mTLS Certificate Lifecycle Management
Status
Accepted
Date
2026-02-01
Context
Enterprise clients operating 100+ API consumers with mTLS certificates face operational challenges:
- Manual certificate rotation takes 5-10 days per certificate
- Zero downtime is required during rotation (SLA commitments)
- Compliance (PCI-DSS, SOC2) requires a complete audit trail of all certificate operations
- Grace periods must allow both old and new certificates to coexist during migration
Related Decisions
- ADR-028: RFC 8705 Certificate Binding Validation (how fingerprints are compared)
- ADR-027: X509 Header-Based Authentication (header contract)
- CAB-865: Client certificate provisioning API
- CAB-866: Keycloak certificate sync
- CAB-869: Certificate rotation with grace period
Decision
1. Automated provisioning via API
Clients are provisioned via POST /v1/clients, which generates a certificate and returns the private key exactly once. The private key is never stored server-side.
Two certificate providers:
| Provider | Use Case | Implementation |
|---|---|---|
| MockCertificateProvider | Development, demos | Self-signed certs via cryptography library |
| VaultPKIProvider | Production | HashiCorp Vault PKI secrets engine (P1) |
The provider is selected via configuration. Both implement the same interface: generate_certificate(cn, tenant_id) -> (cert_pem, key_pem, fingerprint, serial).
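The shared interface and configuration-based selection could be sketched roughly as follows. Only the generate_certificate signature comes from this ADR; the CertificateProvider protocol name, the StubCertificateProvider class, the PROVIDERS registry, and get_provider are hypothetical names used for illustration, and the stub emits placeholder strings rather than real X.509 material.

```python
import hashlib
import secrets
from typing import Protocol


class CertificateProvider(Protocol):
    """Interface both providers are assumed to satisfy (signature from this ADR)."""

    def generate_certificate(self, cn: str, tenant_id: str) -> tuple[str, str, str, str]:
        """Return (cert_pem, key_pem, fingerprint, serial)."""
        ...


class StubCertificateProvider:
    """Hypothetical stand-in: placeholder strings only, NOT real certificates."""

    def generate_certificate(self, cn: str, tenant_id: str) -> tuple[str, str, str, str]:
        serial = secrets.token_hex(16)
        cert_pem = (
            "-----BEGIN PLACEHOLDER-----\n"
            f"{tenant_id}/{cn}/{serial}\n"
            "-----END PLACEHOLDER-----\n"
        )
        key_pem = secrets.token_hex(32)  # placeholder for a private key
        # Fingerprint modeled as the SHA-256 hex digest of the certificate text.
        fingerprint = hashlib.sha256(cert_pem.encode()).hexdigest()
        return cert_pem, key_pem, fingerprint, serial


# Hypothetical registry mapping a configuration value to a provider class.
PROVIDERS = {"stub": StubCertificateProvider}


def get_provider(name: str) -> CertificateProvider:
    """Select a provider by its configured name."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown certificate provider: {name!r}") from None
```

Because callers depend only on the protocol, swapping the development provider for the Vault-backed one would be a configuration change rather than a code change.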
2. Grace period rotation
When a certificate is rotated (POST /v1/clients/{id}/rotate):
- New certificate is generated
- Old fingerprint moves to certificate_fingerprint_previous
- previous_cert_expires_at is set to now + grace_period
- Keycloak user attribute updated with both fingerprints
- Private key returned once in the response
- Both old and new certificates are valid during the grace period
+----------+  Rotate   +-------------------------+  Grace Expires  +----------+
| Cert A   | --------> | Cert A (grace) + Cert B | --------------> | Cert B   |
| (active) |           | (both valid)            |                 | (active) |
+----------+           +-------------------------+                 +----------+
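The state change performed by a rotation can be sketched as below. The field names mirror the columns defined in this ADR; the Client dataclass and the rotate function are illustrative, and updating last_rotated_at and rotation_count at this point is an assumption based on the schema rather than a documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Client:
    # Field names mirror the columns in this ADR's schema.
    certificate_fingerprint: str
    certificate_fingerprint_previous: Optional[str] = None
    previous_cert_expires_at: Optional[datetime] = None
    last_rotated_at: Optional[datetime] = None
    rotation_count: int = 0


def rotate(client: Client, new_fingerprint: str, grace_period: timedelta,
           now: Optional[datetime] = None) -> Client:
    """Apply the rotation state change in place and return the client."""
    now = now or datetime.now(timezone.utc)
    # The old fingerprint stays valid until the grace period ends.
    client.certificate_fingerprint_previous = client.certificate_fingerprint
    client.previous_cert_expires_at = now + grace_period
    client.certificate_fingerprint = new_fingerprint
    # Assumed bookkeeping, inferred from the rotation columns.
    client.last_rotated_at = now
    client.rotation_count += 1
    return client
```

Passing now explicitly keeps the function deterministic in tests; in production the default of the current UTC time would apply.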
3. Time-based grace period (not request-count)
Chosen: Grace period expires after a configurable duration (default: 24 hours, range: 1h-168h).
Rejected alternative: Request-count-based grace (e.g., "old cert valid for 1000 more requests"):
- Harder to predict when migration is complete
- Requires distributed counter across gateway instances
- Edge case: low-traffic clients may never exhaust the count
Time-based grace is simpler, predictable, and sufficient for all observed use cases.
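A minimal sketch of how the configurable duration might be validated, using the default and range stated above. The function name resolve_grace_period and the choice to reject out-of-range values (rather than clamp them) are assumptions.

```python
from datetime import timedelta
from typing import Optional

# Values taken from this ADR: default 24 hours, allowed range 1h-168h.
DEFAULT_GRACE_HOURS = 24
MIN_GRACE_HOURS = 1
MAX_GRACE_HOURS = 168


def resolve_grace_period(hours: Optional[int] = None) -> timedelta:
    """Return the grace period, falling back to the default when unset."""
    if hours is None:
        return timedelta(hours=DEFAULT_GRACE_HOURS)
    if not MIN_GRACE_HOURS <= hours <= MAX_GRACE_HOURS:
        raise ValueError(
            f"grace period must be between {MIN_GRACE_HOURS}h and "
            f"{MAX_GRACE_HOURS}h, got {hours}h"
        )
    return timedelta(hours=hours)
```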
4. Event-driven cleanup
Certificate lifecycle events are published to Kafka (stoa.metering.events):
| Event | Trigger | Consumer Action |
|---|---|---|
| ClientCreatedEvent | POST /v1/clients | Sync fingerprint to Keycloak |
| CertificateRotatedEvent | POST /v1/clients/{id}/rotate | Update Keycloak with both fingerprints |
| GracePeriodExpiredEvent | Scheduled check or TTL | Clear certificate_fingerprint_previous in DB and Keycloak |
| ClientRevokedEvent | DELETE /v1/clients/{id} | Remove from Keycloak, mark revoked in DB |
All consumers are idempotent, so events are safe to replay after a failure.
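One way to make the GracePeriodExpiredEvent consumer idempotent is sketched below. The event payload shape (client_id, expired_fingerprint) and the dictionaries standing in for the database and Keycloak are hypothetical; the actual event schema is not specified in this ADR.

```python
from typing import Any


def handle_grace_period_expired(event: dict[str, Any],
                                db: dict[str, dict[str, Any]],
                                keycloak: dict[str, list[str]]) -> None:
    """Clear the expired fingerprint; applying the same event twice is a no-op."""
    client_id = event["client_id"]
    row = db.get(client_id)
    if row is None:
        return  # unknown client: nothing to clean up

    # Only clear the fingerprint this event refers to, so a replayed event
    # from an earlier rotation cannot wipe a newer grace period.
    if row["certificate_fingerprint_previous"] == event["expired_fingerprint"]:
        row["certificate_fingerprint_previous"] = None
        row["previous_cert_expires_at"] = None

    # Re-derive the Keycloak attribute from DB state instead of mutating it
    # incrementally; the result is the same however often this runs.
    fingerprints = [row["certificate_fingerprint"]]
    if row["certificate_fingerprint_previous"]:
        fingerprints.append(row["certificate_fingerprint_previous"])
    keycloak[client_id] = fingerprints
```

Deriving the target state from the database, rather than applying a delta, is what makes replays safe in this sketch.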
5. Database schema
-- Migration 013: Create clients table
-- Assumes the clientstatus enum type already exists.
CREATE TABLE clients (
    id UUID PRIMARY KEY,
    tenant_id VARCHAR(255) NOT NULL,
    name VARCHAR(255) NOT NULL,
    certificate_cn VARCHAR(255) NOT NULL,
    certificate_serial VARCHAR(255),
    certificate_fingerprint VARCHAR(255),
    certificate_pem TEXT,
    certificate_not_before TIMESTAMPTZ,
    certificate_not_after TIMESTAMPTZ,
    status clientstatus DEFAULT 'active',
    created_at TIMESTAMPTZ DEFAULT now(),
    updated_at TIMESTAMPTZ DEFAULT now()
);
-- Migration 014: Add rotation fields
ALTER TABLE clients ADD COLUMN certificate_fingerprint_previous VARCHAR(255);
ALTER TABLE clients ADD COLUMN previous_cert_expires_at TIMESTAMPTZ;
ALTER TABLE clients ADD COLUMN last_rotated_at TIMESTAMPTZ;
ALTER TABLE clients ADD COLUMN rotation_count INTEGER DEFAULT 0;
Whether a client is in its grace period is determined by comparing previous_cert_expires_at against the current time:
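from datetime import datetime, timezone

# Property on the client model.
@property
def is_in_grace_period(self) -> bool:
    # No previous certificate on record means no grace period is active.
    if not self.previous_cert_expires_at:
        return False
    # Compare in UTC; the column is timezone-aware (TIMESTAMPTZ).
    return datetime.now(timezone.utc) < self.previous_cert_expires_at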
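Building on the grace-period check, the decision of which fingerprints are accepted at a given moment might look like the sketch below. The function name and parameters are illustrative, and plain equality is used only for brevity: the actual comparison rules are defined in ADR-028.

```python
from datetime import datetime, timezone
from typing import Optional


def is_fingerprint_accepted(presented: str,
                            current: str,
                            previous: Optional[str],
                            previous_expires_at: Optional[datetime],
                            now: Optional[datetime] = None) -> bool:
    """Accept the current fingerprint, or the previous one while in grace."""
    now = now or datetime.now(timezone.utc)
    if presented == current:
        return True
    # The previous fingerprint only counts while its grace period is open.
    in_grace = previous_expires_at is not None and now < previous_expires_at
    return bool(previous) and in_grace and presented == previous
```

Because expiry is evaluated at check time, the old certificate stops being accepted when previous_cert_expires_at passes, even if the asynchronous cleanup has not yet run.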
6. Revocation
Revocation is a soft delete: status is set to revoked and the client row remains in the database for audit purposes. The Keycloak user attribute is cleared, and the certificate is no longer valid for authentication.
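A rough sketch of the soft delete, with dictionaries standing in for the clients table and the Keycloak user attribute. The function name revoke_client and the in-memory stores are hypothetical.

```python
from typing import Any


def revoke_client(client_id: str,
                  db: dict[str, dict[str, Any]],
                  keycloak: dict[str, list[str]]) -> None:
    """Mark the client revoked without deleting its row."""
    row = db.get(client_id)
    if row is None:
        raise KeyError(f"unknown client: {client_id}")
    # Soft delete: the row, including certificate metadata, stays for audit.
    row["status"] = "revoked"
    # With no fingerprints left in Keycloak, authentication can no longer match.
    keycloak[client_id] = []
```

Calling it a second time leaves the state unchanged, consistent with the idempotent handling of ClientRevokedEvent.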
Consequences
Positive
- Zero downtime rotation: both certificates are valid during the grace period
- Audit trail: all operations are recorded via Kafka events and DB timestamps
- Configurable grace period: set per rotation, adapts to client migration speed
- Idempotent events: safe to replay, resilient to consumer failures
- Separation of concerns: the API handles provisioning, Kafka handles sync
Negative
- Two fingerprints active: slightly larger attack surface during the grace period (mitigated by time-bounded expiry)
- Delayed cleanup: grace period expiry is processed asynchronously via Kafka, so it is not immediate
- Private key exposure: the key is returned once in the API response and must be transmitted over mTLS/HTTPS
- MockCertificateProvider: not suitable for production (self-signed certs)
Risks
- Kafka unavailability: if Kafka is down, Keycloak sync is delayed. Mitigated by fail_closed mode: new clients cannot authenticate until sync completes.
- Clock skew: grace period expiry depends on server time. Mitigated by using UTC everywhere and NTP sync on all nodes.
- Orphaned grace periods: if the cleanup consumer is down, old fingerprints linger. Mitigated by a periodic reconciliation job.
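The reconciliation job's selection step could be sketched as follows, assuming rows are available as dictionaries keyed by the column names from the schema. The function name find_orphaned_grace_periods is hypothetical, and how the job then triggers cleanup (for example by re-publishing a GracePeriodExpiredEvent) is left open here.

```python
from datetime import datetime, timezone
from typing import Any, Iterable, Optional


def find_orphaned_grace_periods(rows: Iterable[dict[str, Any]],
                                now: Optional[datetime] = None) -> list[str]:
    """Return ids of clients whose grace period ended but was never cleaned up."""
    now = now or datetime.now(timezone.utc)
    orphaned = []
    for row in rows:
        expires_at = row.get("previous_cert_expires_at")
        # Orphaned: a previous fingerprint is still stored past its expiry.
        if (row.get("certificate_fingerprint_previous")
                and expires_at is not None
                and expires_at <= now):
            orphaned.append(row["id"])
    return orphaned
```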