
ADR-029: mTLS Certificate Lifecycle Management

Status​

Accepted

Date​

2026-02-01

Context​

Enterprise clients operating 100+ API consumers with mTLS certificates face operational challenges:

  • Manual certificate rotation takes 5-10 days per certificate
  • Zero downtime is required during rotation (SLA commitments)
  • Compliance (PCI-DSS, SOC2) requires a complete audit trail of all certificate operations
  • Grace periods must allow both old and new certificates to coexist during migration

Related decisions and tickets:

  • ADR-028: RFC 8705 Certificate Binding Validation (how fingerprints are compared)
  • ADR-027: X509 Header-Based Authentication (header contract)
  • CAB-865: Client certificate provisioning API
  • CAB-866: Keycloak certificate sync
  • CAB-869: Certificate rotation with grace period

Decision​

1. Automated provisioning via API​

Clients are provisioned via POST /v1/clients, which generates a certificate and returns the private key exactly once (the key is never stored server-side).

Two certificate providers:

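Provider                 Use Case            Implementation
-----------------------  ------------------  ----------------------------------------------
MockCertificateProvider  Development, demos  Self-signed certs via the cryptography library
VaultPKIProvider         Production          HashiCorp Vault PKI secrets engine (P1)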

The provider is selected via configuration. Both implement the same interface: generate_certificate(cn, tenant_id) → (cert_pem, key_pem, fingerprint, serial).
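As a rough illustration of the shared interface, the sketch below models it as a Python Protocol. Only generate_certificate and its four return values come from this ADR. IssuedCertificate, StubCertificateProvider, PROVIDERS, and get_provider are hypothetical names, and the stub emits placeholder text rather than real X.509 material.

```python
import hashlib
import secrets
from typing import Callable, Dict, NamedTuple, Protocol


class IssuedCertificate(NamedTuple):
    """Hypothetical container for the tuple returned by generate_certificate."""
    cert_pem: str
    key_pem: str
    fingerprint: str
    serial: str


class CertificateProvider(Protocol):
    """Interface shared by both providers (signature taken from the ADR)."""

    def generate_certificate(self, cn: str, tenant_id: str) -> IssuedCertificate:
        ...


class StubCertificateProvider:
    """Illustrative stand-in: emits placeholder text, not a real certificate."""

    def generate_certificate(self, cn: str, tenant_id: str) -> IssuedCertificate:
        serial = secrets.token_hex(8)
        cert_pem = f"PLACEHOLDER CERT cn={cn} tenant={tenant_id} serial={serial}"
        key_pem = "PLACEHOLDER KEY " + secrets.token_hex(16)
        # A real provider would hash the DER-encoded certificate instead.
        fingerprint = hashlib.sha256(cert_pem.encode("utf-8")).hexdigest()
        return IssuedCertificate(cert_pem, key_pem, fingerprint, serial)


# Hypothetical registry keyed by a configuration value.
PROVIDERS: Dict[str, Callable[[], CertificateProvider]] = {
    "stub": StubCertificateProvider,
}


def get_provider(name: str) -> CertificateProvider:
    """Return the provider selected by configuration, or fail loudly."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown certificate provider: {name!r}") from None
```

Because callers depend only on the Protocol, swapping the development provider for the Vault-backed one would be a configuration change rather than a code change.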

2. Grace period rotation​

When a certificate is rotated (POST /v1/clients/{id}/rotate):

  1. New certificate is generated
  2. Old fingerprint moves to certificate_fingerprint_previous
  3. previous_cert_expires_at is set to now + grace_period
  4. Keycloak user attribute updated with both fingerprints
  5. Private key returned once in the response
  6. Both old and new certificates are valid during the grace period

┌──────────┐   Rotate    ┌─────────────────────────┐  Grace Expires  ┌──────────┐
│ Cert A   │ ──────────→ │ Cert A (grace) + Cert B │ ──────────────→ │ Cert B   │
│ (active) │             │ (both valid)            │                 │ (active) │
└──────────┘             └─────────────────────────┘                 └──────────┘
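
The state change in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the service's actual code: ClientRecord and apply_rotation are hypothetical names, the fields mirror the columns of the clients table, and certificate generation, the Keycloak update, and the API response are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClientRecord:
    """Hypothetical in-memory stand-in for a row of the clients table."""
    certificate_fingerprint: str
    certificate_fingerprint_previous: Optional[str] = None
    previous_cert_expires_at: Optional[datetime] = None
    last_rotated_at: Optional[datetime] = None
    rotation_count: int = 0


def apply_rotation(
    client: ClientRecord,
    new_fingerprint: str,
    grace_period: timedelta,
    now: Optional[datetime] = None,
) -> ClientRecord:
    """Apply steps 2 and 3 of the rotation flow to the record in place."""
    now = now or datetime.now(timezone.utc)
    # Step 2: the current fingerprint becomes the previous one.
    client.certificate_fingerprint_previous = client.certificate_fingerprint
    client.certificate_fingerprint = new_fingerprint
    # Step 3: the old certificate stays valid until now + grace_period.
    client.previous_cert_expires_at = now + grace_period
    client.last_rotated_at = now
    client.rotation_count += 1
    return client
```

Note that in this sketch a second rotation inside an open grace period overwrites the previous fingerprint, so only the two most recent certificates are ever valid. Whether the real endpoint allows that is not stated in this ADR.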

3. Time-based grace period (not request-count)​

Chosen: Grace period expires after a configurable duration (default: 24 hours, range: 1h-168h).

Rejected alternative: Request-count-based grace (e.g., "old cert valid for 1000 more requests"):

  • Harder to predict when migration is complete
  • Requires distributed counter across gateway instances
  • Edge case: low-traffic clients may never exhaust the count

Time-based grace is simpler, predictable, and sufficient for all observed use cases.
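
The default and bounds stated above could be enforced with a small helper like the one below. The constant and function names are hypothetical, and the sketch assumes the grace period is supplied in whole hours.

```python
from datetime import timedelta
from typing import Optional

DEFAULT_GRACE_HOURS = 24  # default stated in this ADR
MIN_GRACE_HOURS = 1       # lower bound stated in this ADR
MAX_GRACE_HOURS = 168     # upper bound stated in this ADR (7 days)


def resolve_grace_period(hours: Optional[int] = None) -> timedelta:
    """Return the grace period as a timedelta, rejecting out-of-range values."""
    if hours is None:
        hours = DEFAULT_GRACE_HOURS
    if not MIN_GRACE_HOURS <= hours <= MAX_GRACE_HOURS:
        raise ValueError(
            f"grace period must be between {MIN_GRACE_HOURS}h and "
            f"{MAX_GRACE_HOURS}h, got {hours}h"
        )
    return timedelta(hours=hours)
```

Rejecting out-of-range values, rather than silently clamping them, keeps the effective grace period identical to what the caller asked for, which matters for the audit trail.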

4. Event-driven cleanup​

Certificate lifecycle events are published to Kafka (stoa.metering.events):

Event                    Trigger                       Consumer Action
-----------------------  ----------------------------  ---------------------------------------------------------
ClientCreatedEvent       POST /v1/clients              Sync fingerprint to Keycloak
CertificateRotatedEvent  POST /v1/clients/{id}/rotate  Update Keycloak with both fingerprints
GracePeriodExpiredEvent  Scheduled check or TTL        Clear certificate_fingerprint_previous in DB and Keycloak
ClientRevokedEvent       DELETE /v1/clients/{id}       Remove from Keycloak, mark revoked in DB

All consumers are idempotent: events are safe to replay after a failure.
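
To make the idempotency claim concrete, here is one way a GracePeriodExpiredEvent handler could be written so that a replay is a no-op. The event payload fields (client_id, expired_fingerprint), the ClientState class, and the in-memory store are assumptions for illustration. The ADR does not specify the event schema, and the Kafka and Keycloak calls are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GracePeriodExpiredEvent:
    """Hypothetical payload; the real event schema is not given in the ADR."""
    client_id: str
    expired_fingerprint: str


@dataclass
class ClientState:
    """Hypothetical stand-in for the stored client row."""
    certificate_fingerprint: str
    certificate_fingerprint_previous: Optional[str] = None


def handle_grace_period_expired(
    event: GracePeriodExpiredEvent, store: Dict[str, ClientState]
) -> bool:
    """Clear the previous fingerprint; return True only if state changed."""
    client = store.get(event.client_id)
    if client is None:
        return False  # unknown client: nothing to do
    if client.certificate_fingerprint_previous != event.expired_fingerprint:
        # Already cleared, or a newer rotation replaced it: replay is a no-op.
        return False
    client.certificate_fingerprint_previous = None
    return True
```

Matching on the fingerprint carried by the event, instead of clearing unconditionally, also protects a newer grace period from being wiped by a stale event that is replayed late.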

5. Database schema​

-- Migration 013: Create clients table
CREATE TABLE clients (
    id                      UUID PRIMARY KEY,
    tenant_id               VARCHAR(255) NOT NULL,
    name                    VARCHAR(255) NOT NULL,
    certificate_cn          VARCHAR(255) NOT NULL,
    certificate_serial      VARCHAR(255),
    certificate_fingerprint VARCHAR(255),
    certificate_pem         TEXT,
    certificate_not_before  TIMESTAMPTZ,
    certificate_not_after   TIMESTAMPTZ,
    status                  clientstatus DEFAULT 'active',
    created_at              TIMESTAMPTZ DEFAULT now(),
    updated_at              TIMESTAMPTZ DEFAULT now()
);

-- Migration 014: Add rotation fields
ALTER TABLE clients ADD COLUMN certificate_fingerprint_previous VARCHAR(255);
ALTER TABLE clients ADD COLUMN previous_cert_expires_at TIMESTAMPTZ;
ALTER TABLE clients ADD COLUMN last_rotated_at TIMESTAMPTZ;
ALTER TABLE clients ADD COLUMN rotation_count INTEGER DEFAULT 0;

Whether a client is in its grace period is determined by comparing previous_cert_expires_at against the current time:

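from datetime import datetime, timezone

# Property on the client model class.
@property
def is_in_grace_period(self) -> bool:
    # No expiry recorded means no rotation is in progress.
    if not self.previous_cert_expires_at:
        return False
    # Compare in UTC; the column is TIMESTAMPTZ, so both sides are timezone-aware.
    return datetime.now(timezone.utc) < self.previous_cert_expires_at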
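
Building on that check, a request-time decision about whether a presented fingerprint is acceptable might look like the sketch below. The function name and parameters are hypothetical, fingerprints are assumed to be already-normalized ASCII hex strings, and the comparison rules defined in ADR-028 are not reproduced here.

```python
import hmac
from datetime import datetime, timezone
from typing import Optional


def is_fingerprint_accepted(
    presented: str,
    current: str,
    previous: Optional[str],
    previous_expires_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> bool:
    """Accept the current fingerprint, or the previous one while in grace."""
    now = now or datetime.now(timezone.utc)
    # compare_digest runs in constant time; str arguments must be ASCII-only.
    if hmac.compare_digest(presented, current):
        return True
    in_grace = previous_expires_at is not None and now < previous_expires_at
    if in_grace and previous is not None:
        return hmac.compare_digest(presented, previous)
    return False
```

The expiry comparison uses the same strict "now < expires_at" test as is_in_grace_period, so the old certificate stops being accepted at the exact instant the grace period ends.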

6. Revocation​

Revocation is a soft delete: status is set to revoked, and the client row remains in the database for audit purposes. The Keycloak user attribute is cleared, so the certificate is no longer valid for authentication.
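
A minimal sketch of the soft-delete behaviour, under stated assumptions: ClientRow, revoke, and can_authenticate are hypothetical names, the status values are plain strings rather than the clientstatus enum, and keeping the fingerprint on the row for auditing is an assumption. Clearing the Keycloak attribute is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ClientRow:
    """Hypothetical stand-in for a row of the clients table."""
    status: str = "active"
    certificate_fingerprint: Optional[str] = None
    updated_at: Optional[datetime] = None


def revoke(client: ClientRow, now: Optional[datetime] = None) -> bool:
    """Mark the client revoked without deleting it; True if state changed."""
    if client.status == "revoked":
        return False  # repeated revocation is a no-op
    client.status = "revoked"
    client.updated_at = now or datetime.now(timezone.utc)
    return True


def can_authenticate(client: ClientRow) -> bool:
    """Only active clients may authenticate; revoked rows exist for audit."""
    return client.status == "active"
```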

Consequences​

Positive​

  • Zero-downtime rotation: both certificates are valid during the grace period
  • Audit trail: all operations are recorded via Kafka events and DB timestamps
  • Configurable grace period: set per rotation, so it adapts to client migration speed
  • Idempotent events: safe to replay, resilient to consumer failures
  • Separation of concerns: the API handles provisioning, Kafka handles sync

Negative​

  • Two fingerprints active: slightly larger attack surface during the grace period (mitigated by time-bounded expiry)
  • Delayed cleanup: grace period expiry is asynchronous via Kafka (not immediate)
  • Private key exposure: the key is returned once in the API response and must be transmitted over mTLS/HTTPS
  • MockCertificateProvider: not suitable for production (self-signed certs)

Risks​

  • Kafka unavailability: if Kafka is down, Keycloak sync is delayed. Mitigated by fail_closed mode, in which new clients cannot authenticate until sync completes.
  • Clock skew: grace period expiry depends on server time. Mitigated by using UTC everywhere and NTP sync on all nodes.
  • Orphaned grace periods: if the cleanup consumer is down, old fingerprints linger. Mitigated by a periodic reconciliation job.
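
The ADR does not describe the reconciliation job. One plausible shape for its selection step is sketched below, assuming the job scans client rows and reports those whose grace period has ended but whose previous fingerprint is still set. ClientRow and find_orphaned_grace_periods are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass
class ClientRow:
    """Hypothetical stand-in for the rotation columns of the clients table."""
    id: str
    certificate_fingerprint_previous: Optional[str]
    previous_cert_expires_at: Optional[datetime]


def find_orphaned_grace_periods(
    rows: Iterable[ClientRow], now: Optional[datetime] = None
) -> List[str]:
    """Return ids of clients whose expired previous fingerprint still lingers."""
    now = now or datetime.now(timezone.utc)
    return [
        row.id
        for row in rows
        if row.certificate_fingerprint_previous is not None
        and row.previous_cert_expires_at is not None
        and row.previous_cert_expires_at <= now
    ]
```

The returned ids could then be fed back through the normal cleanup path, for example by re-emitting GracePeriodExpiredEvent, which is safe because the consumers are idempotent.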