# ADR-008: Semantic Response Caching - pgvector Strategy
## Metadata
| Field | Value |
|---|---|
| Status | Accepted |
| Date | 2026-02-06 |
| Linear | CAB-881 |
## Context
MCP Gateway handles repeated tool invocations with similar (but not identical) inputs. For example:
- "Get weather in Paris" vs "What's the weather in Paris?"
- "List APIs for tenant acme" vs "Show me acme's APIs"
Traditional caches require exact key matches, missing semantically equivalent queries. This leads to:
- Redundant API calls to backends
- Increased latency for AI agents
- Higher token costs when responses are re-generated
## The Problem
"How do we cache responses for queries that mean the same thing but have different wording?"
## Decision
Implement a two-path semantic cache using PostgreSQL with pgvector extension:
- Fast Path: SHA-256 exact match (O(log n) B-tree lookup)
- Semantic Path: cosine similarity ≥ 0.95 on embeddings
## Architecture
```
┌────────────────────────────────────────────────────────────────────┐
│ MCP Gateway                                                        │
│                                                                    │
│   ┌──────────────┐                                                 │
│   │ Tool Request │                                                 │
│   │              │                                                 │
│   │ tenant: acme │                                                 │
│   │ tool: stoa_* │                                                 │
│   │ args: {...}  │                                                 │
│   └──────┬───────┘                                                 │
│          │                                                         │
│          ▼                                                         │
│   ┌────────────────────────────────────────────────────────────┐   │
│   │ SemanticCache                                              │   │
│   │                                                            │   │
│   │ 1. Build cache_key from (tool, args)                       │   │
│   │ 2. Hash key → SHA-256                                      │   │
│   │ 3. Fast path: exact hash match                             │   │
│   │ 4. If miss: embed key → vector                             │   │
│   │ 5. Semantic path: cosine similarity ≥ 0.95                 │   │
│   │ 6. If hit: return cached response                          │   │
│   │ 7. If miss: execute tool, store result                     │   │
│   └────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│ PostgreSQL + pgvector                                              │
│                                                                    │
│   semantic_cache                                                   │
│   ├── tenant_id (text, partition key)                              │
│   ├── tool_name (text)                                             │
│   ├── key_hash (text, SHA-256)                                     │
│   ├── embedding (vector(384))                                      │
│   ├── response_payload (jsonb)                                     │
│   ├── created_at (timestamp)                                       │
│   └── expires_at (timestamp)                                       │
│                                                                    │
│   Indexes:                                                         │
│   - (tenant_id, key_hash) UNIQUE → fast path                       │
│   - HNSW on embedding → semantic similarity search                 │
└────────────────────────────────────────────────────────────────────┘
```
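The schema above can be sketched as DDL. This is a sketch, not the actual migration: column types follow the diagram, and the index names are illustrative (HNSW build parameters are left at pgvector defaults).

```sql
-- Requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE semantic_cache (
    tenant_id        text        NOT NULL,
    tool_name        text        NOT NULL,
    key_hash         text        NOT NULL,   -- SHA-256 hex digest
    embedding        vector(384) NOT NULL,
    response_payload jsonb       NOT NULL,
    created_at       timestamptz NOT NULL DEFAULT now(),
    expires_at       timestamptz NOT NULL
);

-- Fast path: exact-match lookup
CREATE UNIQUE INDEX semantic_cache_tenant_hash_idx
    ON semantic_cache (tenant_id, key_hash);

-- Semantic path: approximate nearest-neighbour search on cosine distance
CREATE INDEX semantic_cache_embedding_idx
    ON semantic_cache USING hnsw (embedding vector_cosine_ops);
```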
## Implementation
### Two-Path Lookup
```python
async def lookup(
    self,
    session: AsyncSession,
    tenant_id: str,
    tool_name: str,
    arguments: dict,
) -> dict | None:
    cache_key = self._embedder.build_cache_key(tool_name, arguments)
    key_hash = self._embedder.hash_key(cache_key)  # SHA-256
    now = datetime.now(timezone.utc)

    # --- Fast path: exact hash match ---
    result = await session.execute(
        text("""
            SELECT response_payload
            FROM semantic_cache
            WHERE tenant_id = :tenant_id
              AND key_hash = :key_hash
              AND expires_at > :now
            LIMIT 1
        """),
        {"tenant_id": tenant_id, "key_hash": key_hash, "now": now},
    )
    row = result.fetchone()
    if row is not None:
        return json.loads(row.response_payload)  # HIT: exact match

    # --- Semantic path: cosine similarity ---
    embedding = self._embedder.embed(cache_key)
    result = await session.execute(
        text("""
            SELECT response_payload,
                   1 - (embedding <=> :embedding::vector) AS similarity
            FROM semantic_cache
            WHERE tenant_id = :tenant_id
              AND tool_name = :tool_name
              AND expires_at > :now
              AND 1 - (embedding <=> :embedding::vector) >= :threshold
            ORDER BY similarity DESC
            LIMIT 1
        """),
        {
            # Depending on the driver, the embedding may need to be
            # serialised to a '[...]' string literal for the ::vector cast.
            "embedding": embedding,
            "tenant_id": tenant_id,
            "tool_name": tool_name,
            "now": now,
            "threshold": 0.95,
        },
    )
    row = result.fetchone()
    if row is not None:
        return json.loads(row.response_payload)  # HIT: semantic match

    return None  # MISS
### Embedding Strategy
Using sentence-transformers for semantic embeddings:
```python
import hashlib
import json

from sentence_transformers import SentenceTransformer


class Embedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(model_name)

    def build_cache_key(self, tool_name: str, arguments: dict) -> str:
        return f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"

    def hash_key(self, cache_key: str) -> str:
        return hashlib.sha256(cache_key.encode()).hexdigest()

    def embed(self, cache_key: str) -> list[float]:
        return self._model.encode(cache_key).tolist()
```
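One property worth noting: `sort_keys=True` canonicalises the JSON, so argument dicts that differ only in key order produce the same cache key and hash to the same fast-path entry. A quick stdlib-only illustration mirroring `build_cache_key`/`hash_key` above:

```python
import hashlib
import json


def build_cache_key(tool_name: str, arguments: dict) -> str:
    # sort_keys=True canonicalises the JSON so key order doesn't matter
    return f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"


def hash_key(cache_key: str) -> str:
    return hashlib.sha256(cache_key.encode()).hexdigest()


a = hash_key(build_cache_key("stoa_list_apis", {"tenant": "acme", "page": 1}))
b = hash_key(build_cache_key("stoa_list_apis", {"page": 1, "tenant": "acme"}))
assert a == b  # same entry despite different argument order
```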
## Configuration
| Parameter | Default | Description |
|---|---|---|
| `ttl_seconds` | 300 (5 min) | Cache entry lifetime |
| `similarity_threshold` | 0.95 | Minimum cosine similarity |
| `embedding_model` | all-MiniLM-L6-v2 | Sentence transformer model |
| `embedding_dimension` | 384 | Vector dimension |
## Tenant Isolation
Strict tenant isolation via SQL WHERE clause:
```sql
WHERE tenant_id = :tenant_id  -- Always present
```

Cache entries are:
- Keyed by `(tenant_id, key_hash)`
- Queried only within the tenant boundary
- Never shared across tenants
## GDPR Compliance
### Data Retention
- TTL enforced via the `expires_at` column
- Background cleanup job runs every hour
- Hard delete within 24 hours of expiration
```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession


async def cleanup_expired(session: AsyncSession) -> int:
    """Delete expired entries (GDPR: no soft delete)."""
    result = await session.execute(
        text("""
            DELETE FROM semantic_cache
            WHERE expires_at < NOW() - INTERVAL '24 hours'
        """)
    )
    return result.rowcount
```
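The retention rules compose as follows: an entry written at time `t` stops being servable at `t + ttl` and becomes hard-deletable once it has been expired for 24 hours. A small sketch of that timeline; the helper names are illustrative, not the production job:

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(seconds=300)             # ttl_seconds = 300
HARD_DELETE_AFTER = timedelta(hours=24)  # GDPR hard-delete window


def is_servable(expires_at: datetime, now: datetime) -> bool:
    """Lookup returns an entry only while expires_at > now."""
    return expires_at > now


def is_hard_deletable(expires_at: datetime, now: datetime) -> bool:
    """Mirrors: DELETE ... WHERE expires_at < NOW() - INTERVAL '24 hours'."""
    return expires_at < now - HARD_DELETE_AFTER


created = datetime(2026, 2, 6, 12, 0, tzinfo=timezone.utc)
expires = created + TTL  # servable until 12:05; hard-deleted after 12:05 next day
```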
### No PII in Cache
Cache keys contain:
- Tool name (not PII)
- Arguments (may contain tenant IDs and API names, but not user PII)
Response payloads may contain business data but:
- Are tenant-isolated
- Auto-expire (TTL)
- Are deleted on request
## Performance
### Fast Path
- O(log n) B-tree lookup on the `(tenant_id, key_hash)` unique index
- Sub-millisecond lookup in practice
### Semantic Path
- HNSW index on embedding column
- ~10ms for similarity search (depends on dataset size)
- Falls back only when fast path misses
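The semantic path's recall/latency trade-off is tunable per session via pgvector's `hnsw.ef_search` parameter (default 40). A hedged sketch, with illustrative values: note that for the planner to use the HNSW index, the `ORDER BY` should be on the distance operator itself (ascending distance), not a derived `similarity DESC` expression; the ≥ 0.95 threshold can then be applied to the returned similarity.

```sql
-- pgvector >= 0.5.0; higher ef_search = better recall, slower search
SET hnsw.ef_search = 100;

SELECT response_payload,
       1 - (embedding <=> :embedding::vector) AS similarity
FROM semantic_cache
WHERE tenant_id = :tenant_id
  AND tool_name = :tool_name
  AND expires_at > :now
ORDER BY embedding <=> :embedding::vector  -- index-friendly ordering
LIMIT 1;
```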
### Metrics
| Metric | Target | Observed |
|---|---|---|
| Fast path hit rate | > 70% | 78% |
| Semantic hit rate | > 15% | 12% |
| P99 lookup latency | < 50ms | 23ms |
| Cache size per tenant | < 100MB | 45MB avg |
## Consequences
### Positive
- **Reduced Latency**: cached responses returned in < 50 ms
- **Cost Savings**: fewer backend API calls, lower token consumption
- **Better UX**: consistent responses for similar queries
- **Tenant Isolation**: no cross-tenant cache pollution
### Negative
- **Embedding Overhead**: ~10 ms to generate an embedding on a miss
- **Storage Growth**: each entry stores 384 floats (~1.5 KB as float32)
- **Model Dependency**: sentence-transformers is required at runtime
- **Stale Data Risk**: TTL may serve outdated responses
### Mitigations
| Challenge | Mitigation |
|---|---|
| Embedding latency | Fast path first; batch embedding on store |
| Storage | Aggressive TTL (5 min), scheduled cleanup |
| Model dependency | Bundled in Docker image, no runtime download |
| Stale data | Short TTL, cache invalidation on write |
## References
- `mcp-gateway/src/cache/semantic_cache.py`
- `mcp-gateway/src/cache/embedder.py`
- pgvector Documentation
- Sentence Transformers
- CAB-881: Semantic Cache Implementation
Standard Marchemalo: A 40-year veteran architect understands in 30 seconds