
# ADR-008: Semantic Response Caching — pgvector Strategy

## Metadata

| Field  | Value    |
|--------|----------|
| Status | Accepted |
| Date   | 2026-02-06 |
| Linear | CAB-881  |

## Context

MCP Gateway handles repeated tool invocations with similar (but not identical) inputs. For example:

  • "Get weather in Paris" vs "What's the weather in Paris?"
  • "List APIs for tenant acme" vs "Show me acme's APIs"

Traditional caches require exact key matches, missing semantically equivalent queries. This leads to:

  • Redundant API calls to backends
  • Increased latency for AI agents
  • Higher token costs when responses are re-generated

### The Problem

"How do we cache responses for queries that mean the same thing but have different wording?"

## Decision

Implement a two-path semantic cache using PostgreSQL with pgvector extension:

  1. Fast Path — SHA-256 exact match (O(1) lookup)
  2. Semantic Path — cosine similarity ≥ 0.95 on embeddings
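As a toy illustration of the semantic-path threshold check, the sketch below uses hand-made 3-dimensional vectors standing in for the real 384-dimensional embeddings; the vectors and names are illustrative, only the threshold (0.95) comes from this ADR:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


THRESHOLD = 0.95

# Toy vectors: two near-identical "queries" and one unrelated query.
weather_paris_1 = [1.0, 0.0, 0.2]
weather_paris_2 = [1.0, 0.1, 0.2]
list_apis = [0.0, 1.0, 0.0]

print(cosine_similarity(weather_paris_1, weather_paris_2) >= THRESHOLD)  # similar wording
print(cosine_similarity(weather_paris_1, list_apis) >= THRESHOLD)        # different intent
```

The first comparison clears the threshold (cache hit), the second does not (cache miss), which is the behaviour the semantic path relies on.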

## Architecture

```
┌──────────────────────────────────────────────────────────────┐
│ MCP Gateway                                                  │
│                                                              │
│  ┌──────────────┐                                            │
│  │ Tool Request │                                            │
│  │              │                                            │
│  │ tenant: acme │                                            │
│  │ tool: stoa_* │                                            │
│  │ args: {...}  │                                            │
│  └──────┬───────┘                                            │
│         │                                                    │
│         ▼                                                    │
│  ┌──────────────────────────────────────────────┐            │
│  │ SemanticCache                                │            │
│  │                                              │            │
│  │ 1. Build cache_key from (tool, args)         │            │
│  │ 2. Hash key → SHA-256                        │            │
│  │ 3. Fast path: exact hash match               │            │
│  │ 4. If miss: embed key → vector               │            │
│  │ 5. Semantic path: cosine similarity ≥ 0.95   │            │
│  │ 6. If hit: return cached response            │            │
│  │ 7. If miss: execute tool, store result       │            │
│  └──────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│ PostgreSQL + pgvector                                        │
│                                                              │
│ semantic_cache                                               │
│ ├── tenant_id        (text, partition key)                   │
│ ├── tool_name        (text)                                  │
│ ├── key_hash         (text, SHA-256)                         │
│ ├── embedding        (vector(384))                           │
│ ├── response_payload (jsonb)                                 │
│ ├── created_at       (timestamp)                             │
│ └── expires_at       (timestamp)                             │
│                                                              │
│ Indexes:                                                     │
│ - (tenant_id, key_hash) UNIQUE — fast path                   │
│ - HNSW on embedding — semantic similarity search             │
└──────────────────────────────────────────────────────────────┘
```
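The schema above could be created with DDL along the following lines. This is a sketch: the column names come from the diagram, the HNSW operator class follows the pgvector documentation for cosine distance, and the timestamptz choice and index names are assumptions:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE semantic_cache (
    tenant_id        text        NOT NULL,
    tool_name        text        NOT NULL,
    key_hash         text        NOT NULL,
    embedding        vector(384) NOT NULL,
    response_payload jsonb       NOT NULL,
    created_at       timestamptz NOT NULL DEFAULT now(),
    expires_at       timestamptz NOT NULL
);

-- Fast path: exact hash match within a tenant
CREATE UNIQUE INDEX semantic_cache_tenant_hash_idx
    ON semantic_cache (tenant_id, key_hash);

-- Semantic path: approximate nearest-neighbour search (cosine distance)
CREATE INDEX semantic_cache_embedding_idx
    ON semantic_cache USING hnsw (embedding vector_cosine_ops);
```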

## Implementation

### Two-Path Lookup

```python
import json
from datetime import datetime, timezone

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession


async def lookup(
    self,
    session: AsyncSession,
    tenant_id: str,
    tool_name: str,
    arguments: dict,
) -> dict | None:
    cache_key = self._embedder.build_cache_key(tool_name, arguments)
    key_hash = self._embedder.hash_key(cache_key)  # SHA-256
    now = datetime.now(timezone.utc)

    # --- Fast path: exact hash match ---
    result = await session.execute(
        text("""
            SELECT response_payload
            FROM semantic_cache
            WHERE tenant_id = :tenant_id
              AND key_hash = :key_hash
              AND expires_at > :now
            LIMIT 1
        """),
        {"tenant_id": tenant_id, "key_hash": key_hash, "now": now},
    )
    row = result.fetchone()
    if row is not None:
        return json.loads(row.response_payload)  # HIT: exact match

    # --- Semantic path: cosine similarity ---
    embedding = self._embedder.embed(cache_key)

    result = await session.execute(
        text("""
            SELECT response_payload,
                   1 - (embedding <=> :embedding::vector) AS similarity
            FROM semantic_cache
            WHERE tenant_id = :tenant_id
              AND tool_name = :tool_name
              AND expires_at > :now
              AND 1 - (embedding <=> :embedding::vector) >= :threshold
            ORDER BY similarity DESC
            LIMIT 1
        """),
        {
            "embedding": embedding,
            "threshold": 0.95,
            "tenant_id": tenant_id,
            "tool_name": tool_name,
            "now": now,
        },
    )
    row = result.fetchone()
    if row is not None:
        return json.loads(row.response_payload)  # HIT: semantic match

    return None  # MISS
```

### Embedding Strategy

Using sentence-transformers for semantic embeddings:

```python
import hashlib
import json

from sentence_transformers import SentenceTransformer


class Embedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(model_name)

    def build_cache_key(self, tool_name: str, arguments: dict) -> str:
        # sort_keys makes the key independent of argument ordering
        return f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"

    def hash_key(self, cache_key: str) -> str:
        return hashlib.sha256(cache_key.encode()).hexdigest()

    def embed(self, cache_key: str) -> list[float]:
        return self._model.encode(cache_key).tolist()
```
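The effect of `sort_keys` in `build_cache_key` can be checked in isolation with pure `json`/`hashlib` (no model needed); the tool name and arguments below are made up for illustration:

```python
import hashlib
import json


def build_cache_key(tool_name: str, arguments: dict) -> str:
    # Same construction as Embedder.build_cache_key
    return f"{tool_name}:{json.dumps(arguments, sort_keys=True)}"


def hash_key(cache_key: str) -> str:
    return hashlib.sha256(cache_key.encode()).hexdigest()


# Same arguments in a different order produce the same key...
k1 = build_cache_key("stoa_list_apis", {"tenant": "acme", "limit": 10})
k2 = build_cache_key("stoa_list_apis", {"limit": 10, "tenant": "acme"})

print(k1 == k2)                      # key is order-independent
print(hash_key(k1) == hash_key(k2))  # ...and therefore the same SHA-256, i.e. a fast-path hit
```

This is why the fast path already absorbs some superficial argument-order variation before the semantic path is ever consulted.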

## Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `ttl_seconds` | 300 (5 min) | Cache entry lifetime |
| `similarity_threshold` | 0.95 | Minimum cosine similarity |
| `embedding_model` | all-MiniLM-L6-v2 | Sentence transformer model |
| `embedding_dimension` | 384 | Vector dimension |
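These defaults might be grouped into a single config object; the class name and the settings mechanism below are assumptions (the ADR does not specify them), only the values come from the table:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticCacheConfig:
    ttl_seconds: int = 300                     # 5 minutes
    similarity_threshold: float = 0.95         # minimum cosine similarity
    embedding_model: str = "all-MiniLM-L6-v2"  # sentence transformer model
    embedding_dimension: int = 384             # vector dimension


config = SemanticCacheConfig()
```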

## Tenant Isolation

Strict tenant isolation via SQL WHERE clause:

```sql
WHERE tenant_id = :tenant_id  -- Always present
```

Cache entries are:

  • Keyed by (tenant_id, key_hash)
  • Queried only within tenant boundary
  • Never shared across tenants

## GDPR Compliance

### Data Retention

  • TTL enforced via expires_at column
  • Background cleanup job runs every hour
  • Hard delete within 24 hours of expiration

```python
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession


async def cleanup_expired(session: AsyncSession) -> int:
    """Delete expired entries (GDPR: no soft delete)."""
    result = await session.execute(
        text("""
            DELETE FROM semantic_cache
            WHERE expires_at < NOW() - INTERVAL '24 hours'
        """)
    )
    return result.rowcount
```

### No PII in Cache

Cache keys contain:

  • Tool name (not PII)
  • Arguments (may contain tenant IDs, API names — not user PII)

Response payloads may contain business data but:

  • Are tenant-isolated
  • Auto-expire (TTL)
  • Are deleted on request

## Performance

### Fast Path

  • O(1) via B-tree index on (tenant_id, key_hash)
  • Sub-millisecond lookup

### Semantic Path

  • HNSW index on embedding column
  • ~10ms for similarity search (depends on dataset size)
  • Falls back only when fast path misses
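If recall on the semantic path needs tuning, pgvector exposes an `hnsw.ef_search` session parameter (higher values trade latency for recall). An illustrative setting, not something this ADR prescribes:

```sql
-- Per-session trade-off: larger ef_search = better recall, slower search
SET hnsw.ef_search = 100;
```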

## Metrics

| Metric | Target | Observed |
|--------|--------|----------|
| Fast path hit rate | > 70% | 78% |
| Semantic hit rate | > 15% | 12% |
| P99 lookup latency | < 50ms | 23ms |
| Cache size per tenant | < 100MB | 45MB avg |
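The two hit rates compose, since the semantic path only sees fast-path misses. Assuming the table's semantic rate is measured over all lookups (the ADR does not say), simple arithmetic gives the overall picture:

```python
fast_hit = 0.78      # fraction of all lookups resolved by exact hash match
semantic_hit = 0.12  # fraction of all lookups resolved by similarity search

overall_hit = fast_hit + semantic_hit                  # ~0.90 of lookups avoid the backend
semantic_among_misses = semantic_hit / (1 - fast_hit)  # ~0.55 of fast-path misses rescued

print(f"overall hit rate: {overall_hit:.0%}")
print(f"semantic hit rate among fast-path misses: {semantic_among_misses:.1%}")
```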

## Consequences

### Positive

  • Reduced Latency — Cached responses returned in < 50ms
  • Cost Savings — Fewer backend API calls, lower token consumption
  • Better UX — Consistent responses for similar queries
  • Tenant Isolation — No cross-tenant cache pollution

### Negative

  • Embedding Overhead — ~10ms to generate embedding on miss
  • Storage Growth — Embeddings are 384 floats per entry
  • Model Dependency — sentence-transformers required
  • Stale Data Risk — TTL may serve outdated responses

## Mitigations

| Challenge | Mitigation |
|-----------|------------|
| Embedding latency | Fast path first; batch embedding on store |
| Storage | Aggressive TTL (5 min), scheduled cleanup |
| Model dependency | Bundled in Docker image, no runtime download |
| Stale data | Short TTL, cache invalidation on write |

Marchemalo standard: a 40-year veteran architect understands this in 30 seconds.