Semantic Caching (The Cost Firewall)
Abstract
In high-volume RAG systems, query distribution follows a Power Law: 20% of unique queries account for 80% of traffic. Users repeatedly ask variations of "How do I reset my password?" or "What are the office hours?" Treating every one of these as a novel reasoning task is engineering malpractice. It burns GPU cycles, incurs unnecessary API costs, and degrades user experience with avoidable latency. This post introduces Semantic Caching—a layer that intercepts queries, compares their vector embeddings to previously answered questions, and serves cached responses for high-similarity matches, reducing latency from 2,000ms to 50ms.
1. Why This Topic Matters
The economic viability of an AI product often hinges on its "Token-to-Value" ratio. If you pay $0.03 to answer "How do I add a user?" the first time, that is an investment. If you pay $0.03 the next 10,000 times someone asks it, that is waste.
Beyond cost, Latency is the enemy of engagement. A semantic cache is the only way to achieve "instant" (sub-100ms) responses in an LLM architecture. It acts as a firewall, protecting your expensive inference infrastructure from the noise of repetitive user intent.
2. Core Concepts & Mental Models
- Exact vs. Semantic Match:
  - Exact Cache (KV Store): matches hash("reset password"). Fails on "reset my password".
  - Semantic Cache: matches distance(embed("reset password"), embed("I forgot my password")) < threshold.
- The Similarity Threshold (τ): this is your risk dial.
  - High τ (e.g., 0.98): Conservative. Only extremely close phrasings hit the cache. Low risk of wrong answers.
  - Low τ (e.g., 0.80): Aggressive. "What is the capital?" might match "What is the capitol?". High risk of false positives.
- Cache Hit Journey: User Query → Embedding → Vector Search (Cache) → Hit? → Return JSON.
- Cache Miss Journey: User Query → Embedding → Vector Search (Cache) → Miss → Retrieval → LLM → Write to Cache → Return JSON.
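To make the contrast concrete, here is a minimal sketch of both lookup styles. The toy 3-d vectors and the cached answer string are assumptions for illustration, not real embedding-model output:

```python
import hashlib
import math

# --- Exact cache: keyed on a hash of the raw string ---
exact_cache = {hashlib.sha256(b"reset password").hexdigest(): "Go to Settings > Security."}

def exact_lookup(query: str):
    return exact_cache.get(hashlib.sha256(query.encode()).hexdigest())

assert exact_lookup("reset password") is not None   # identical string: hit
assert exact_lookup("reset my password") is None    # one extra word: miss

# --- Semantic cache: keyed on vector distance (toy 3-d embeddings) ---
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

semantic_cache = [((0.9, 0.1, 0.1), "Go to Settings > Security.")]  # "reset password"
query_vec = (0.88, 0.12, 0.11)  # embedding for "I forgot my password" (assumed)

vec, answer = semantic_cache[0]
if cosine(query_vec, vec) >= 0.95:  # similarity clears the threshold: hit
    print("semantic hit:", answer)
```

The exact cache is defeated by a single extra word, while the semantic cache survives any rephrasing whose embedding stays inside the threshold radius.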
3. Theoretical Foundations
We utilize Cosine Similarity in high-dimensional space. Given a query vector q and a cached vector c, we compute:

similarity(q, c) = (q · c) / (‖q‖ ‖c‖)

The critical architectural decision is determining the optimal threshold. This is rarely a single number; it often requires calibration per domain. A coding assistant might tolerate lower similarity (code concepts overlap), whereas a medical bot requires near-exact matching (e.g., τ ≥ 0.98).
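As a quick numeric check of the formula, the toy vectors below (illustrative values, not real model output) show how same-intent queries score near 1.0 while an unrelated query falls far below any reasonable threshold; the per-domain cutoffs are example numbers, not prescriptions:

```python
import numpy as np

def cosine_similarity(q: np.ndarray, c: np.ndarray) -> float:
    # similarity(q, c) = (q . c) / (||q|| ||c||)
    return float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))

q  = np.array([0.9, 0.1, 0.1])    # "reset my password" (toy embedding)
c1 = np.array([0.88, 0.12, 0.1])  # "I forgot my password"
c2 = np.array([0.1, 0.9, 0.1])    # "what is my salary"

print(cosine_similarity(q, c1))  # near 1.0: same intent
print(cosine_similarity(q, c2))  # low: different intent, should never hit

# Per-domain calibration (illustrative values only):
THRESHOLDS = {"coding_assistant": 0.90, "support_bot": 0.95, "medical_bot": 0.98}
```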
4. Production-Grade Implementation
The Tenant Isolation Problem

The most dangerous vulnerability in semantic caching is data leakage.
- User A (CEO) asks: "What is the acquisition strategy?" Answer cached.
- User B (Intern) asks: "What is the acquisition strategy?" Cache Hit.
- Result: The intern sees the confidential answer meant for the CEO.
Solution: The cache key must be a composite of Embedding + TenantID + Role. We never search the global cache space; we search only within the user's permission boundary.
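One way to enforce that boundary is to derive a cache partition key from the caller's permission context before any vector search runs. This is a minimal sketch; the `tenant_id`/`role` fields and the hashing scheme are illustrative assumptions:

```python
import hashlib

def cache_namespace(tenant_id: str, role: str) -> str:
    """Derive a cache partition key from the caller's permission context.
    Vector search is then restricted to this partition only, so a lookup
    can never cross into another tenant's (or role's) cached answers."""
    return hashlib.sha256(f"{tenant_id}:{role}".encode()).hexdigest()

# CEO and intern land in different partitions: no cross-role leakage.
assert cache_namespace("acme", "ceo") != cache_namespace("acme", "intern")
# Same tenant + role -> same partition, so legitimate reuse still works.
assert cache_namespace("acme", "intern") == cache_namespace("acme", "intern")
```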
5. Hands-On Project / Exercise
Objective: Implement a SecureSemanticCache that intercepts queries. It must differentiate between a "Miss" (slow path) and a "Hit" (fast path) for semantically similar but syntactically different questions, while enforcing tenant isolation.
Constraints:
- Uses a localized vector store (simulating Redis/Qdrant).
- Enforces a strict similarity threshold.
- Demonstrates isolation: Similar questions from different tenants must NOT hit.
The Implementation
```python
import time
import numpy as np
from typing import List, Dict, Tuple

# --- Mock Infrastructure ---
class MockEmbeddingModel:
    """
    Simulates embeddings.
    Returns similar vectors for semantically similar text.
    """
    def embed(self, text: str) -> np.ndarray:
        # Simplified simulation:
        # We bucket queries by keyword, then add length-based noise,
        # to simulate 'closeness' for the demo.
        text = text.lower().strip()
        # Base vector for "reset password" related concepts
        if any(x in text for x in ["reset", "forgot", "password"]):
            base = np.array([0.9, 0.1, 0.1])
        # Base vector for "salary" related concepts
        elif "salary" in text:
            base = np.array([0.1, 0.9, 0.1])
        else:
            base = np.array([0.1, 0.1, 0.9])
        # Add slight noise based on length to make vectors distinct but close
        noise = len(text) * 0.001
        return base + noise

class MockLLM:
    def generate(self, prompt: str) -> str:
        time.sleep(1.0)  # Simulate expensive API call
        return f"Generated Answer for: {prompt}"

# --- The Semantic Cache System ---
class SecureSemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.encoder = MockEmbeddingModel()
        self.llm = MockLLM()
        self.threshold = threshold
        # Storage format: { tenant_id: [(vector, answer, original_query), ...] }
        self.cache: Dict[str, List[Tuple[np.ndarray, str, str]]] = {}

    def _cosine_similarity(self, v1: np.ndarray, v2: np.ndarray) -> float:
        norm_v1 = np.linalg.norm(v1)
        norm_v2 = np.linalg.norm(v2)
        if norm_v1 == 0 or norm_v2 == 0:
            return 0.0
        return float(np.dot(v1, v2) / (norm_v1 * norm_v2))

    def query(self, user_query: str, tenant_id: str) -> str:
        start_time = time.time()
        query_vec = self.encoder.embed(user_query)

        # 1. Check Cache (Scoped to Tenant)
        if tenant_id in self.cache:
            best_score = -1.0
            best_answer = None
            for cached_vec, cached_ans, _ in self.cache[tenant_id]:
                score = self._cosine_similarity(query_vec, cached_vec)
                if score > best_score:
                    best_score = score
                    best_answer = cached_ans
            # 2. Return if Hit
            if best_score >= self.threshold:
                latency = (time.time() - start_time) * 1000
                print(f"[{tenant_id}] CACHE HIT ({best_score:.4f}) | Latency: {latency:.2f}ms")
                return f"[Cached] {best_answer}"

        # 3. Cache Miss - Call LLM
        print(f"[{tenant_id}] CACHE MISS. Calling LLM...")
        response = self.llm.generate(user_query)

        # 4. Write to Cache
        if tenant_id not in self.cache:
            self.cache[tenant_id] = []
        # Simple FIFO eviction or TTL would go here in production
        self.cache[tenant_id].append((query_vec, response, user_query))

        latency = (time.time() - start_time) * 1000
        print(f"[{tenant_id}] Served Fresh | Latency: {latency:.2f}ms")
        return response

# --- Execution ---
system = SecureSemanticCache(threshold=0.98)  # High threshold for safety

# Scenario A: Initial Query (Slow)
print("\n--- Query 1: Initial Ask ---")
system.query("How do I reset my password?", tenant_id="Tenant_A")

# Scenario B: Semantically Similar Query (Fast)
# "forgot" vs "reset" should be close enough in our mock embedding logic
print("\n--- Query 2: Different Phrasing ---")
system.query("I forgot my password, how to fix?", tenant_id="Tenant_A")

# Scenario C: Tenant Isolation Check
# Tenant B asks the SAME question. Should NOT hit Tenant A's cache.
print("\n--- Query 3: Different Tenant (Security Check) ---")
system.query("How do I reset my password?", tenant_id="Tenant_B")
```
6. Ethical, Security & Safety Considerations
- The "Context Poisoning" Risk: If a malicious user manages to inject a wrong answer into the cache (e.g., via a prompt injection that the LLM falls for), that wrong answer is now "canonical" for all future users.
  - Mitigation: Implement a "Write-Through" policy where cached entries are periodically re-validated, or only cache answers with high confidence scores.
- Privacy Leaks: As demonstrated, caching must respect Access Control Lists (ACLs). Caching a summary of a confidential document creates a bypass if the cache retrieval doesn't check document permissions. Rule: The cache key must include a hash of the user's permissions.
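The confidence-gated write policy can be sketched as follows. The `MIN_CONFIDENCE` cutoff, the TTL value, and the idea of scoring answers before caching are illustrative assumptions about how a pipeline might expose such a signal:

```python
import time

MIN_CONFIDENCE = 0.8   # only cache answers the pipeline is confident in (assumed score)
TTL_SECONDS = 3600     # stale entries expire and are re-validated on the next miss

cache = {}  # key -> (answer, expiry timestamp)

def maybe_cache(key: str, answer: str, confidence: float) -> bool:
    """Write gate: low-confidence answers are never cached,
    so an uncertain (or poisoned) response cannot become canonical."""
    if confidence < MIN_CONFIDENCE:
        return False
    cache[key] = (answer, time.time() + TTL_SECONDS)
    return True

def lookup(key: str):
    entry = cache.get(key)
    if entry is None:
        return None
    answer, expiry = entry
    if time.time() > expiry:  # expired: purge and force re-validation
        del cache[key]
        return None
    return answer

assert maybe_cache("q1", "answer", confidence=0.95) is True
assert maybe_cache("q2", "answer", confidence=0.5) is False
assert lookup("q1") == "answer" and lookup("q2") is None
```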
7. Business & Strategic Implications
- The 80/20 Rule: For a SaaS platform, semantic caching can reduce LLM bills by 30-50% with negligible quality impact when the threshold is tuned conservatively. It is one of the highest-ROI features you can build in MLOps.
- Handling "Freshness": If your underlying documents update (e.g., the policy changes), the cache is now lying.
- Strategy: You must implement Cache Invalidation. When Document_X is updated, find all cached vectors associated with Document_X and purge them. This requires tagging cache entries with source DocIDs.
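The DocID-tagging strategy can be sketched like this (the entry structure and field names are illustrative assumptions):

```python
# Each cache entry records which source documents its answer was built from.
cache_entries = [
    {"query": "What is the refund policy?", "answer": "...", "source_docs": {"Document_X"}},
    {"query": "What are office hours?",     "answer": "...", "source_docs": {"Document_Y"}},
]

def invalidate_for_document(doc_id: str) -> int:
    """Purge every cached answer derived from the updated document,
    so stale answers can never be served after a policy change."""
    global cache_entries
    before = len(cache_entries)
    cache_entries = [e for e in cache_entries if doc_id not in e["source_docs"]]
    return before - len(cache_entries)

purged = invalidate_for_document("Document_X")  # only Document_X entries removed
```

In a real vector store this is typically a metadata filter on a delete call rather than a list comprehension, but the contract is the same: every write must carry its source DocIDs or it cannot be invalidated later.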
8. Common Pitfalls & Misconceptions
- Global Caching: "Let's just cache everything globally to save money." This is a security disaster waiting to happen.
- Threshold Tuning: Setting the threshold too low (0.80) results in users getting answers to questions they didn't ask. This is confusing and erodes trust. Better to miss the cache than serve a false positive.
- Caching Short Queries: Short queries (e.g., "Hi") have low semantic density and can match random things. Often, it's better to use exact string matching for very short queries and semantic matching for long ones.
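The short-query routing rule might look like this minimal sketch; the word-count cutoff is an assumption to tune per product:

```python
EXACT_MATCH_MAX_WORDS = 2  # assumed cutoff: below this, semantic matching is too noisy

def choose_strategy(query: str) -> str:
    """Route short, low-density queries to exact string matching;
    longer queries carry enough semantic signal for vector lookup."""
    return "exact" if len(query.split()) <= EXACT_MATCH_MAX_WORDS else "semantic"

assert choose_strategy("Hi") == "exact"
assert choose_strategy("How do I reset my password?") == "semantic"
```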
9. Prerequisites & Next Steps
- Prerequisite: A vector database (Redis Stack, Qdrant, or pgvector).
- Next Step: We have secured the cost, but we need visibility into the execution. Day 36 covers "Observability for Chains"—using Distributed Tracing to debug black-box AI logic.
Coming Up Next
Day 36: Observability for Chains (Tracing) - implementing OpenTelemetry and Distributed Tracing to move beyond logging and visualize the exact execution path of LLM chains.
10. Further Reading & Resources
- Tool: GPTCache (Open source semantic cache library).
- Architecture: Redis Vector Search for Semantic Caching.
- Concept: Locality-Sensitive Hashing (LSH) for faster approximate matching.