Vector Database Operations (Scale)

Vector DB
HNSW
Scaling
Multi-tenancy
Performance

Abstract

Vector databases are the long-term memory of AI applications, but they behave differently than relational databases at scale. A common production outage occurs when a team migrates from a prototype (10k vectors) to production (100M vectors) and realizes that filtered vector search—e.g., "Find similar documents owned by User A"—causes latency to spike from 20ms to 2000ms. This is the "Metadata Explosion." This post dissects the mechanics of Approximate Nearest Neighbor (ANN) search under constraints, focusing on HNSW parameter tuning, indexing strategies for metadata, and the critical security architecture required for rigorous multi-tenancy.

1. Why This Topic Matters

In a standard SQL database, forgetting an index slows things down. In a Vector Database, forgetting to handle metadata correctly can break the fundamental retrieval algorithm.

The core issue is Tenant Isolation. If you build a B2B RAG application, Client X must never retrieve Client Y's documents. If you rely on post-filtering (retrieving the top 100 matches globally, then filtering for Client X), you might end up with zero results if Client X's documents aren't in the global top 100. If you rely on brute-force scanning for Client X's docs, your latency explodes. You need an architecture that guarantees isolation and performance.

2. Core Concepts & Mental Models

  • HNSW (Hierarchical Navigable Small World): The standard indexing algorithm. Think of it as a multi-layered highway system. Top layers are interstates (fast, big jumps); bottom layers are local roads (precise, small steps).

  • The Filtering Problem:

  • Post-filtering: Search Global Vector Space → Filter Results. Risk: Low Recall (you might filter out all results).

  • Pre-filtering: Filter by Metadata → Search Remaining Vectors. Risk: If the filtered subset is small or scattered, the HNSW graph traversal breaks because the "highways" don't connect the remaining nodes.

  • Single-Stage Filtering (The Solution): Modern vector DBs use "Bitmasking" or "Block-based searching" to traverse the graph while ignoring blocked nodes efficiently.
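The post-filtering recall risk is easy to feel with a toy simulation: take the global top-100 by dot product over random vectors, then apply a tenant filter for a tenant that owns ~1% of the corpus. All numbers here are illustrative, not from any real database:

```python
import numpy as np

rng = np.random.default_rng(42)
N, DIM, K = 100_000, 64, 100

vectors = rng.random((N, DIM), dtype=np.float32)
# Tenant_A owns ~1% of the corpus, assigned independently of content
tenant_a_mask = rng.random(N) < 0.01

query = rng.random(DIM, dtype=np.float32)
scores = vectors @ query

# Post-filtering: take the GLOBAL top-100, then keep only Tenant_A's docs
global_top_k = np.argsort(scores)[-K:]
surviving = [i for i in global_top_k if tenant_a_mask[i]]

print(f"Global top-{K}: {K} results")
print(f"After tenant filter: {len(surviving)} results")  # expect ~1, often 0
```

Since tenant ownership is independent of similarity, the expected survivor count is roughly K × 1% ≈ 1 document, regardless of how many relevant documents Tenant_A actually has.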

3. Theoretical Foundations

HNSW Complexity: Search complexity is O(log N). However, if you apply a metadata filter that selects only a fraction F of the dataset:

  • Naive Scan: O(N) (brute force checking all N items for the tag).
  • Indexed Filter: Ideally approaches O(log(F·N)), but implementation details matter heavily.

Tuning Parameters:

  • ef_construction: Controls build quality. Higher = better graph, slower build.
  • ef_search: Controls search depth at query time. Higher = better recall, higher latency.
  • Trade-off: You buy Recall with Latency.
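To make the recall/latency trade concrete, here is a deliberately crude stand-in for ef_search: instead of traversing an HNSW graph, we score a random candidate set of size ef. A larger candidate set means higher recall and more distance computations. This is a toy model of "how many nodes you examine", not real HNSW traversal:

```python
import numpy as np

rng = np.random.default_rng(0)
N, DIM, QUERIES = 10_000, 32, 100
vectors = rng.random((N, DIM), dtype=np.float32)

def recall_at_1(ef: int) -> float:
    """Toy model: examine only `ef` random candidates per query."""
    hits = 0
    for _ in range(QUERIES):
        q = rng.random(DIM, dtype=np.float32)
        true_best = int(np.argmax(vectors @ q))          # exact answer
        candidates = rng.choice(N, size=ef, replace=False)
        approx_best = int(candidates[np.argmax(vectors[candidates] @ q)])
        hits += (approx_best == true_best)
    return hits / QUERIES

for ef in (100, 1_000, 9_000):
    print(f"ef={ef:>5}: recall@1 ~ {recall_at_1(ef):.2f}")
```

In this toy model recall scales as ef/N; real HNSW does far better than random sampling, but the shape of the trade-off (more candidates examined = higher recall = more latency) is the same.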

4. Production-Grade Implementation: Multi-Tenancy

There are two approaches to isolation:

  1. Namespaces (Physical Separation):
  • Mechanism: Each tenant gets a separate index/collection.
  • Pros: Perfect security isolation. Small graphs = fast search.
  • Cons: Resource-heavy. Managing 10,000 indices (one per user) is an operational nightmare.
  2. Metadata Filtering (Logical Separation):
  • Mechanism: One giant index. Every vector has tenant_id: "user_123". Search applies a filter.
  • Pros: Easy operations. Efficient resource sharing.
  • Cons: Requires Metadata Indexing. Without it, the DB performs a full table scan.

Recommendation: Use Logical Separation with strict Metadata Indexing for standard SaaS. Use Namespaces only for VIP tenants requiring physical isolation.
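The recommendation above can be sketched as a thin routing layer that decides, per request, whether a tenant hits a dedicated collection or the shared filtered index. The names and the VIP policy here are hypothetical, not any particular vendor's API:

```python
from dataclasses import dataclass, field

@dataclass
class SearchTarget:
    collection: str                  # which physical index to query
    filter: dict = field(default_factory=dict)  # metadata filter (empty = none needed)

# Hypothetical policy: VIP tenants get a dedicated collection
# (physical isolation); everyone else shares one index behind a filter.
VIP_TENANTS = {"acme_corp"}

def route(tenant_id: str) -> SearchTarget:
    if tenant_id in VIP_TENANTS:
        return SearchTarget(collection=f"tenant_{tenant_id}")
    return SearchTarget(collection="shared", filter={"tenant_id": tenant_id})

print(route("acme_corp"))   # dedicated namespace, no filter needed
print(route("user_123"))    # shared index + mandatory metadata filter
```

Centralizing this decision in one function also gives you a single place to audit: every query path must go through the router, so no code path can hit the shared index without a tenant filter.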

5. Hands-On Project / Exercise

Objective: Simulate the latency impact of "Unindexed" vs "Indexed" metadata filtering on a dataset of 100,000 vectors. We will demonstrate why unindexed filtering destroys performance.

Constraints:

  • Use numpy to simulate vector operations (no heavy DB setup required).
  • Compare "Scan-then-Compute" vs "Index-Lookup-then-Compute".

The Implementation

import numpy as np
import time
from typing import Dict, List

class VectorSimulator:
    def __init__(self, size=100_000, dim=128):
        print(f"Generating {size} vectors of dimension {dim}...")
        # 1. The Data Plane (Dense Vectors)
        self.vectors = np.random.rand(size, dim).astype('float32')

        # 2. The Metadata Plane
        # Assign 10% of data to "Tenant_A", rest to random tenants
        self.metadata = ["Tenant_A" if i < size * 0.1 else "Tenant_Other" for i in range(size)]
        self.metadata = np.array(self.metadata)

        # 3. The Inverted Index (The Optimization)
        # Maps "Tenant_A" -> [Index 0, Index 1, ...]
        self.inverted_index: Dict[str, List[int]] = {}
        for idx, tenant in enumerate(self.metadata):
            if tenant not in self.inverted_index:
                self.inverted_index[tenant] = []
            self.inverted_index[tenant].append(idx)

        print("Initialization complete.\n")

    def naive_filtered_search(self, query_vec, tenant_id):
        """
        Simulates unindexed filtering.
        We must iterate through ALL metadata to check the filter,
        then compute distances for the matches.
        """
        start = time.perf_counter()

        # Step 1: Scan ALL metadata (the bottleneck)
        matches = []
        for i in range(len(self.metadata)):
            if self.metadata[i] == tenant_id:
                matches.append(i)

        if not matches:
            return 0.0, 0

        # Step 2: Compute dot products for the matches only
        # (In reality, a vector DB might compute ALL distances then filter, which is even worse)
        subset_vectors = self.vectors[matches]
        scores = np.dot(subset_vectors, query_vec)

        # Top 1: map the local argmax back to a global row index
        best_global_idx = matches[int(np.argmax(scores))]

        duration = (time.perf_counter() - start) * 1000
        return duration, len(matches)

    def indexed_filtered_search(self, query_vec, tenant_id):
        """
        Simulates indexed filtering.
        We look up valid IDs in O(1), then only compute distances for those.
        """
        start = time.perf_counter()

        # Step 1: Index lookup (a hash-map hit here; in a real DB this is
        # a B-tree or bitmap posting-list lookup)
        candidate_indices = self.inverted_index.get(tenant_id, [])

        if not candidate_indices:
            return 0.0, 0

        # Step 2: Compute dot products only for the candidates
        subset_vectors = self.vectors[candidate_indices]
        scores = np.dot(subset_vectors, query_vec)

        # Top 1: map the local argmax back to a global row index
        best_global_idx = candidate_indices[int(np.argmax(scores))]

        duration = (time.perf_counter() - start) * 1000
        return duration, len(candidate_indices)

# --- Execution ---

sim = VectorSimulator(size=100_000)
query = np.random.rand(128).astype('float32')
target_tenant = "Tenant_A"

print(f"--- Searching for {target_tenant} (Subset size: ~10,000) ---")

# Scenario 1: Naive (Unindexed)
latency_naive, count_naive = sim.naive_filtered_search(query, target_tenant)
print(f"Naive Search:   {latency_naive:.4f} ms | Scanned Metadata: 100,000 items")

# Scenario 2: Indexed
latency_idx, count_idx = sim.indexed_filtered_search(query, target_tenant)
print(f"Indexed Search: {latency_idx:.4f} ms | Scanned Metadata: 0 items (Direct Lookup)")

factor = latency_naive / latency_idx
print(f"\n🚀 Speedup Factor: {factor:.1f}x")
print("(Note: In a distributed system with disk I/O, this gap widens to 100x-1000x)")

6. Ethical, Security & Safety Considerations

  • Cross-Tenant Leakage: The most severe risk in vector operations is a "Filter Bypass." If the query logic is search(filter=user_input), a malicious user might inject an empty filter or a wildcard.

  • Rule: Never trust the client to supply the filter. The backend must inject the tenant_id filter based on the authenticated session token (JWT).

  • Data Deletion Compliance: When a user requests GDPR deletion, deleting the vector is not enough. You must ensure the vector is removed from the HNSW graph (which usually requires a "soft delete" flag and periodic re-indexing/compaction) to prevent it from resurfacing.
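The "never trust the client" rule for filters can be sketched as a thin server-side wrapper: strip any tenant_id the client supplied, then inject the authoritative one from the authenticated session. The handler and field names here are hypothetical, not a specific framework's API:

```python
# Hypothetical request handler: the tenant filter is derived from the
# verified session, never from the request body.
def safe_search(session_tenant_id: str, user_query: dict) -> dict:
    # Drop any tenant filter the client may have supplied...
    client_filter = {k: v for k, v in user_query.get("filter", {}).items()
                     if k != "tenant_id"}
    # ...and inject the authoritative one from the authenticated session.
    client_filter["tenant_id"] = session_tenant_id
    return {"vector": user_query["vector"], "filter": client_filter}

# A malicious client tries to widen the filter to another tenant:
attack = {"vector": [0.1, 0.2], "filter": {"tenant_id": "victim_corp"}}
print(safe_search("user_123", attack)["filter"])
# {'tenant_id': 'user_123'}  -- the injected value always wins
```

The same principle applies to wildcard or empty filters: because the server always overwrites the tenant key, a missing or malformed client filter degrades to "search only my own data", never to "search everything".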

7. Business & Strategic Implications

  • The Cost of Recall: Achieving 99% recall requires checking more nodes (ef_search). For many business cases (e.g., recommenders), 90% recall is acceptable. For Legal RAG, 99% is mandatory. You must tune ef_search based on the business criticality of missing a document.
  • Infrastructure Sizing: Vector DBs are memory-hungry. They keep the HNSW graph in RAM for speed.
  • Rule of Thumb: 1 million vectors (768-dim) ≈ 4-5 GB of RAM. Plan your budget accordingly.
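The rule of thumb is easy to sanity-check with back-of-envelope arithmetic. The 1.5-2x graph-overhead factor below is a planning assumption, not a measured value for any specific engine:

```python
# Back-of-envelope RAM estimate for 1M vectors at 768 dims (float32)
n_vectors = 1_000_000
dim = 768
bytes_per_float = 4

raw_gb = n_vectors * dim * bytes_per_float / 1024**3
print(f"Raw vectors: {raw_gb:.2f} GB")          # ~2.86 GB

# HNSW adds per-node graph links (M neighbours per layer), IDs, and
# metadata; a 1.5-2x multiplier on the raw size is a common planning figure.
print(f"Planning estimate: {raw_gb * 1.5:.1f}-{raw_gb * 2:.1f} GB")
```

Quantization (e.g., int8 or binary vectors) cuts the raw term by 4x or more, which is why it is usually the first lever to pull when the RAM bill gets uncomfortable.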

8. Common Pitfalls & Misconceptions

  • "I don't need metadata, I'll just embed the metadata text": Bad idea. If you embed "Contract 2023", and search for "2023", the vector similarity is fuzzy. It might return "2022" if they are semantically close. For hard constraints (Dates, IDs, Categories), always use metadata filters, not vector similarity.
  • Ignoring Re-indexing: As you add/delete vectors, the HNSW graph degrades (becomes "tangled"). You must schedule periodic re-indexing optimization or performance will rot over months.

9. Prerequisites & Next Steps

  • Prerequisite: Basic understanding of dot product and vector embeddings.
  • Next Step: Sometimes RAG isn't enough, especially for formatting. Day 40 covers "The Fine-Tuning Pivot"—when to stop prompting and start training.

Coming Up Next

Day 40: The Fine-Tuning Pivot (Build vs. Buy) - Understanding PEFT, LoRA, and the strategic pivot from RAG to Fine-Tuning for specialized tasks.

10. Further Reading & Resources

  • Algorithm: Efficient and Robust Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs (Malkov & Yashunin).
  • Tool: Qdrant / Weaviate / Milvus / Pinecone (Compare their filtering architectures).
  • Concept: The Filtered Search Dilemma in HNSW.