RAG Architecture II: Hybrid Search & Re-ranking
Abstract
Vector search is excellent at understanding concepts ("network issue") but terrible at distinguishing specifics ("Error 504" vs. "Error 503"). Because embeddings compress text into abstract semantic space, they often treat conflicting facts or distinct identifiers as "semantically identical." This leads to Semantic Drift, where the system confidently retrieves the wrong error code or a document that says the opposite of what was asked. This post implements the production standard for high-precision RAG: Hybrid Search (combining Keyword + Vector) followed by a Cross-Encoder Re-ranking step.
1. Why This Topic Matters
In Days 28-29, we learned that embeddings map meaning and vector databases enable fast retrieval. But sometimes, semantic meaning isn't enough.
- Query: "iphone 15 pro max case"
- Vector Result: "iphone 14 pro case" (Semantically 99% similar: both are phone cases for high-end Apple phones).
- User Experience: 0/10. The user needs an exact fit.
The Failure Mode: Your support bot retrieves the solution for "Error 503 (Service Unavailable)" when the user asked about "Error 504 (Gateway Timeout)." The embedding model sees "Server Error" and thinks they are the same. The user follows the wrong troubleshooting steps and churns.
2. Core Concepts & Mental Models
Sparse vs. Dense Vectors
- Dense Vectors (Embeddings): `[0.1, -0.5, ...]` captures context. Good for synonyms ("billing" matches "payment").
- Sparse Vectors (Keywords/BM25): `{"error": 1, "504": 1}` captures exact tokens. Good for identifiers, acronyms, and product SKUs.
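The contrast is easy to see in miniature. Here is a toy sketch (hand-made three-dimensional "embeddings" and raw token counts, not a real model or real BM25) of why dense similarity blurs "Error 503" and "Error 504" while sparse matching separates them:

```python
import math

def cosine(a, b):
    # Dense similarity: angle between embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sparse_overlap(query_terms, doc_terms):
    # Sparse similarity: count of exact token matches (a crude BM25 stand-in).
    return sum(min(query_terms[t], doc_terms.get(t, 0)) for t in query_terms)

# Toy dense vectors: "Error 503" and "Error 504" embed almost identically...
v_503 = [0.80, 0.59, 0.10]
v_504 = [0.81, 0.58, 0.11]
print(round(cosine(v_503, v_504), 3))  # ~1.0: dense search can barely tell them apart

# ...but sparse token counts separate them instantly.
q = {"error": 1, "504": 1}
print(sparse_overlap(q, {"error": 1, "503": 1}))  # 1: only "error" matches
print(sparse_overlap(q, {"error": 1, "504": 1}))  # 2: exact identifier match
```

The numbers are fabricated for illustration, but the shape of the failure is real: the identifiers that matter most to the user contribute almost nothing to the cosine score.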
Hybrid Search (Reciprocal Rank Fusion)
Don't choose one. Run both.
- Dense Retrieval: Get top 50 matches (captures concept).
- Sparse Retrieval: Get top 50 matches (captures keywords).
- Merge (RRF): Combine the lists. If a document appears in both, it gets a boosted score.
The Re-ranker (Cross-Encoder)
A standard embedding model (Bi-Encoder) calculates the query vector and document vector independently. This is fast but loses nuance (like negation).
A Cross-Encoder takes the query and document together as a single input pair: [CLS] Query [SEP] Document. It performs deep self-attention on the interaction between words. It is highly accurate but computationally expensive.
3. Required Trade-offs to Surface
| Strategy | Speed | Precision (Nuance) | Cost |
|---|---|---|---|
| Bi-Encoder (Standard Vector Search) | 50ms | Low on details/negations. | Low. |
| Cross-Encoder (Re-ranking) | 500ms+ | High. Understands logic/negation. | High (GPU intensive). |
| Hybrid (Vector + BM25) | 60ms | Medium. Catches keywords. | Low. |
The Decision: Use a Two-Stage Pipeline.
- Retrieve: Use Hybrid Search (Bi-Encoder + BM25) to fetch the top 50 candidates (Fast).
- Re-rank: Use a Cross-Encoder to sort those 50 and return the top 5 to the LLM (Precise).
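Wired together, the two stages can be sketched as follows. This is a minimal outline, not a production implementation: `dense_search`, `sparse_search`, and `rerank` are hypothetical callables standing in for your vector DB, BM25 index, and cross-encoder service.

```python
def two_stage_retrieve(query, dense_search, sparse_search, rerank,
                       n_candidates=50, n_final=5, k=60):
    """Stage 1: cheap hybrid recall. Stage 2: expensive precision re-rank."""
    # Stage 1: fuse dense and sparse candidate lists with Reciprocal Rank Fusion.
    fused = {}
    for ranked in (dense_search(query, n_candidates), sparse_search(query, n_candidates)):
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:n_candidates]
    # Stage 2: the cross-encoder only ever sees the small candidate set.
    return rerank(query, candidates)[:n_final]

# Toy stand-ins to show the control flow:
dense = lambda q, n: ["doc_a", "doc_b", "doc_c"]
sparse = lambda q, n: ["doc_c", "doc_d"]
rerank = lambda q, docs: sorted(docs)  # a real system calls the cross-encoder here
print(two_stage_retrieve("error 504", dense, sparse, rerank, n_final=3))
# -> ['doc_a', 'doc_b', 'doc_c']
```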
4. Responsibility Lens: Reliability
Hallucination Prevention starts at Retrieval. If your retriever returns a document that says "Do NOT reset the server" because it matched the keyword "reset," and the LLM misses the "NOT," you cause an outage. A Cross-Encoder is your safety net—it explicitly attends to the "NOT" and down-ranks the document if the query implies a positive action, or up-ranks it if it fits the safety constraint.
5. Hands-On Project: The Precision Re-ranker
We will demonstrate how a Cross-Encoder fixes the "Negation Blindness" of standard vector search.
Scenario: A user asks "What foods are NOT safe for dogs?"
Standard Vector Failure: The retriever surfaces "Apples are safe for dogs" because "safe," "dog," and "food" are semantically close to the query.
Step 1: Setup
You need the `sentence-transformers` package for this demo (`rank_bm25` is worth installing too for the sparse side of hybrid search).
from sentence_transformers import CrossEncoder
# 1. The Corpus (Conflicting Facts)
documents = [
"Chocolate is toxic to dogs and can cause death.", # Doc 0
"Apples are safe for dogs if seeds are removed.", # Doc 1
"Grapes cause kidney failure in dogs.", # Doc 2
"Cooked chicken is a healthy source of protein for dogs." # Doc 3
]
# 2. The Query
query = "What foods are NOT safe for dogs?"
# 3. Simulation: Vector Search Results (Bi-Encoder)
# Standard embeddings often struggle with "NOT".
# They might return "Apples are safe" because it shares "safe" and "dogs".
# Let's assume the Vector DB returned all 4 docs as candidates.
retrieved_docs = documents
Step 2: The Re-ranking Step
We use a Cross-Encoder trained on MS MARCO (web search relevance data). Unlike an embedding model, it scores each (query, document) pair directly; `predict` returns a raw relevance logit, so higher means more relevant (wrap it in a sigmoid if you need a 0-1 probability).
# Load a lightweight Cross-Encoder
# 'ms-marco-TinyBERT-L-2-v2' is fast enough for CPU production use
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2')
# Prepare pairs: (Query, Doc)
pairs = [[query, doc] for doc in retrieved_docs]
# Predict scores
scores = cross_encoder.predict(pairs)
# Zip and Sort
ranked_results = sorted(zip(scores, retrieved_docs), key=lambda x: x[0], reverse=True)
print(f"Query: {query}\n")
print("--- Cross-Encoder Re-ranking ---")
for score, doc in ranked_results:
    print(f"[{score:.4f}] {doc}")
Step 3: Analysis of Results
- Top Results (Expected): "Chocolate is toxic..." and "Grapes cause kidney failure..." receive the highest scores.
- Bottom Results: "Apples are safe..." and "Cooked chicken is healthy..." are pushed to the bottom, despite sharing surface vocabulary with the query. (Exact scores vary by model version.)
Why this matters: The Cross-Encoder understood that "NOT safe" aligns with "toxic" and "kidney failure," and contradicts "safe" and "healthy." A simple vector search relying on cosine similarity of the word "safe" would likely have ranked the Apple document much higher.
6. Ethical & Strategic Implications
- The "Black Box" of Re-ranking: Cross-encoders are harder to interpret. If a document is down-ranked, it's not simple math (dot product) anymore; it's a deep neural network decision.
- Cost of Inference: Adding a re-ranker adds ~200-500ms to your pipeline. For a real-time chatbot, this is acceptable. For a "Type-ahead" search bar, it is too slow.
- Strategy: Only trigger re-ranking for queries > 3 words or when the initial confidence scores of the top 3 vector matches are close to each other (ambiguity).
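That gating strategy fits in a few lines. A minimal sketch (the thresholds here are illustrative; tune them on your own traffic):

```python
def should_rerank(query, top_scores, min_words=4, ambiguity_gap=0.05):
    """Heuristic gate: only pay the cross-encoder cost when it buys precision."""
    # Longer queries carry nuance (negation, qualifiers) that bi-encoders miss.
    if len(query.split()) >= min_words:
        return True
    # If the top vector matches are nearly tied, the retriever is ambivalent.
    if len(top_scores) >= 2 and (top_scores[0] - top_scores[1]) < ambiguity_gap:
        return True
    return False

print(should_rerank("error 504", [0.91, 0.62]))                  # False: short + confident
print(should_rerank("what foods are not safe for dogs", [0.9]))  # True: long query
print(should_rerank("reset server", [0.80, 0.79]))               # True: ambiguous top hits
```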
7. Code Examples: Reciprocal Rank Fusion (RRF)
If you can't afford a Cross-Encoder, use RRF to merge BM25 and Vector results mathematically.
def reciprocal_rank_fusion(list_a_ranks, list_b_ranks, k=60):
    """
    Combines two ranked lists of document IDs by rank position.
    Each appearance contributes 1 / (k + rank), with rank starting at 1.
    """
    fused_scores = {}
    # A document that appears in both lists accumulates two contributions,
    # which is exactly the "boost" hybrid search relies on.
    for ranked_list in (list_a_ranks, list_b_ranks):  # e.g. vector results, keyword results
        for rank, doc_id in enumerate(ranked_list, start=1):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
8. Common Pitfalls
- Re-ranking Everything: Do not re-rank the entire database. You must retrieve a candidate set (Top 50) first. Cross-encoding 1 million documents takes hours.
- Ignoring Length: Cross-encoders have a token limit (usually 512). If you pass a 2,000-token document, it truncates. Ensure you re-rank chunks, not full PDFs.
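A rough guard against silent truncation, using whitespace-split words as a stand-in for tokens (a simplification: in production, count with the model's own tokenizer, since subword tokenization produces more tokens than words):

```python
def chunk_for_reranker(text, max_tokens=512, overlap=50):
    """Split long documents into overlapping word-window chunks so the
    cross-encoder scores every part instead of silently truncating."""
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks, step = [], max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = ("word " * 1000).strip()
print(len(chunk_for_reranker(doc)))  # 1000 words -> 3 overlapping chunks
```

Each chunk is re-ranked independently, and the document inherits the score of its best chunk.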
9. Next Steps
- Audit: Review your current retrieval. Are you losing precision on keyword-heavy queries (product SKUs, error codes)?
- Implement: Add BM25 (Sparse) alongside your Vector Search (Dense) and use RRF to merge.
- Upgrade: If precision is still low, deploy a dedicated Cross-Encoder service (e.g., a small Docker container running HuggingFace Inference).
Coming Up Next
Day 32: RAG Architecture III: Grounding & Citations - Enforcing factual fidelity through Citation-Backed Generation and Verification Loops to prevent hallucinations.