How does Reciprocal Rank Fusion (RRF) work?

RRF merges results from multiple ranked lists without requiring score normalization. Each document gets a score of 1/(k + rank) from each list, where k=60 is the standard smoothing constant. These scores are summed across lists and documents are re-ranked by total score. Because it operates on ranks rather than raw scores, RRF is robust to score distribution differences between BM25 and vector search.

When should I use a cross-encoder instead of a bi-encoder?

Use a cross-encoder for reranking, never for first-stage retrieval over large corpora. Cross-encoders process query and document jointly, enabling deep cross-attention — this produces more calibrated relevance scores but is too slow (O(N) transformer passes) to run over thousands of candidates. Bi-encoders encode query and document independently into fixed vectors, enabling fast approximate nearest-neighbor search. The standard architecture is: bi-encoder retrieves top 50-100, cross-encoder reranks to top 5-20.

Is Cohere Rerank worth the cost for production RAG?

Cohere Rerank-3 costs approximately $2 per 1,000 queries with a 50-document candidate set. For most enterprise RAG applications, the retrieval quality improvement — typically +5-15% MRR on top of hybrid search — justifies this cost compared to the cost of an LLM call on poorly-retrieved context. The main constraint is latency: Cohere adds 100-300ms of API round-trip. For latency-sensitive apps under 500ms SLAs, consider a self-hosted BGE-reranker-v2-m3 instead.

What is ColBERT and how does it differ from a cross-encoder?

ColBERT uses late interaction: query and document are encoded independently into token-level embedding matrices (not pooled into a single vector), then relevance is computed via MaxSim — the maximum dot product between each query token embedding and all document token embeddings, summed. This gives cross-encoder-level relevance quality at closer to bi-encoder latency, because the document matrices can be pre-computed and indexed. ColBERT's p50 latency is around 20-30ms for top-100 reranking vs 100-250ms for a standard cross-encoder.

When should I skip reranking entirely?

Skip reranking when your queries are simple, your documents are homogeneous, your latency budget is under 300ms total, or your corpus is small enough that retrieval recall is already near-perfect. Also skip it for conversational FAQs and single-topic knowledge bases where the top-1 BM25 hit is almost always correct. Reranking delivers the most value on technical documentation, multi-domain knowledge bases, long-tail queries, and corpora with semantically similar but factually different documents.

Hybrid Search and Reranking: Beyond Vector Similarity

The default RAG implementation most teams ship in 2026 looks like this: embed the query, run an approximate nearest-neighbor search over a vector index, return the top-k chunks, stuff them into a prompt. It is fast to build, easy to demo, and it works well enough on clean, semantically rich queries against a well-curated corpus. The problem is that real corpora are not clean. They contain product codes, part numbers, legal references, employee names, version strings, internal identifiers — tokens where exact match is the only correct match. Vector space hates these.

Dense retrieval compresses meaning into a fixed-dimensional vector. That compression is the source of its generalization power and its Achilles heel. Two documents that say "the Q3 2024 invoice INV-20240312 for client ACME Corp" and "the Q3 revenue figures" can end up at nearly the same cosine distance from a query about invoices, because they share semantic territory. But if the user asks specifically for INV-20240312, the first document is the only correct answer and the second is noise. A dense retriever will frequently surface the second because the semantic neighborhood is symmetric.

This is not an edge case. In any enterprise RAG system deployed over technical documentation, contracts, financial records, or support tickets, exact-match queries account for 20-40% of real production traffic. Shipping dense-only retrieval for these domains and then wondering why accuracy is poor is one of the most consistent failure patterns I observe. The fix is hybrid search and reranking, and it should be table stakes for any retrieval pipeline over technical content. (If your corpus is primarily PDFs with figures and tables rather than clean text, the picture changes — see multimodal RAG for the visual-token retrieval approach.) This article explains the full stack: BM25 mechanics, Reciprocal Rank Fusion, cross-encoder reranking, and a concrete pipeline architecture with real latency and cost numbers.

Why dense-only retrieval fails

Dense retrieval — bi-encoder models like text-embedding-3-large, E5-large, or BGE-m3 — maps queries and documents to vectors in a high-dimensional space and retrieves by cosine or dot-product similarity. This is excellent for capturing semantic equivalence: "automobile" and "car" land close together, "revenue" and "income" are neighbors. The model generalizes across paraphrases and conceptual synonyms.

The failure cases follow directly from the architecture:

Proper nouns and named entities. "Claude Sonnet 3.7" and "Claude Opus 4" are semantically similar (both are Anthropic models) but factually distinct. A user asking for Sonnet 3.7 release notes should not receive Opus 4 release notes. Dense models often conflate them because the surrounding semantic context is nearly identical.
Codes and identifiers. Invoice numbers, SKUs, regulatory article references (GDPR Article 17), API error codes, git commit hashes. The embedding model has seen these patterns during pre-training but treats similar-looking codes as interchangeable. INV-2024-001 and INV-2024-002 may be nearly equidistant from the query "invoice 2024" because their embedding representations differ only in a tiny subspace.
Rare and domain-specific terms. Technical jargon that appears infrequently in pre-training data gets poorly-calibrated embeddings. The model defaults to approximate representations based on morphological similarity or context, which can be very wrong. A query for "TPMS sensor recalibration" in an automotive knowledge base may surface generic sensor documentation instead of the specific recalibration procedure because the embedding model underweights "TPMS" as an uncommon token.
Negation and contrastive queries. "How to avoid memory leaks in Python" and "How to detect memory leaks in Python" will produce very similar query vectors. Vector space has poor geometry for negation and contrast — the not-relationship is not linearly encoded.

In benchmark terms, dense-only retrieval achieves Recall@10 of approximately 0.58-0.65 on heterogeneous enterprise corpora. Hybrid search (BM25 + dense) consistently brings this to 0.72-0.85. The gap is not noise — it is systematic, and it comes from exactly the failure modes described above.

Lesson learned

On a legal document RAG system we audited, 31% of failed retrievals were for specific article references (e.g., "Article L442-6 of the Commercial Code"). The dense model consistently ranked thematically-related articles higher than the exact one requested. Switching to hybrid search alone, without any reranking, recovered 24 of those 31 failure cases. BM25 treats "L442-6" as an exact token match and surfaces the right document first.

BM25: the sparse retrieval workhorse

BM25 (Best Match 25) is a bag-of-words ranking function derived from the probabilistic relevance model. It has been the backbone of search engines since the 1990s, including early Elasticsearch and Solr. Understanding its mechanics matters because its hyperparameters are tunable and the defaults are not always optimal for your domain.

The BM25 score for a document D given a query Q is:

BM25 scoring formula

score(D, Q) = Σ IDF(qᵢ) · [ f(qᵢ, D) · (k1 + 1) ]
                             ÷ [ f(qᵢ, D) + k1 · (1 - b + b · |D| / avgdl) ]

Where:
  qᵢ       = each query term
  f(qᵢ, D) = term frequency of qᵢ in document D
  |D|       = document length (in tokens)
  avgdl    = average document length in the corpus
  IDF(qᵢ)  = log((N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5) + 1)
  N        = total number of documents
  n(qᵢ)    = number of documents containing qᵢ
  k1       = term frequency saturation parameter (default: 1.2 - 1.5)
  b        = length normalization parameter (default: 0.75)

Two hyperparameters control BM25 behavior and are worth understanding:

k1 (term frequency saturation). Controls how much repeated occurrences of a term increase the score. At k1=0, the model becomes binary — term frequency doesn't matter at all, only presence. At high k1 values (2.0+), the model behaves more like raw TF-IDF and rewards repetition heavily. The standard default of k1=1.2 means the scoring is sublinear: a document mentioning a query term 10 times gets significantly higher scores than one mentioning it once, but not 10x higher. For short documents (chunks under 200 tokens), slightly higher k1 (1.5) can help because term repetition is a stronger signal in concise text.

b (length normalization). Controls how much document length penalizes scores. At b=0, no length normalization is applied — longer documents naturally win because they have more term occurrences. At b=1, full length normalization is applied. The default b=0.75 is a reasonable compromise for most corpora. If your chunks are uniform size (which they often are in RAG systems with fixed-size chunking), b has less effect. For variable-length documents — a mix of one-paragraph summaries and 20-page reports — tuning b down slightly (0.5-0.6) can help prevent long documents from dominating.

In practice, for RAG systems with fixed-size chunking (512-1024 tokens), BM25's default parameters (k1=1.2, b=0.75) are fine as a starting point. The real value of BM25 in hybrid search is not in hyperparameter tuning — it is in what it does that dense retrieval cannot: exact token matching with IDF weighting. A rare term that appears in only 3 documents in your 50,000-chunk corpus gets a very high IDF weight. When that term appears in the query, BM25 will surface those 3 documents at the top. Dense retrieval has no equivalent mechanism for this.

Reciprocal Rank Fusion: the right way to merge ranked lists

You have two ranked lists: one from BM25, one from your dense retriever. Now you need to merge them into a single ranked list that is better than either alone. The naive approach is to normalize both score distributions and take a weighted sum. This is fragile: BM25 scores are unbounded and depend heavily on corpus statistics, while cosine similarity scores cluster in [0, 1]. Any fixed weighting breaks when the corpus changes.

Reciprocal Rank Fusion (Cormack et al., 2009) sidesteps score normalization entirely. It operates on ranks, not scores:

RRF scoring formula

RRF_score(d) = Σ  1 / (k + rank_i(d))
               i∈retrievers

Where:
  d        = document being scored
  rank_i(d) = rank of document d in retriever i's result list
             (1-indexed; documents not in the list are omitted)
  k        = smoothing constant (default: 60)

The k=60 default is not arbitrary. Cormack et al. found empirically that k=60 provides a smooth transition between high-ranked documents (which get a bonus relative to very-low-ranked documents) and low-ranked documents (which all converge toward 0). Using k=60, the #1 result gets 1/61 ≈ 0.0164, the #10 result gets 1/70 ≈ 0.0143, and the #100 result gets 1/160 ≈ 0.0063. The score differences between top positions are small but consistent, which is what you want for fusion: you are combining agreement signals, not raw score magnitudes.

Here is a complete Python implementation that runs in 10 lines and handles an arbitrary number of retrieval systems:

RRF implementation (Python)

from collections import defaultdict

def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60
) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Usage: pass in result ID lists from BM25 and dense retrieval
bm25_results   = ["doc_42", "doc_7",  "doc_19", "doc_3"]
dense_results  = ["doc_19", "doc_42", "doc_8",  "doc_7"]

fused = reciprocal_rank_fusion([bm25_results, dense_results], k=60)
# fused = [("doc_42", 0.0327), ("doc_19", 0.0310), ("doc_7", 0.0290), ...]

A document that appears at rank 1 in the BM25 list and rank 3 in the dense list will outscore a document that appears at rank 2 in the dense list but not at all in the BM25 list. This cross-list agreement bonus is exactly what makes RRF effective: it promotes documents that multiple retrieval strategies agree are relevant, which is a strong signal of true relevance.

Lesson learned

Most vector databases now implement RRF natively. Qdrant, Elasticsearch (hybrid search), Weaviate, and Milvus all support hybrid queries with RRF fusion out of the box. You do not need to run BM25 and dense retrieval in separate systems and merge in application code unless your vector database does not support it. Prefer the native implementation — it runs the two searches in parallel and merges at the database layer, which is both faster and simpler.

Sparse-dense fusion approaches compared

RRF is not the only fusion strategy. There are three main approaches, each with different tradeoff profiles:

1. Reciprocal Rank Fusion (RRF). Rank-based, parameter-free (k=60 is a sensible default), robust to corpus changes, no score normalization required. This is the right default for almost every team. It is easy to implement, easy to reason about, and consistently delivers 10-25% MRR improvement over dense-only retrieval on technical corpora. The only downside is that it discards score magnitude information — a document at rank 1 with a BM25 score of 50 gets the same RRF contribution as one with a BM25 score of 5.

2. Linear combination (weighted sum). Normalize BM25 and dense scores to [0, 1], then compute alpha * dense_score + (1 - alpha) * bm25_score. The alpha parameter controls the tradeoff between lexical and semantic retrieval. This preserves score magnitude information, which can be valuable when one retriever consistently produces high-confidence scores. The problem: BM25 scores are corpus-dependent and change when documents are added. A fixed alpha calibrated on a 10,000-document corpus may behave differently after you add another 5,000 documents. Linear combination requires re-calibration whenever the corpus changes significantly, and it requires a held-out evaluation set to tune alpha. Use this only if you have the infrastructure to re-tune regularly.

3. Learned fusion. Train a small model (logistic regression, lightweight transformer) to predict relevance from BM25 score, dense score, and optionally other features (BM25 field-level scores, document metadata, query type). This is what production search engines at Google and Bing scale do. For most enterprise RAG systems, this is over-engineering: you need labeled training data, a separate training pipeline, and ongoing re-training. Worth considering only if you have tens of thousands of labeled query-document pairs and an established eval infrastructure.

The practical recommendation: start with RRF. Move to learned fusion only after you have exhausted retrieval architecture improvements (chunking, metadata filtering, query expansion) and have labeled data at scale.

Cross-encoder reranking: why two-stage is the standard

Hybrid search with RRF significantly improves retrieval recall — your correct document is in the top-20 results at a much higher rate. But Recall@20 is not your product metric. Your product metric is Recall@5 or Recall@3, because you are stuffing 3-5 chunks into a context window. A document at rank 18 might as well not exist.

This is what cross-encoder reranking solves. After hybrid search returns top-20 or top-50 candidates, a cross-encoder re-scores them with full query-document joint attention, producing a much more precise relevance ranking. The top-5 after reranking is substantially better than the top-5 from the original retrieval.

The architectural distinction between bi-encoders and cross-encoders is fundamental:

Bi-encoder (retrieval): Query and document are encoded independently into fixed vectors. Relevance is computed as vector similarity — a single dot product. This enables pre-computing all document vectors and running fast approximate nearest-neighbor search. Complexity is O(1) per query at inference time after indexing. But the model never sees the query and document together — it cannot compute cross-attention between query tokens and document tokens.
Cross-encoder (reranking): Query and document are concatenated and passed together through the transformer. Every query token can attend to every document token. This produces far more calibrated relevance scores but requires a full forward pass per (query, document) pair. Complexity is O(N) in the number of candidates, which is why cross-encoders are never used for first-stage retrieval over large corpora — the latency would be prohibitive.

The two-stage architecture — bi-encoder retrieves 50-100 candidates, cross-encoder reranks to top 5-20 — is the standard because it gets the best of both: the recall of fast approximate search, and the precision of deep relevance modeling. Hybrid search handles the recall problem (getting the right document into the candidate set). The cross-encoder handles the precision problem (getting it to rank #1).

In terms of measured improvement: hybrid search alone over dense-only typically gives +10-25% MRR on technical corpora. Adding a cross-encoder reranker on top of hybrid search gives another +5-15% MRR. The numbers are cumulative — the total improvement from dense-only to hybrid+rerank is typically +20-35% MRR on realistic enterprise benchmarks.

Here is a concrete example using Cohere Rerank-3:

Cohere Rerank API call (Python)

import cohere

co = cohere.Client("your-api-key")

# After hybrid search returns top-50 candidates
query = "TPMS sensor recalibration procedure Peugeot 308"
candidate_docs = [doc.text for doc in hybrid_search_results[:50]]

rerank_response = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=candidate_docs,
    top_n=5,
    return_documents=True
)

# rerank_response.results is sorted by relevance_score descending
top_5_chunks = [r.document.text for r in rerank_response.results]

A few notes on this implementation. The top_n=5 parameter tells Cohere to return only the top 5 after reranking — you still pass all 50 candidates for scoring. Passing more candidates improves reranking quality marginally but increases latency and cost linearly. In practice, 20-50 candidates is the sweet spot. The return_documents=True flag returns the document text alongside the score — useful for direct injection into the prompt without an extra lookup.

Top reranker models compared

As of May 2026, the main options for production reranking are:

Model	Provider	Params	Latency (top-20)	Cost / 1K queries
rerank-v3.5	Cohere	Undisclosed	100–300ms (API)	~$2 (50 docs)
rerank-2	Voyage AI	Undisclosed	80–200ms (API)	~$0.05 (per 1M tokens)
BGE-reranker-v2-m3	BAAI (OSS)	568M	50–150ms (GPU)	Self-hosted
rerank-v3	Mixedbread	~435M	40–120ms (GPU)	Self-hosted / API
ms-marco-MiniLM-L-6	cross-encoder (OSS)	22M	15–40ms (CPU)	Self-hosted

Some practical notes on each:

Cohere rerank-v3.5. Highest quality ceiling among API options, strong multilingual support, handles long documents well. The $2/1K queries cost is for 50-document candidate sets — with 20 candidates it drops to around $0.80/1K. The primary tradeoff is API round-trip latency (100-300ms) plus vendor dependency. For production systems with tight latency SLAs, measure this carefully before committing. Cohere's rerank model is also notably strong on financial, legal, and technical domain content where subtle relevance distinctions matter most.

Voyage rerank-2. Token-based pricing rather than per-query, which makes it cheaper for long documents and more expensive for very short ones. Strong performance on code and technical content. Voyage models are also well-calibrated with their embedding models — if you are already using voyage-3 embeddings, using voyage rerank-2 in the same pipeline gives a consistency advantage.

BGE-reranker-v2-m3. The strongest open-weight cross-encoder available as of mid-2026. Trained on MS MARCO and multilingual data, competitive with Cohere on most benchmarks. At 568M parameters, it requires a GPU for production latency targets — on a T4, expect 50-150ms for 20 candidates. If you are self-hosting your vector database and have GPU capacity, this is the best cost-quality tradeoff. Quantized versions (INT8) run in 30-70ms on CPU for smaller candidate sets.

Mixedbread rerank-v3. Strong multilingual performance, available both as an API and for self-hosting via HuggingFace. A good middle-ground option if you want more control than Cohere allows but fewer GPU resources than BGE-v2-m3 requires.

ms-marco-MiniLM-L-6. 22M parameters, runs on CPU in under 50ms for 20 candidates. Quality is meaningfully below the larger models but it is the lowest-friction path to adding reranking. Use this to validate that reranking improves your specific eval set before committing to a heavier model. If the MiniLM reranker doesn't improve your metrics, a larger cross-encoder probably won't either — the problem is elsewhere in your pipeline.

Lesson learned

A team I worked with added Cohere Rerank to a working RAG system and saw latency jump from 1.1s to 2.4s P95 in production. The reranker was adding 1.1s of API latency that they had only measured in their local test environment (VPN to US datacenter). Measure reranker latency from your actual deployment region before committing to an API-based option. BGE-reranker deployed on a GPU in the same datacenter as your inference stack will almost always beat a managed API on latency, at the cost of operational complexity.

ColBERT and late interaction as a third path

ColBERTv2 (Santhanam et al., 2022) introduces a third architectural paradigm that sits between bi-encoders and cross-encoders: late interaction. Understanding it requires understanding where bi-encoders lose information.

A standard bi-encoder pools the token embeddings of a document into a single vector. This pooling — typically mean pooling or CLS token — compresses all the document's meaning into a single point in embedding space. When you compute similarity against a query vector, you are asking "how similar is this document's overall meaning to this query's overall meaning?" That single vector cannot represent complex, multi-aspect documents faithfully.

ColBERT skips the pooling step. Instead, it produces one embedding vector per token for both the query and the document. Relevance is then computed via MaxSim:

ColBERT MaxSim scoring

score(Q, D) = Σ  max  ( Eqᵢ · Edⱼ )
              i  j∈D

Where:
  Eqᵢ  = embedding of query token i
  Edⱼ  = embedding of document token j
  ·     = dot product

For each query token, find the document token most similar to it.
Sum these maximum similarities across all query tokens.

This late interaction achieves several things simultaneously. Document embeddings can be pre-computed and stored (enabling indexing, like a bi-encoder). But at query time, the relevance computation is token-level — every query token is compared against every document token, giving ColBERT access to fine-grained semantic matching that a pooled bi-encoder misses. The tradeoff is storage: a document with 512 tokens requires 512 vectors instead of 1. For large corpora, this storage overhead can be significant (ColBERT's PLAID index format addresses this with compression).

In practical terms, ColBERT sits at an interesting operating point:

Latency: p50 latency of 20-30ms for top-100 reranking on GPU. This is 3-5x faster than a standard cross-encoder on equivalent hardware.
Quality: Competitive with cross-encoders on many benchmarks, though cross-encoders with full attention across the entire (query, document) context still edge it out on complex relevance judgments.
Operationally: More complex to deploy than a standard cross-encoder. You need the ColBERT-specific PLAID indexing infrastructure (RAGatouille is the simplest Python wrapper). If you are already running a standard vector database, adding a cross-encoder is simpler than adopting ColBERT.

When is ColBERT worth the operational complexity? When latency budget is tight (under 500ms total pipeline), corpus size is moderate (under 10M documents), and you cannot compromise on reranking quality. ColBERT is also a strong choice for code retrieval — the token-level matching captures identifier names and API signatures better than pooled embeddings.

For most teams, the decision tree is: try a standard cross-encoder first. If latency is acceptable, ship it. If latency is not acceptable, consider ColBERT before adding infrastructure complexity. If you need to scale to hundreds of millions of documents with sub-100ms total latency, that is a different problem entirely and requires a dedicated retrieval infrastructure discussion.

Latency budget management

The retrieve-then-rerank pattern has a well-defined latency profile that you need to model before you ship. In a typical production RAG pipeline, the latency budget breaks down as follows:

Latency budget by pipeline stage (P50 estimates)

Query embedding

10–30ms

Hybrid retrieval

20–60ms

Cross-encoder rerank

50–250ms

LLM generation

600–2000ms

A few observations from this breakdown:

First, the LLM generation dominates. In most configurations, LLM latency is 60-80% of total end-to-end latency. Optimizing the reranker from 200ms to 50ms will reduce total latency by 5-10%, not 50%. If your total latency is unacceptable, the LLM call is almost certainly the primary lever — smaller model, streaming, caching, or parallel prefill. Do not over-index on reranker optimization when the LLM is the bottleneck.

Second, reranking is the most variable stage. BM25 and vector search are fast and predictable. The reranker latency depends heavily on the number of candidates, the document length, and whether you are using an API (adds network RTT) or a local model (adds GPU scheduling). The standard retrieve-50-rerank-20 pattern is a practical tradeoff: retrieve 50 candidates for good recall, rerank to 20 for the cross-encoder, return top 5 to the LLM. This gives you approximately 50-200ms of reranking overhead depending on the model.

Third, query embedding is not free. If you are hitting an embedding API endpoint rather than running a local model, query embedding adds a network round-trip on every query. For high-throughput applications, running a local embedding model (e5-small or bge-small at 33M parameters) can save 20-50ms per query at the cost of marginally lower embedding quality.

The practical latency management recipe: instrument every stage from day one. Use LangSmith, Langfuse, or OpenTelemetry spans to measure P50 and P95 latency per stage. When latency exceeds your SLA, look at the stage breakdown — the problem is almost never where you expect it to be.

When to skip the reranker

Reranking is not free, and the cost is not always worth paying. There are specific scenarios where you should skip it:

Total latency budget under 400ms. If your end-to-end SLA including LLM generation is 500ms, and LLM generation takes 400ms, you have 100ms for everything else. A cross-encoder reranker will not fit. Use hybrid search with RRF and accept the retrieval quality tradeoff.
Simple, single-topic corpora. A FAQ knowledge base with 200 documents about one product. An HR policy document with 15 sections. When your corpus is small and semantically homogeneous, retrieval recall is already near-perfect and reranking provides no measurable improvement. Dense-only retrieval may even be sufficient.
High query volume with tight cost constraints. At 1 million queries per day with Cohere Rerank-3 at $2/1K queries and 50 candidates, you are spending $2,000/day on reranking alone. At that scale, a self-hosted BGE-reranker on GPU becomes economically rational even accounting for infrastructure costs. If you cannot afford either, skip the API reranker and invest in better hybrid retrieval instead.
Conversational memory retrieval. Retrieving from short-term conversation history to maintain context. The candidate set is small (last 20 messages), documents are short, and temporal recency is a stronger relevance signal than semantic similarity. A simple cosine threshold or recency-weighted retrieval outperforms a cross-encoder here.
When your eval set shows no improvement. Build your evaluation set first, run your baseline, then add reranking and measure the delta. If MRR@5 does not improve by at least 3-5% on your eval set, the reranker is not helping your specific retrieval problem. Investigate root cause before adding complexity. If you don't have a reliable eval set yet, see building custom LLM judges first — measuring retrieval quality is the foundation of every downstream decision.

Key decision rule

The reranker is a precision tool, not a recall tool. It cannot surface documents that were not in the candidate set from the first stage. If your hybrid search recall@50 is poor, adding a reranker will not fix it — you need better retrieval first. Always measure recall@50 from your first-stage retrieval before evaluating whether a reranker helps. If recall@50 is below 0.75 on your eval set, fix the retrieval stage before adding reranking complexity.

End-to-end pipeline architecture

Putting all of this together, here is the complete pipeline architecture that I recommend as a starting point for any RAG system over technical or heterogeneous enterprise content. For a broader discussion of when RAG is the right choice and how it fits into your stack, see our RAG technical guide.

Full pipeline architecture: query to context

Query pre-processing

Embed the query with your bi-encoder. Optionally apply query expansion (HyDE, synonyms) for sparse queries.

Parallel retrieval (BM25 + dense)

Run BM25 and ANN vector search concurrently. Retrieve top-50 from each. This is where the recall budget is set.

RRF fusion

Merge BM25 and dense ranked lists using RRF (k=60). Result: unified top-50 candidate set with cross-list agreement signal.

Cross-encoder reranking

Rerank top-20 to top-50 candidates with a cross-encoder. Joint query-document attention produces calibrated relevance scores. Return top-5 to top-10.

Context assembly and LLM generation

Inject top-5 chunks into the prompt, in relevance order. Pass to LLM. If using Claude, apply prompt caching to the system prompt prefix to reduce input token cost.

A few additional implementation details worth noting:

Metadata filtering before retrieval. If your corpus has document-level metadata (date, source, document type, department), apply hard filters before the retrieval step, not after. Filtering after retrieval wastes the recall budget — you may have retrieved 50 documents only to discard 30 of them on metadata grounds. Most vector databases support pre-filtering at the index query level. Use it.

Score thresholding on the reranker output. After reranking, check whether the top-1 document's cross-encoder score is above a minimum relevance threshold. If the best candidate scores below the threshold, the retrieval failed — consider returning a "I don't have enough information to answer this" response rather than generating a hallucinated answer from low-quality context. A threshold around 0.3-0.4 (normalized cross-encoder score) is a reasonable starting point.

Observability requirements. For each query, log: query text, BM25 top-5 doc IDs and scores, dense top-5 doc IDs and scores, post-RRF top-10 doc IDs and RRF scores, post-rerank top-5 doc IDs and cross-encoder scores, final injected chunk texts. This trace is what lets you debug retrieval failures in under 30 seconds. Without it, you cannot tell whether a poor answer was caused by BM25 missing the document, dense retrieval missing it, RRF not promoting it, or the cross-encoder demoting it. See our production RAG failure modes guide for more on observability requirements.

Iteration path. If you are starting from dense-only retrieval, implement in this order: (1) add BM25 with RRF, measure delta on your eval set, (2) add the MiniLM cross-encoder on top, measure delta, (3) upgrade to BGE-reranker-v2-m3 or Cohere if MiniLM delivers meaningful improvement and you need better quality. Each step should deliver a measurable improvement on your specific data. If a step does not, the bottleneck is elsewhere — chunking, metadata, query pre-processing, or the eval set itself.

For more on when to use agentic patterns on top of this retrieval stack — multi-hop queries, tool-augmented retrieval, and planning loops — see our article on Agentic RAG. Hybrid search and reranking solve the precision problem within a single retrieval call. Agentic patterns solve the multi-hop and reasoning problems that require multiple retrieval calls with planning.

If you want to understand the full landscape of embedding model choices for the dense retrieval stage — MTEB benchmarks, multilingual models, domain-specific fine-tuning — see our guide to embedding models in 2026 (forthcoming). For vector database selection — Qdrant vs Pinecone vs Weaviate vs pgvector trade-offs — see our vector database comparison (forthcoming).

Frequently asked questions

BM25 is a sparse lexical retrieval function that scores documents based on exact term overlap with the query, with term frequency saturation and document length normalization. It excels at exact matches — product codes, names, rare terms — but has no concept of semantic meaning: "car" and "automobile" are completely unrelated tokens. Vector search encodes both query and documents into a continuous embedding space and retrieves by nearest-neighbor similarity. It captures semantic equivalence but struggles when the exact token matters. Hybrid search combines both.

RRF assigns each document a score of 1/(k + rank) from each retrieval system, then sums those scores across systems. Documents that rank high in multiple systems accumulate higher combined scores. k=60 is the empirical default from Cormack et al. 2009 — it smooths the score curve so that top-ranked documents get a meaningful bonus but low-ranked documents don't get penalized to zero. Because RRF operates on ranks rather than raw scores, it requires no score normalization and is robust to corpus changes. This is why it is the standard fusion method.

Use cross-encoders for reranking a small candidate set (20-100 documents), never for first-stage retrieval over a large corpus. Cross-encoders process query and document jointly through the full transformer — deep cross-attention makes their relevance scores highly calibrated, but every (query, document) pair requires a separate forward pass. Bi-encoders encode independently and enable fast ANN search at O(1) query time. The two-stage pattern is always: bi-encoder for recall over the full corpus, then cross-encoder for precision over the top candidates.

Cohere Rerank-3 costs approximately $2 per 1,000 queries with 50 documents. For most enterprise RAG applications, the quality improvement (+5-15% MRR on top of hybrid search) easily justifies this compared to the cost of LLM calls on poorly-retrieved context. The main practical constraint is latency: Cohere adds 100-300ms of API round-trip. For apps with sub-500ms SLAs, a self-hosted BGE-reranker-v2-m3 on GPU is a better fit — same quality level, 50-150ms latency, higher operational complexity.

ColBERT uses late interaction: it encodes query and document into token-level embedding matrices separately (enabling pre-computation), then scores relevance via MaxSim at query time — the maximum dot product between each query token and all document tokens. This gives token-level matching quality close to a cross-encoder, at 3-5x lower latency (p50 around 20-30ms for top-100). ColBERT is worth the operational complexity when latency budget is tight but you cannot compromise on reranking quality. The main downside is storage overhead (one vector per token vs. one vector per document) and more complex deployment infrastructure.

Skip reranking when: your total latency budget has no room for 50-250ms of additional overhead; your corpus is small and homogeneous (a few hundred documents, single topic); your eval set shows no measurable improvement after adding a reranker; or your query volume makes API costs prohibitive and you cannot self-host. Always validate with your eval set first — if hybrid search alone achieves MRR@5 above 0.85 on your data, adding reranking may give you negligible returns.

Hybrid Search and Reranking: Beyond Vector Similarity

Why dense-only retrieval fails

BM25: the sparse retrieval workhorse

Reciprocal Rank Fusion: the right way to merge ranked lists

Sparse-dense fusion approaches compared

Cross-encoder reranking: why two-stage is the standard

Top reranker models compared

ColBERT and late interaction as a third path

Latency budget management

When to skip the reranker

End-to-end pipeline architecture

Frequently asked questions

Further reading