Tensoria
RAG & Production AI By Anas R.

How to Optimize a RAG System: 5 Levers That Actually Move the Needle

You stood up a RAG system in an afternoon — LangChain, a vector store, a few PDFs. It looked great in the demo. Three months later, it's in production with thousands of documents, and users are complaining. Bad answers, missed retrievals, hallucinations on anything remotely complex. Sound familiar?

Optimizing RAG is not a prompt engineering problem. It is an architecture and data engineering problem. Swapping GPT-4 for Claude or tweaking your system prompt will not fix low recall. This article covers the five levers that actually move the needle — in the order you should apply them.

These are not theoretical patterns. They are the interventions we reach for first after auditing production RAG systems. Applied together, they have taken systems from ~67% to ~89% correct answers on internal benchmarks.

From demo to production: the technical levers for a robust, high-performance RAG system.

The 5 RAG optimization levers

  1. Hybrid Search (BM25 + vectors) — covers both semantic queries and exact-match lookups
  2. Document Parsing — OCR, table extraction, preserved structure — chunk quality starts here
  3. Semantic Chunking — split on meaning boundaries, not character counts
  4. Query Rewriting — reformulate user queries to maximize retrieval recall
  5. Reranking (cross-encoder) — re-score retrieval candidates before passing to the LLM

1. Hybrid Search: BM25 + vector retrieval

Vector search via embeddings is powerful, but it is not infallible. Many teams treat it as a complete solution and discard keyword search entirely. That is a mistake — especially in domains like technical documentation, e-commerce, or legal, where exact matches matter.

Why semantics alone fails

Dense retrieval excels at capturing global meaning and context ("running shoes for trails"). It struggles with out-of-distribution vocabulary and exact matches. If a user queries a specific product reference ("SKU-12345"), a rare acronym, or a proper noun, the embedding can dilute that specific signal into a vector that matches nothing precisely. The information is there; the retrieval just can't find it.

BM25 + vectors: complementary by design

  • BM25 (sparse retrieval): An improved TF-IDF. It does not care about meaning — it looks for exact keyword matches and their frequency. If a user types "Error 504 Gateway", BM25 will surface the exact document containing that string.
  • Vector search (dense retrieval): Handles synonyms and intent. It finds the answer even if the user says "the server isn't responding" without mentioning the error code.

The real gain is not running both in parallel — it is fusing the results intelligently via RRF (Reciprocal Rank Fusion). RRF combines rankings rather than raw scores, which sidesteps the problem of comparing BM25 scores to cosine similarities, and surfaces documents that rank well on both lexical and semantic relevance. This is typically the highest-ROI change you can make to a RAG system with low recall. Most production vector stores — Qdrant, Weaviate, Elasticsearch — have native hybrid search support. For a detailed breakdown of how these options compare, see our vector database comparison.
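
To make the fusion step concrete, here is a minimal RRF sketch in plain Python. The document IDs are made up and k=60 is the constant from the original RRF paper; in practice you would usually let your vector store's native hybrid mode do this for you.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs (one list per retriever) into one ranking.

    A document scores 1 / (k + rank) in each list it appears in; documents that
    rank well under both BM25 and vector search accumulate the highest totals.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists, best match first.
bm25_hits = ["doc_17", "doc_03", "doc_42", "doc_08"]    # exact-match strength
vector_hits = ["doc_03", "doc_42", "doc_99", "doc_17"]  # semantic strength

print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc_03 rises: strong on both axes
```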

For more on hybrid retrieval and how reranking plugs in on top, see our dedicated article on hybrid search and reranking.

Lesson learned

On a technical support RAG for a SaaS platform, switching from pure vector search to BM25+vector hybrid search with Qdrant halved the volume of escalated support tickets in the first two weeks. The system had been silently failing on all queries that contained exact error codes, version strings, and feature names — terms that were underrepresented in the embedding space.

2. Document Parsing: Garbage In, Garbage Out

The second pillar that teams consistently underestimate is parsing — everything that happens before chunking. The assumption that the LLM will compensate for poor input quality is wrong. Feed it garbage context and it will either hallucinate or hedge. Chunk quality is bounded by parse quality.

Extracting raw text with a basic Python script and pypdf is not enough for complex documents. PDFs are print formats, not data formats — they have no inherent notion of paragraphs, headings, or logical structure. This is the single biggest source of RAG failures on complex document corpora.

The Visual RAG era: tables, images, layouts

With vision-capable models like GPT-4o, Llama Vision, and specialized OCR models, the parsing problem has become more tractable — but it still requires deliberate engineering decisions.

Three document element types cause the most retrieval failures:

  • Tables: Flattening a table to plain text line by line destroys the column/value relationships. The LLM cannot reason over numbers presented as a text dump. Convert tables to Markdown or structured HTML to preserve spatial semantics.
  • Images and charts: Technical diagrams and financial charts often contain the answer. Either describe them textually via a VLM (captioning) or embed the image itself (multimodal embeddings). See our full treatment of this in Multimodal RAG: images, PDFs, and tables.
  • Complex layouts: Double-column layouts, sidebars, headers and footers — these pollute the text flow and create chunks that mix unrelated content mid-sentence.

Invest in layout-aware parsers before you invest in chunking sophistication. Tools worth evaluating: Unstructured, Docling, Azure Document Intelligence, or vision-based approaches (GPT-4o Vision, DeepSeek OCR). The target output is clean, structured Markdown — get there before you think about chunk boundaries.
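
As a starting point, here is a hedged sketch using the open-source unstructured library. The partition_pdf call and the hi_res / infer_table_structure options follow its documented interface, but verify parameter names against the version you install; the filename is illustrative. The point is to get structured elements with tables kept as HTML rather than a flattened text dump.

```python
from unstructured.partition.pdf import partition_pdf

# Layout-aware parsing: "hi_res" runs a layout-detection model and
# infer_table_structure keeps tables as HTML rather than flattened text.
elements = partition_pdf(
    filename="annual_report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

markdown_parts = []
for el in elements:
    if el.category == "Table":
        # Preserve column/value relationships for the LLM.
        markdown_parts.append(el.metadata.text_as_html or el.text)
    elif el.category == "Title":
        markdown_parts.append(f"## {el.text}")
    else:
        markdown_parts.append(el.text)

document_md = "\n\n".join(markdown_parts)
```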

3. Chunking: From Fixed-Size to Semantic Boundaries

Chunking is over-indexed in the literature relative to its actual ROI. Teams spend weeks on semantic chunking implementations when their real bottleneck is parsing or hybrid search. That said, chunking strategy does matter — and fixed-size chunking is a meaningful source of retrieval failures.

Fixed chunking is blind. Cutting a sentence mid-phrase, or separating a question from its answer across two chunks, produces embeddings of incoherent fragments. The retrieval system is searching for complete semantic units and finding pieces of them.

The embedding models themselves are a variable here — for a current comparison of model quality on retrieval tasks, see our embedding models 2026 guide.

Semantic chunking in practice

Semantic chunking replaces character-count splits with meaning-aware segmentation:

  • Use a small embedding model to compute similarity between consecutive sentences.
  • While topic similarity stays high, keep extending the chunk.
  • When similarity drops (topic shift), close the chunk and start a new one.

This keeps each chunk centered on a complete, coherent idea, which directly improves retrieval precision.
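
A minimal sketch of that loop, assuming sentence-transformers for the small embedding model; the 0.75 threshold and the model name are illustrative and should be tuned on your own corpus.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, closing a chunk when similarity drops."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Normalized embeddings, so the dot product is the cosine similarity.
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:   # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```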

High-ROI technique: Parent Document Retrieval

Decouple what you search from what you give the LLM. Index small, precise chunks for retrieval. When a small chunk matches, return its larger parent section to the LLM. Result: the retrieval precision of small chunks combined with the contextual richness of full sections at generation time. LlamaIndex has a built-in implementation. This single pattern often outperforms a full semantic chunking rewrite in terms of measurable quality delta.
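
A minimal in-memory sketch of the pattern. The embedding callable and the splitter are placeholders for whatever you already use; a production version would store the child vectors in your vector database and keep the child-to-parent mapping in the payload.

```python
from typing import Callable
import numpy as np

class ParentDocumentIndex:
    """Search over small chunks, but hand the LLM their larger parent sections."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self.embed = embed
        self.parents: list[str] = []                      # parent_id -> section text
        self.children: list[tuple[np.ndarray, int]] = []  # (chunk vector, parent_id)

    def add_sections(self, sections: list[str], split_small: Callable[[str], list[str]]):
        for section in sections:
            parent_id = len(self.parents)
            self.parents.append(section)
            for chunk in split_small(section):
                self.children.append((np.array(self.embed(chunk)), parent_id))

    def retrieve(self, query: str, top_k: int = 5) -> list[str]:
        q = np.array(self.embed(query))
        # Dot product as similarity (assumes normalized embeddings).
        ranked = sorted(self.children, key=lambda c: float(np.dot(q, c[0])), reverse=True)
        parent_ids: list[int] = []
        for _, pid in ranked[:top_k]:
            if pid not in parent_ids:  # several small chunks may share a parent
                parent_ids.append(pid)
        return [self.parents[pid] for pid in parent_ids]
```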

Before committing to a full semantic chunking rewrite: add contextual chunk headers first (document title, section heading, date prepended to each chunk), implement Parent Document Retrieval second, measure the delta. In most cases, those two changes are sufficient and the semantic chunking investment does not justify the engineering cost.
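
The contextual chunk header pattern is only a few lines; the field layout here is illustrative.

```python
def with_context_header(chunk: str, doc_title: str, section: str, date: str) -> str:
    """Prepend document-level context so the chunk is interpretable on its own."""
    return f"Document: {doc_title} | Section: {section} | Date: {date}\n{chunk}"

indexed_text = with_context_header(
    "Refunds are processed within 14 days of the return being received.",
    doc_title="Customer Service Policy",
    section="Returns and refunds",
    date="2024-03",
)
# Embed and index `indexed_text` instead of the raw chunk.
```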

4. Query Rewriting: Bridge the Semantic Gap

Even with a solid retrieval stack, the raw user query is often a poor retrieval key. This is the semantic gap problem: the document that answers a question often shares little or no vocabulary with it.

  • User query: "I can't log in, it just spins forever."
  • Answer document: "OAuth2 authentication server timeout resolution procedure."

These two are semantically related but lexically distant. Embedding the user query and searching directly often fails. Query rewriting uses an intermediate LLM to translate user intent into a retrieval-optimized form.

Three techniques, ordered by implementation complexity:

  1. Multi-Query Expansion: The LLM generates 3–4 variants of the question from different angles or with technical synonyms. Run all searches, deduplicate results. Casts a wider net without adding architectural complexity.
  2. HyDE (Hypothetical Document Embeddings): Ask the LLM to hallucinate an ideal answer (factually wrong is fine). Embed that hypothetical answer and use it as the retrieval query. You are now searching answer-against-answer rather than question-against-answer. Semantic proximity is typically much better.
  3. Query Decomposition: For complex, multi-part questions ("What was the revenue delta between Q2 and Q3 after the pricing change?"), decompose into sub-queries ("Q2 revenue?", "Q3 revenue?", "pricing change date?") and resolve sequentially. Pairs naturally with agentic RAG patterns for multi-hop retrieval.

Query rewriting adds latency (one extra LLM call per query) and cost. Start with Multi-Query Expansion — it is the simplest to implement and debug, and handles the majority of semantic-gap failures.
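
A minimal Multi-Query Expansion sketch, assuming the OpenAI Python SDK as the intermediate LLM; the model name and the prompt wording are illustrative, and any chat-capable model works.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_query(user_query: str, n_variants: int = 3) -> list[str]:
    """Ask an LLM for retrieval-oriented reformulations of the user query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system",
             "content": f"Rewrite the user's question as {n_variants} search queries "
                        "using technical synonyms and different angles. "
                        "Return one query per line, nothing else."},
            {"role": "user", "content": user_query},
        ],
    )
    variants = [line.strip()
                for line in response.choices[0].message.content.splitlines()
                if line.strip()]
    return [user_query] + variants

# Run retrieval once per variant and deduplicate by document ID,
# or fuse the per-variant rankings with the RRF function from lever 1.
queries = expand_query("I can't log in, it just spins forever.")
```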

5. Reranking: Precision Finishing

Reranking is the last lever to reach for — after hybrid search, parsing, and chunking are solid. It is the cherry on top, not the foundation. Teams who add reranking before fixing retrieval fundamentals are polishing a broken pipeline.

Bi-encoder vs. cross-encoder

Your vector store uses bi-encoders: documents are pre-embedded independently, similarity is computed as dot product or cosine. Fast — but approximate. The score does not account for the specific interaction between the query and each document.

A cross-encoder reranker (Cohere Rerank, bge-reranker-v2-m3) reads the query and each candidate document together in a single forward pass and outputs a precise relevance score. It cannot be used for initial retrieval — it is too slow — but it is ideal as a second-stage filter.

The optimized pipeline:

  1. Retrieve broad: Pull top-50 candidates from the vector store (fast).
  2. Rerank: The cross-encoder scores all 50 against the query.
  3. Top-K: Pass only the top-5 to the LLM.

This adds latency — 800ms to 1.5s is common for cross-encoders at query time. Profile it before shipping. If latency is a hard constraint, consider a faster bi-encoder reranker as a compromise, or batch async execution. The payoff: documents that should rank first but ended up at position 15 get correctly surfaced.
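
A minimal sketch of the retrieve-then-rerank stage, assuming the sentence-transformers CrossEncoder wrapper; the model name is illustrative, and Cohere Rerank via API is the hosted alternative.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # cross-encoder reranker

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Second-stage filter: score each (query, document) pair jointly."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# candidates: the top-50 chunks returned by the first-stage hybrid retrieval.
# top_docs = rerank("Error 504 Gateway on checkout", candidates, top_k=5)
```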

Evaluation: How to Know If Any of This Is Working

None of the above matters without measurement. Do not trust intuition. Do not trust spot checks. Build an evaluation pipeline before you optimize.

Use RAGAS, TruLens, or DeepEval to track three core metrics:

  • Context Precision: Of the retrieved chunks, how many were actually needed to answer the question?
  • Context Recall: Did the system retrieve all the information needed to answer correctly?
  • Faithfulness: Does the LLM answer stay within the bounds of the retrieved context, or does it hallucinate?

The iteration loop: experiment (change chunk size, add hybrid search), evaluate (run RAGAS), fix, repeat. Always start by analyzing real failure cases from production logs — understand the errors before reaching for technical solutions.
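
A minimal sketch of that loop with RAGAS, assuming its classic evaluate interface and dataset schema (question, contexts, answer, ground_truth); check the column names against the version you install.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness

# One row per test question: the query, the retrieved chunks,
# the generated answer, and a reference answer.
eval_data = Dataset.from_dict({
    "question": ["How long do refunds take?"],
    "contexts": [["Refunds are processed within 14 days of the return being received."]],
    "answer": ["Refunds are processed within 14 days."],
    "ground_truth": ["Refunds take up to 14 days after the return is received."],
})

result = evaluate(eval_data, metrics=[context_precision, context_recall, faithfulness])
print(result)  # per-metric scores; re-run after every pipeline change
```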

For teams whose generic RAGAS scores stop correlating with user satisfaction, the next step is building domain-specific evaluators. We cover that in Building custom LLM judges. For a broader view on structured evaluation methodology, see our article on structured outputs in LLM production.

Lesson learned

A team we audited had retrieval recall@5 at 0.91 and a 34% user dissatisfaction rate. After adding faithfulness eval, we found the LLM was ignoring retrieved context in 28% of responses — generating plausible but wrong policy details. The retrieval was fine. The issue was upstream in the system prompt, and it was invisible without end-to-end eval. Retrieval metrics are necessary but not sufficient.

What to Tackle When Optimization Is Not Enough

These five levers cover most production RAG issues. But some failure modes require architectural shifts rather than optimization. When your retrieval pipeline is solid and you still hit ceilings, several directions are worth exploring:

  • Agentic RAG: Give an LLM agent a retrieval tool and let it plan multi-step retrieval strategies. Handles complex multi-hop queries that single-shot retrieval cannot. Full breakdown in our Agentic RAG article.
  • Fine-tuning the LLM: When the domain vocabulary is so specialized that even good retrieval cannot close the gap. See our comparison of fine-tuning vs. RAG vs. prompting for when each approach makes sense, and our LoRA/QLoRA fine-tuning guide for implementation details.
  • Self-hosted architecture: When cost and data sovereignty become constraints at scale, see our guide to self-hosted RAG architecture.
  • Multi-agent orchestration: When a single RAG agent is not enough and you need coordinated retrieval across systems — multi-agent orchestration comparison covers the main frameworks.

Talk to an engineer

Need to diagnose a RAG system that isn't performing? We run structured AI audits and fix production issues in 2–4 weeks.

Book a call

Frequently Asked Questions on RAG Optimization

Why does my RAG work fine in demo but give bad answers in production?

The classic scale problem. Demo uses 3 clean PDFs; production has thousands of files. Issues surface simultaneously: low recall (obvious info gets missed), hallucinations increase, and vector search fails on exact-match lookups (SKUs, error codes, proper nouns). Root causes: insufficient parsing of complex documents, fixed chunking that destroys context, and missing hybrid search.

What is hybrid search and why is it the first thing to add?

Hybrid search combines BM25 (keyword search) for exact term matching and vector search (semantic search) for intent and synonyms. Vector search alone misses exact matches — "SKU-12345", "Error 504 Gateway" — because it dilutes specific tokens into a vague vector. RRF fusion surfaces documents that rank well on both axes. It is usually the highest-ROI quick win for low-recall RAG.

How do I improve parsing of complex PDFs for RAG?

Chunk quality is bounded by parse quality. Raw text extraction with pypdf is insufficient for complex documents. Tables: convert to Markdown/HTML to preserve column/value structure. Images and charts: VLM captioning or multimodal embeddings. Complex layouts: use layout-aware parsers — Unstructured, Docling, Azure Document Intelligence, or GPT-4o Vision. Target output: clean Markdown before chunking starts.

What is semantic chunking and when does it actually help?

Semantic chunking uses an embedding model to detect topic shifts between consecutive sentences, closing chunks at meaning boundaries rather than character counts. It prevents splitting a question from its answer or cutting a sentence mid-phrase. The Parent Document Retrieval pattern is usually higher ROI: index small chunks for search, return larger parent sections to the LLM. Try that before committing to a full semantic chunking rewrite.

How does query rewriting improve RAG retrieval?

Query rewriting bridges semantic distance. "I can't log in, it just spins" does not lexically match "OAuth2 authentication server timeout resolution". An intermediate LLM rewrites user intent into a retrieval-optimized query. Three techniques: Multi-Query Expansion (3–4 variants merged), HyDE (embed a hypothetical answer, search answer-against-answer), Query Decomposition (split complex questions into sequential sub-queries).

What is RAG reranking and when should I use it?

Reranking is a precision finishing step — apply it after hybrid search, parsing, and chunking are solid. Your vector store uses fast Bi-Encoders (approximate similarity). A Cross-Encoder (Cohere Rerank, bge-reranker-v2-m3) reads query + document together for a precise relevance score. Pipeline: retrieve top-50 fast, reranker re-scores all 50, pass top-5 to the LLM. Adds 800ms–1.5s latency — profile before shipping.

Can I improve a RAG system without changing the LLM?

Yes. RAG optimization is not a prompt engineering or model-swapping problem. It is an architecture and data engineering problem. The five levers — hybrid search, document parsing, semantic chunking, query rewriting, reranking — can dramatically improve performance without touching the generation model. Most production RAG systems underperform because of retrieval pipeline issues, not LLM capability gaps.

In what order should I optimize RAG components?

Recommended order: (1) Hybrid search — highest-ROI quick win for recall; (2) Document parsing — fix Garbage In Garbage Out at the source; (3) Semantic chunking — preserve context for better retrieval; (4) Query rewriting — close semantic distance; (5) Reranking last — precision finishing, adds latency. Each step builds on the previous. Do not add complexity before getting the foundations right.

Anas Rabhi, Data Scientist & Founder, Tensoria

I am a data scientist specializing in generative AI. I help engineering teams and technical leaders ship production-grade AI systems tailored to their domain. Process automation, internal knowledge assistants, intelligent document processing — I design systems that integrate into existing workflows and deliver measurable results.