You have a RAG system running in production. It handles straightforward queries well: finding a clause in a contract, summarizing internal procedures, answering a technical FAQ. But as soon as a user asks a question that requires joining information across multiple documents, running a comparison, or following a chain of reasoning, the system falls apart.
That is not a bug. It is an architectural limitation of classic RAG. Single-shot retrieval cannot solve problems that require planning, iteration, and judgment. That is exactly what Agentic RAG provides: an agent that reasons before it retrieves, evaluates what it finds, and reformulates when the result is insufficient.
Here is what changes in practice, when it matters, and when classic RAG remains the better choice.
TL;DR
- Classic RAG does a single retrieval pass, then generates: fast and cheap, but insufficient for complex queries
- Agentic RAG adds an agent that plans, decomposes queries, routes across multiple tools, and iterates until it has a reliable answer
- Key patterns: query decomposition, self-reflection, tool use, multi-step reasoning
- Best-fit use cases: multi-document analysis, regulatory audits, due diligence, comparative synthesis
- Trade-offs: 5–30 second latency and 3–10× API cost, but substantially higher answer quality on complex queries
Why Classic RAG Hits a Ceiling
Classic RAG works in three steps: a user submits a query, the system retrieves relevant passages from a vector store, and an LLM generates an answer from those passages. It is a linear, single-shot pipeline.
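The single-shot pipeline fits in a few lines. This is a toy sketch, not a production implementation: `embed`, `retrieve`, and `generate` stand in for a real embedding model, vector store, and LLM call.

```python
import re

def embed(text: str) -> set[str]:
    # Toy "embedding": a bag of lowercase words (real systems use dense vectors).
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by word overlap with the query and keep the top k.
    return sorted(docs, key=lambda d: len(embed(d) & embed(query)), reverse=True)[:k]

def generate(query: str, passages: list[str]) -> str:
    # Stand-in for the LLM call: real code sends a prompt containing the passages.
    return f"Answer to {query!r} based on {len(passages)} passage(s)."

docs = [
    "Supplier A payment terms: net 30 days.",
    "Supplier B payment terms: net 60 days.",
    "Warranty policy: 24 months on electronic components.",
]
passages = retrieve("What are Supplier A's payment terms?", docs)
print(generate("What are Supplier A's payment terms?", passages))
```

Note that the pipeline runs exactly once: whatever `retrieve` returns is what `generate` sees, with no evaluation step in between.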
This works well for the majority of enterprise use cases. But it breaks systematically in three situations.
Multi-hop questions
A multi-hop question requires joining information from multiple documents to construct an answer. Example: "Which supplier offers the best cost/quality ratio for electronic components, accounting for lead times and warranty terms?"
Classic RAG retrieves the passages most similar to that query string. But the answer does not exist in any single document. You would need to compare supplier datasheets, cross-reference commercial terms, and synthesize. A single retrieval pass cannot do that work.
Ambiguous queries
When a user asks "What happens if we miss the deadline?", the system has no way to know whether they mean contractual penalties, regulatory exposure, or commercial consequences. Classic RAG retrieves passages most similar to the raw query without asking for clarification or exploring multiple interpretations.
Comparative analysis and synthesis
Comparing liability clauses across 15 contracts, or surfacing non-conformities from a regulatory audit spanning 200 pages of technical documentation: classic RAG cannot handle these tasks. It is bounded by its context window and, more fundamentally, it has no concept of planning a retrieval strategy.
What classic RAG cannot do
- Decompose a complex question into atomic sub-queries
- Evaluate whether retrieval results are sufficient to answer
- Reformulate a query when the first retrieval pass is inadequate
- Combine data from heterogeneous sources (documents + SQL + APIs)
- Iterate until a complete, verifiable answer is assembled
What Agentic RAG Is
Agentic RAG is an architecture where an AI agent orchestrates the entire retrieval and generation process. Instead of a linear pipeline, the agent operates in a reasoning loop: it plans, executes, evaluates, and corrects.
Concretely, here is what changes:
- Planning: the agent analyzes the question and decides on a retrieval strategy before issuing any search call.
- Decomposition: if the question is complex, the agent breaks it into independent sub-queries, each addressable by a targeted retrieval.
- Multi-source execution: for each sub-query, the agent selects the right tool (vector store, SQL query, API call) and fetches the information.
- Evaluation: the agent assesses whether results are sufficient. If a retrieved passage is off-topic or incomplete, it reformulates and retries.
- Synthesis: once all sub-answers are collected, the agent aggregates them into a coherent, sourced response.
The fundamental difference is that the agent can say "I don't have enough information; I need to search differently." Classic RAG always does a single pass and generates from whatever it found, even if that is insufficient.
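The plan/execute/evaluate/correct loop can be sketched as a skeleton. All three inner functions (`plan`, `search`, `is_sufficient`) are hypothetical stubs standing in for LLM and retrieval calls; the stub `search` deliberately fails on its first attempt so the correction path is visible.

```python
def plan(question: str) -> list[str]:
    # A real planner is an LLM call that emits a retrieval strategy.
    return [question]

def search(query: str, attempt: int) -> str:
    # Stub: pretend the first attempt misses and the reformulated one succeeds.
    return "relevant passage" if attempt > 0 else "off-topic passage"

def is_sufficient(passage: str) -> bool:
    # A real evaluator is an LLM judging relevance and completeness.
    return passage == "relevant passage"

def agentic_answer(question: str, max_attempts: int = 3) -> list[str]:
    evidence = []
    for sub_query in plan(question):                      # plan
        for attempt in range(max_attempts):
            passage = search(sub_query, attempt)          # execute
            if is_sufficient(passage):                    # evaluate
                evidence.append(passage)
                break
            sub_query = f"reformulated: {sub_query}"      # correct and retry
        else:
            evidence.append(f"KNOWLEDGE GAP: {sub_query}")
    return evidence

print(agentic_answer("Which supplier has the best warranty terms?"))
```

The `else` branch on the inner loop is where the agent reports a gap instead of generating from bad evidence.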
Core Agentic RAG Patterns
Agentic RAG is not a monolithic technology. It is a set of composable patterns. Here are the four primary ones.
Query decomposition
This is the highest-leverage pattern. Before any retrieval, the agent analyzes the question and breaks it into atomic sub-queries.
Example: "Compare the payment terms and late-payment penalties across our three main suppliers."
The agent decomposes this into:
- What are Supplier A's payment terms?
- What are Supplier B's payment terms?
- What are Supplier C's payment terms?
- What are the late-payment penalties for Supplier A, B, and C?
Each sub-query targets a specific document, which dramatically improves retrieval precision. The agent then synthesizes the results into a comparison table.
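The decomposition above can be made concrete. In a real system the sub-queries come from an LLM call with a structured output schema; here a template stands in so the shape of the pattern is visible.

```python
def decompose(criteria: list[str], suppliers: list[str]) -> list[str]:
    # One atomic sub-query per (criterion, supplier) pair.
    return [f"What are {s}'s {c}?" for c in criteria for s in suppliers]

sub_queries = decompose(
    criteria=["payment terms", "late-payment penalties"],
    suppliers=["Supplier A", "Supplier B", "Supplier C"],
)
for q in sub_queries:
    print(q)  # each sub-query targets a single document; results are merged after
```

Each string is then run through retrieval independently, and the per-supplier answers feed the final comparison table.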
Self-reflection
After each retrieval step, the agent evaluates the relevance and completeness of what it found. If the retrieved passages do not answer the sub-query, the agent can:
- Reformulate the query with different terms or a different embedding strategy
- Broaden the search scope (more chunks, different filters, lower similarity threshold)
- Switch data sources (fall back from vector search to a SQL query, for example)
- Report a knowledge gap if the information simply does not exist in the available data
This pattern is critical for hallucination prevention. Instead of generating from irrelevant passages, the agent recognizes its limits and acts on them. It is related to what the literature calls Self-RAG and Corrective RAG.
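One way to structure the self-reflection pattern is a strategy ladder: evaluate after each retrieval, escalate to a broader strategy on failure, and report a gap if every strategy fails. `vector_search`, `sql_search`, and `grade` are hypothetical stubs (here the vector store finds nothing, so the SQL fallback fires).

```python
def vector_search(query: str, threshold: float) -> list[str]:
    return []  # stub: pretend semantic search finds nothing at any threshold

def sql_search(query: str) -> list[str]:
    return ["row matching query"]  # stub: the fallback source succeeds

def grade(passages: list[str]) -> bool:
    # Real systems use an LLM judge; here, non-empty means sufficient.
    return bool(passages)

def reflective_retrieve(query: str) -> list[str]:
    # Strategy ladder: strict vector search -> relaxed threshold -> SQL fallback.
    strategies = [
        lambda q: vector_search(q, threshold=0.8),
        lambda q: vector_search(q, threshold=0.5),  # broaden the scope
        lambda q: sql_search(q),                    # switch data source
    ]
    for attempt in strategies:
        passages = attempt(query)
        if grade(passages):
            return passages
    return ["KNOWLEDGE GAP: not in available data"]

print(reflective_retrieve("Q3 revenue by region"))
```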
Tool use (function calling)
Classic RAG is limited to vector search. Agentic RAG gives the agent a toolset it can invoke dynamically:
- Vector search: for semantic questions over unstructured text
- SQL queries: for structured data (revenue figures, inventory counts, dates)
- API calls: for real-time data (pricing feeds, order status, external databases)
- Computation tools: for arithmetic or statistical operations that LLMs handle unreliably
The agent selects the appropriate tool based on query type. A question about a contract amount routes to the vector store. A question about 12-month revenue trends routes to the SQL database. This is the ReAct pattern (Reason + Act) applied to retrieval.
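Tool routing can be sketched as a dispatch function. Real systems delegate the choice to the LLM via function calling; the keyword classifier below is a deliberately naive stand-in to keep the example self-contained, and the three tools are stubs.

```python
def vector_tool(q: str) -> str:  return f"vector search: {q}"
def sql_tool(q: str) -> str:     return f"SQL query: {q}"
def calc_tool(q: str) -> str:    return f"computation: {q}"

def route(query: str) -> str:
    lowered = query.lower()
    if any(w in lowered for w in ("revenue", "trend", "inventory", "how many")):
        return sql_tool(query)       # structured data -> SQL
    if any(w in lowered for w in ("sum of", "average", "percentage")):
        return calc_tool(query)      # arithmetic -> computation tool
    return vector_tool(query)        # default: semantic search over documents

print(route("What does the contract say about liability caps?"))
print(route("Show 12-month revenue trends"))
```

With function calling, the same dispatch happens inside the model: each tool is described in a schema and the LLM returns which one to invoke with which arguments.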
Multi-step reasoning
This pattern combines the previous three in an iterative loop. The agent advances step by step, each intermediate result feeding the next step.
Concrete example in a due diligence context:
- The agent identifies the 12 supplier contracts to analyze
- For each contract, it extracts termination clauses and renewal conditions
- It flags inconsistencies across contracts (different durations, contradictory clauses)
- It produces a synthesis report with identified risks ranked by severity
Each step depends on the output of the previous one. This is architecturally impossible with a single retrieval pass.
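The due-diligence steps above can be sketched as a chain in which each step consumes the previous step's output. `identify_contracts`, `extract_clauses`, and `flag_inconsistencies` are hypothetical stubs for what would be retrieval plus LLM extraction calls (and the corpus is shrunk to 3 contracts for brevity).

```python
def identify_contracts() -> list[str]:
    return [f"contract_{i}" for i in range(1, 4)]

def extract_clauses(contract: str) -> dict:
    # Stub extraction: hard-coded termination durations per contract.
    term = {"contract_1": "12 months", "contract_2": "24 months",
            "contract_3": "12 months"}[contract]
    return {"contract": contract, "termination": term}

def flag_inconsistencies(clauses: list[dict]) -> list[str]:
    terms = {c["termination"] for c in clauses}
    return [f"inconsistent termination terms: {sorted(terms)}"] if len(terms) > 1 else []

def run_pipeline() -> dict:
    contracts = identify_contracts()                      # step 1
    clauses = [extract_clauses(c) for c in contracts]     # step 2 needs step 1
    risks = flag_inconsistencies(clauses)                 # step 3 needs step 2
    return {"contracts": len(contracts), "risks": risks}  # step 4: synthesis

print(run_pipeline())
```

A single retrieval pass cannot express this: step 2's queries do not exist until step 1 has run.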
Classic RAG vs. Agentic RAG: Direct Comparison
| Dimension | Classic RAG | Agentic RAG |
|---|---|---|
| Retrieval strategy | Single-shot | Planned, iterative, adaptive |
| Query types | Simple, factual | Multi-hop, comparative, analytical |
| Data sources | Vector store only | Vector + SQL + APIs + computation |
| Failure handling | Generates anyway (hallucination risk) | Reformulates, retries, or reports gaps |
| Latency | 1–3 seconds | 5–30 seconds |
| Cost per query | 1 LLM call | 3–10 LLM calls |
| Debugging complexity | Linear pipeline, straightforward to trace | Execution graph, requires structured traces |
| Optimal use case | FAQ, search, document lookup | Legal analysis, auditing, due diligence |
The key point: Agentic RAG does not replace classic RAG. It is an additional layer for use cases that require complex reasoning. Both architectures frequently coexist in the same system.
Use Cases Where Agentic RAG Makes a Difference
Here are the recurring scenarios where agentic retrieval patterns measurably change answer quality.
Multi-document contract analysis
A legal team needs to compare non-compete clauses across 20 employment contracts. Classic RAG cannot handle this: at best it retrieves 3–5 similar passages with no guarantee of covering all contracts.
An agentic pipeline plans the analysis: it identifies all 20 contracts, systematically extracts the non-compete clause from each, compares them on defined criteria (duration, geographic scope, compensation), and produces a summary table. Time saved: several hours of senior counsel work per document set.
Regulatory compliance auditing
An engineering team needs to verify project compliance against a set of technical standards scattered across dozens of reference documents. The agent decomposes verification into individual checkpoints, cross-references each requirement against project documentation, and identifies conformance gaps β including when the required evidence is simply absent from the knowledge base.
M&A due diligence
During an acquisition, the deal team needs to map off-balance-sheet commitments, ongoing litigation, and key contracts of a target company. The agent systematically processes annual reports, board meeting minutes, and material contracts to build a structured risk map, flagging areas that need further investigation.
Technical specification comparison
A procurement team is evaluating 8 vendor bids for an industrial equipment purchase. The agent extracts specifications from each proposal, normalizes them into a common schema, identifies deviations from the requirements document, and produces a weighted selection matrix.
Implementation Stack
Several frameworks support agentic RAG. The right choice depends on use-case complexity and how much control you need over execution.
LangGraph
LangGraph (by LangChain) models the agent workflow as a state graph. Each node is a step (planning, retrieval, evaluation, generation) and edges define conditional transitions. It is the most flexible option for complex workflows with feedback loops.
Strengths: fine-grained control over transitions, native state management. Weaknesses: steep learning curve, verbose graph definitions.
LlamaIndex Workflows
LlamaIndex offers an event-driven workflow system that integrates naturally with its retrieval components. A good fit if your stack already uses LlamaIndex and you want to add agentic behavior incrementally.
CrewAI
CrewAI is oriented toward multi-agent patterns: each agent has a specialized role (researcher, analyst, writer) and they collaborate to resolve the task. Relevant when the use case requires multiple distinct competencies or parallel workstreams.
Custom orchestration
For production systems, a lightweight custom orchestrator is often preferable. Generic frameworks add abstraction layers that complicate debugging. A well-structured Python script with direct API calls (OpenAI, Anthropic, vector store) gives you more transparency and control over production monitoring.
The framework matters less than mastering the patterns (decomposition, reflection, tool routing) and implementing them in a way that is traceable. If you are starting from scratch, optimize your classic RAG first before adding an agent layer.
Limitations and Trade-offs
Agentic RAG is not a drop-in upgrade. Go in clear-eyed about its constraints.
Multiplied cost
Each request can trigger 3–10 LLM calls (planning + sub-queries + evaluation + synthesis). On a model like GPT-4o or Claude 3.5 Sonnet, that translates to roughly $0.05–$0.30 per complex query, versus $0.01–$0.03 for classic RAG. At 1,000 queries/day, monthly API spend can jump from ~$300 to ~$3,000.
Increased latency
Response time increases from 1–3 seconds (classic RAG) to 5–30 seconds (agentic), depending on iteration count. That is acceptable for back-office document analysis, but incompatible with real-time customer-facing chatbots where users expect near-instant responses.
Debugging complexity
Classic RAG is a linear pipeline: if the answer is wrong, inspect the retrieved chunks, then the prompt, then the generation step. With Agentic RAG, each request produces a decision tree that must be traceable and inspectable. Without structured observability (LangSmith, Arize Phoenix, or a custom tracing system), debugging becomes untenable at scale.
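A minimal version of structured tracing is just an append-only log of every decision, serializable for later replay; dedicated tools like LangSmith or Arize Phoenix do this with far richer metadata, but the shape is the same. The step names below are illustrative.

```python
import json
import time

class Trace:
    """Append-only record of an agent run, one entry per decision."""

    def __init__(self, query: str):
        self.query, self.steps = query, []

    def log(self, step_type: str, payload: dict) -> None:
        self.steps.append({"t": time.time(), "type": step_type, **payload})

    def dump(self) -> str:
        return json.dumps({"query": self.query, "steps": self.steps}, indent=2)

trace = Trace("Compare liability clauses")
trace.log("plan", {"sub_queries": ["clause in contract A", "clause in contract B"]})
trace.log("retrieve", {"query": "clause in contract A", "hits": 3})
trace.log("evaluate", {"sufficient": True})
print(trace.dump())  # the full decision path, inspectable after the fact
```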
Infinite loop risk
A misconfigured agent can loop indefinitely: reformulating the same query, searching without finding, burning tokens without producing a result. You must implement hard iteration limits (5–10 steps maximum) and explicit termination conditions from the start. This is non-negotiable in production.
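One way to enforce such a hard limit is a budget object that every loop iteration must pass through, raising once the step cap is exhausted. The cap of 3 below is only for demonstration.

```python
class StepBudget:
    """Hard cap on agent iterations: spend() raises once the limit is crossed."""

    def __init__(self, max_steps: int = 8):
        self.max_steps, self.used = max_steps, 0

    def spend(self) -> None:
        self.used += 1
        if self.used > self.max_steps:
            raise RuntimeError(f"agent exceeded {self.max_steps} steps; aborting")

budget = StepBudget(max_steps=3)
try:
    while True:          # simulates a runaway loop reformulating forever
        budget.spend()
except RuntimeError as e:
    print(e)             # the guard terminates the run instead of burning tokens
```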
When to Move from Classic RAG to Agentic RAG
The decision is not binary. Here are concrete criteria for evaluating whether the move is warranted.
Stay with classic RAG if
- Users ask simple, factual questions (FAQ, basic document search)
- Answer quality is above 85%, or standard optimizations (hybrid search, reranking, query rewriting) have not yet been exhausted
- Latency is critical (real-time chatbot, live customer support)
- API budget is constrained and query volume is high
Move to Agentic RAG if
- Queries regularly require joining information across multiple documents
- Answer quality has plateaued despite standard optimizations (hybrid search, reranking, query rewriting)
- The use case involves comparative analysis, synthesis, or auditing
- Data is heterogeneous (documents + SQL + APIs) and must be combined in a single answer
- 5–30 second latency is acceptable for the use case
- The cost of a wrong answer is high (legal, regulatory, financial)
In practice, the best architecture is often a hybrid router: a lightweight classifier that evaluates query complexity upfront and routes simple questions to classic RAG (fast, cheap) and complex questions to the agentic pipeline (slower, more expensive, but more reliable). The router itself can be a simple LLM call with a structured output schema.
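The hybrid router fits in a few lines. The keyword heuristic below is a stand-in for the LLM classifier with a structured output schema, and both pipelines are stubs.

```python
COMPLEX_MARKERS = ("compare", "across", "audit", "synthesize", "due diligence")

def classify(query: str) -> str:
    # Stand-in for a cheap LLM classification call returning a structured label.
    lowered = query.lower()
    return "complex" if any(m in lowered for m in COMPLEX_MARKERS) else "simple"

def classic_rag(query: str) -> str:  return f"[classic] {query}"
def agentic_rag(query: str) -> str:  return f"[agentic] {query}"

def answer(query: str) -> str:
    # Simple questions take the fast, cheap path; complex ones get the agent.
    pipeline = agentic_rag if classify(query) == "complex" else classic_rag
    return pipeline(query)

print(answer("What is the notice period in contract X?"))
print(answer("Compare notice periods across all contracts"))
```

Because the router runs on every request, it must stay cheap: one small-model call, or even a heuristic, is usually enough.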
Summary: Agentic RAG Is an Evolution, Not a Replacement
Agentic RAG does not replace classic RAG. It complements it for the 20–30% of queries that a linear pipeline cannot handle correctly. The key is knowing when to apply it, not applying it everywhere.
If your current RAG system performs well on simple queries but fails on multi-document analysis or comparative synthesis, Agentic RAG is the logical next step. But before committing to the architectural shift, make sure you have first optimized the fundamentals: hybrid search, intelligent parsing, semantic chunking, and reranking.
There is no magic here. You are building a system, layer by layer, starting from the business need. Agentic RAG is a powerful layer β but it will not compensate for poorly prepared data or an underspecified use case.
Hitting RAG limits in production?
Let's run a diagnostic together.
Related Articles
- RAG fundamentals: understanding the core architecture of Retrieval-Augmented Generation.
- Multi-agent orchestration: LangGraph vs CrewAI vs AutoGen vs custom – what we actually ship to production and why.
- Production RAG failure modes: the 5 things that consistently break RAG in production, and how to fix them.
- RAG systems: Tensoria's approach to building production-grade RAG pipelines.
- AI agents: tool-using LLMs in production – frameworks, evaluation, and observability.
Go Further
Explore our RAG systems offering or our AI agents service, or get in touch to discuss your specific use case.