Your RAG pipeline works well when documents are clean text. Ask it about a clause in a digitally-created contract and it finds it. Ask it "what is the pipe diameter shown on the drawing on page 12?" or "what is the total in the summary table on the last page of the invoice?" and you get silence, a hallucination, or a hedge. Standard text-only RAG does not see the visual layer of documents.
This is not an edge case. When you actually audit the document corpus in a typical enterprise — technical manuals, financial reports, legal contracts, internal procedures — you find that 30-60% of documents contain critical information that lives in tables, figures, scanned pages, or annotated diagrams. A text-only pipeline has structural blind spots over all of it. Before you dismiss this as a niche problem: those blind spots are exactly where your most technically complex, highest-value queries land.
This article covers the full engineering stack for multimodal RAG in 2026: the three architectural patterns; VLM-based ingestion with real models; ColPali and ColQwen2, and what the ViDoRe benchmark actually tells you; table extraction with Unstructured, Marker, and Docling; image embeddings with CLIP and SigLIP; multi-index retrieval; and query routing. I will also tell you when not to build this — because most teams should not start here.
Why text-only RAG fails on enterprise documents
A standard RAG pipeline assumes that documents are text-extractable. It parses PDFs into strings, chunks those strings, embeds them, and builds a vector index. This assumption breaks in four common ways that show up constantly in real enterprise deployments.
Scanned documents with no text layer. Older contracts, invoices, correspondence, and physical form submissions exist only as pixel data. A PDF parser returns an empty string. The document is invisible to the entire retrieval pipeline. You did not lose retrieval quality on these — you have zero retrieval capability, full stop.
Tables with complex structure. When a PDF parser encounters a table with merged cells, nested headers, or non-rectangular layouts, it flattens the structure into a sequence of space-delimited tokens. The semantic relationships between cells — which value belongs to which row-column combination — are destroyed. You end up indexing something like "Revenue Q1 Q2 Q3 15.2 18.7 21.3" with no structural information. The model cannot reliably interpret this during generation even if it retrieves the chunk.
Engineering and technical drawings. Schematics, P&ID diagrams, architectural floor plans, and circuit diagrams carry information in the spatial relationships between labeled components. There is no text to extract that captures "the valve marked V-207 is between the heat exchanger HX-12 and the bypass line." That information lives in the visual structure, not in the text annotations alone.
Annotated images and figures. Maintenance photos with arrows pointing to wear patterns, annotated screenshots, and infographic-style pages with embedded data visualizations all fall into this category. A text extractor picks up only the caption, not the content of the figure.
Lesson learned
On a RAG project for an engineering firm handling structural design documents, 44% of pages were scanned images with no text layer. The initial text-only pipeline had a retrieval recall@5 of 0.91 on the text-native subset and 0.03 on the scanned subset. The overall system quality metric was misleading — it averaged over a corpus where nearly half the documents were effectively invisible. Auditing the document corpus before designing the pipeline would have saved three weeks of debugging.
If your team is observing lower-than-expected RAG quality and your eval set is sampled uniformly from all documents, there is a reasonable chance you are averaging over a text-native subset that retrieves well and a visual subset that retrieves poorly. Segment your eval set by document type before drawing conclusions. This is one of the failure patterns we cover in detail in Production RAG: 5 Failure Modes We Keep Seeing.
The three approaches to multimodal RAG
There is no single architecture for multimodal RAG. Three patterns exist, each with a distinct cost-quality-complexity profile. The right choice depends on your document types, your latency requirements, and how much infrastructure complexity you can absorb.
OCR + text chunking. Convert images to text using an OCR engine, then treat the result like any other text in your existing pipeline. This is the most operationally conservative option — your vector index, retrieval logic, and LLM generation code change nothing. The cost is accuracy: OCR loses spatial layout, tables are mangled, and low-quality scans produce noisy text that degrades retrieval.
VLM-based ingestion. Render each page to an image and send it to a vision language model (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) during ingestion to generate structured text descriptions or extract structured data. The text descriptions are indexed normally. At query time, you can additionally send the page image to the LLM for visual grounding. This preserves table structure, figure content, and spatial relationships that OCR destroys — at significantly higher ingestion cost.
Native visual embeddings. Encode pages directly as visual embeddings using models like ColPali or ColQwen2, skipping OCR entirely. The entire retrieval step operates in visual embedding space. A text query is encoded by the same model into a compatible embedding, and retrieval finds visually similar pages. The retrieved page images are then passed to a VLM for generation. This is the architecturally cleanest approach and the most expensive to operate at scale due to large embedding sizes.
In practice, most production systems combine all three. Text-native documents take the OCR path (which for them is just PDF text extraction). Scanned pages with simple content take the VLM ingestion path. Pages with complex visual structure — dashboards, technical drawings, dense tables — may warrant ColPali-based retrieval. This is not elegance for its own sake; it is the only way to keep costs controlled while covering the full document type spectrum.
VLM-based ingestion: GPT-4o, Claude 3.5 Sonnet, Gemini
The core idea is straightforward: render the PDF page as a high-resolution PNG, send it to a frontier VLM with a structured prompt, and store the model's output as the indexed text for that page. What the prompt asks for determines what you index.
For general document understanding, a prompt like the following works well as a starting point:
def ingest_page_with_vlm(page_image_base64: str, page_metadata: dict) -> str:
    """
    Send a rendered PDF page to a VLM and return structured text for indexing.
    Uses Claude 3.5 Sonnet. Swap for GPT-4o or Gemini by changing the client.
    page_metadata is reserved for provenance tagging (document ID, page number).
    """
    import anthropic

    client = anthropic.Anthropic()
    prompt = """You are a document parser for an enterprise knowledge base.
Analyze this document page and produce a structured text description for indexing.
Your output must include:
1. A concise summary of the page content (2-4 sentences)
2. All tables: reproduce them in Markdown format, preserving row/column structure
3. All figures: describe what is shown, including any labels, values, or annotations
4. Key facts, numbers, and named entities present on the page
Be precise. Do not paraphrase numerical values or codes — reproduce them exactly.
Output plain text only, no XML or JSON wrapping."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": page_image_base64,
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    )
    return response.content[0].text
A few implementation details that matter in production.
Resolution. Render at 150 DPI at minimum: below 100 DPI, small text and table cell content become unreliable even for frontier VLMs. 200-300 DPI is generally the sweet spot for quality without oversized image tokens. PyMuPDF (fitz) and pdf2image both produce reliable renders.
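For reference, a minimal PyMuPDF rendering helper that produces the page_image_base64 input used by ingest_page_with_vlm above (the function name and DPI default are illustrative):

import base64
import fitz  # PyMuPDF

def render_page_to_base64(pdf_path: str, page_number: int, dpi: int = 200) -> str:
    """Render one PDF page to a base64-encoded PNG at the given DPI."""
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    zoom = dpi / 72  # PDF coordinates are 72 points per inch
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    png_bytes = pix.tobytes("png")
    doc.close()
    return base64.b64encode(png_bytes).decode("ascii")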
Prompt specificity per document type. Generic prompts work reasonably well but domain-specific prompts improve table extraction precision significantly. For financial reports, add: "Reproduce all financial tables in Markdown, with column headers and row labels intact." For engineering documents: "Identify and describe all component labels, measurements, and reference codes."
Cost management. In low-detail mode GPT-4o charges a flat 85 tokens per image; in high-detail mode, a 512x512 image costs roughly 765 tokens and a full page typically 1,100 tokens or more, depending on tile count. At a $5/1M token input rate, a 10,000-page corpus costs roughly $55-110 in image tokens alone, plus output tokens. Enable prompt caching if using Claude to amortize system prompt costs across the ingestion batch.
Async batching. Ingestion is I/O-bound. Run VLM calls concurrently with an asyncio semaphore to control parallelism (typically 10-20 concurrent requests before rate limits become an issue). Sequential ingestion of a 10,000-page corpus at 3 seconds per page is 8+ hours. Concurrent ingestion at 20 parallel requests brings this under 30 minutes.
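A minimal concurrency sketch, assuming the Anthropic async client; the prompt is abbreviated and retry/backoff handling is omitted:

import asyncio
from anthropic import AsyncAnthropic

async def ingest_corpus(page_images_base64: list[str], max_concurrency: int = 20) -> list[str]:
    """Run VLM ingestion calls concurrently, capped by a semaphore."""
    client = AsyncAnthropic()
    semaphore = asyncio.Semaphore(max_concurrency)

    async def ingest_one(page_b64: str) -> str:
        async with semaphore:  # caps in-flight requests below rate limits
            response = await client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=2048,
                messages=[{"role": "user", "content": [
                    {"type": "image", "source": {"type": "base64",
                     "media_type": "image/png", "data": page_b64}},
                    {"type": "text", "text": "Describe this page for indexing."},
                ]}],
            )
            return response.content[0].text

    return list(await asyncio.gather(*(ingest_one(p) for p in page_images_base64)))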
Lesson learned
Ingesting every page through a VLM is almost always the wrong first step. Run a document classifier first: if a page has a text layer with more than 100 characters, use standard text extraction. Only route pages that fail this threshold through the VLM. In most enterprise corpora, this reduces VLM ingestion calls by 60-75%, cutting cost and latency proportionally without losing accuracy on the pages that actually need visual processing.
ColPali and ColQwen2: native visual retrieval
ColPali (Faysse et al., 2024) is a visual document retrieval model that encodes entire page images into multi-vector embeddings using a PaliGemma backbone. The key insight is late interaction: rather than compressing the page into a single pooled vector (which loses spatial and typographic structure), ColPali produces a 1030-token embedding matrix — one vector per image patch. At query time, a MaxSim scoring function computes the maximum dot product between each query token embedding and all page patch embeddings, enabling fine-grained alignment between query terms and specific regions of the page.
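The scoring function itself is compact. A minimal single-page MaxSim sketch (production implementations batch and vectorize this across the whole index):

import torch

def maxsim(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
    """Late-interaction score: query_emb is (n_query_tokens, 128),
    page_emb is (n_patches, 128), one vector per image patch."""
    sim = query_emb @ page_emb.T               # (n_query_tokens, n_patches)
    return sim.max(dim=1).values.sum().item()  # best patch per query token, summed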
On the ViDoRe benchmark (Visually Rich Document Retrieval), which covers slides, financial reports, research papers, and technical tables, ColPali substantially outperforms text extraction baselines. The benchmark is a fair test of what matters in enterprise retrieval: nDCG@5 on queries that require understanding both text and visual layout to answer correctly.
ColQwen2 (built on the Qwen2-VL backbone) extends this with stronger multilingual document understanding and better performance on dense tables, which is the most common failure mode for ColPali. In our tests on French and English financial reports, ColQwen2 retrieves the correct table page 18% more often than ColPali-3B on queries that reference specific row/column values.
The practical constraints are real and worth stating directly:
- Storage. A ColPali embedding for a single page is approximately 1030 vectors of 128 dimensions each, roughly 0.5MB per page in float32. For a 10,000-page corpus, that is about 5.3GB of float32 vectors (roughly 1.3GB with the int8 quantization most deployments apply), compared to ~50MB for a standard text embedding index over the same content. You need a vector database that supports multi-vector storage per document: Qdrant's multi-vector collections are the most production-ready option as of 2026.
- Latency. MaxSim scoring over a large ColPali index is slower than single-vector ANN search. At 100K pages, expect 200-500ms retrieval latency without optimization. PLAID-style indexing (the approach introduced for ColBERTv2, adapted for ColPali) reduces this significantly through centroid compression.
- Fine-tuning. Out-of-the-box ColPali performs well on generic documents but can underperform on specialized domain content — molecular biology papers, legal citations with specific formatting, industry-specific technical drawings. Domain adaptation via fine-tuning on 500-2000 relevant document-query pairs is feasible but requires GPU infrastructure and ML engineering bandwidth most teams do not have readily available.
ColPali is not a drop-in replacement for your existing retrieval stack. It is an additional retrieval modality that works best as a parallel index alongside your hybrid text retrieval. If a query's top result from the text index has low confidence, route to ColPali. If the query contains spatial or visual terms ("the diagram on page", "the chart showing"), route to ColPali first.
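For completeness, a retrieval sketch with the open-source colpali-engine package; the model name, dtype, and the score_multi_vector helper follow that library's README as of this writing, so verify against the version you install:

import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("report_page_12.png")]        # rendered page images
queries = ["pipe diameter shown on the drawing"]

batch_images = processor.process_images(pages).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    page_embs = model(**batch_images)    # one multi-vector matrix per page
    query_embs = model(**batch_queries)  # one multi-vector matrix per query

scores = processor.score_multi_vector(query_embs, page_embs)  # MaxSim per pair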
Table extraction strategies
Tables deserve specific treatment because they are simultaneously the most information-dense content type in enterprise documents and the hardest to extract correctly. A quarterly earnings table with 40 rows and 8 columns, properly indexed, should support precise numerical retrieval. Extracted incorrectly, it produces either garbled text that degrades retrieval or silently missing data that creates false negatives.
Structured extraction for native PDFs
When the PDF was created digitally (not scanned), tables have an underlying structure you can access programmatically. pdfplumber is the most reliable Python library for this: it exposes each table as a list of lists, handles basic cell detection, and lets you filter by bounding box to isolate tables from surrounding text. Camelot supports both stream-mode (whitespace-separated) and lattice-mode (line-separated) table detection, which covers the two most common native PDF table layouts.
The output of structured extraction should be converted to Markdown with explicit row headers (or to a natural-language linearization) before indexing, not stored as raw CSV. A chunk that reads "Table 3: Quarterly Revenue by Region — Q1: North America $15.2M, Europe $9.8M; Q2: North America $18.7M, Europe $11.2M" is far more retrievable than the raw CSV equivalent, because queries arrive in natural language, not CSV syntax.
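A minimal pdfplumber-to-Markdown sketch; it assumes the first extracted row is the header, which real corpora will violate often enough to justify the validation tier described below:

import pdfplumber

def page_tables_as_markdown(pdf_path: str, page_number: int) -> list[str]:
    """Extract each table on a page and serialize it as a Markdown table."""
    with pdfplumber.open(pdf_path) as pdf:
        tables = pdf.pages[page_number].extract_tables()
    rendered = []
    for table in tables:
        if not table:
            continue
        header, *rows = table
        lines = ["| " + " | ".join(cell or "" for cell in header) + " |",
                 "|" + " --- |" * len(header)]
        for row in rows:
            lines.append("| " + " | ".join(cell or "" for cell in row) + " |")
        rendered.append("\n".join(lines))
    return rendered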
Layout-aware parsing for complex documents
Unstructured.io provides a document parsing library that classifies document elements (Title, NarrativeText, Table, Image, ListItem) before chunking. For tables, it uses a combination of coordinate analysis and visual detection to identify table regions, then extracts them as HTML with preserved structure. The open-source version handles most common layouts; the hosted API adds a more accurate table detection model for edge cases.
Docling (IBM Research, open-source) is worth evaluating for mixed-content documents. It runs a document layout analysis model (DocLayNet-based) to segment pages into regions, then applies specialized extractors to each region type. Table accuracy on complex layouts with merged cells is noticeably better than pdfplumber on the same documents. The tradeoff is runtime: Docling is 3-5x slower than pdfplumber per page due to the neural layout analysis step.
Marker (open-source, Datalab) converts PDFs to structured Markdown using a pipeline of specialized models: a layout detection model identifies regions, a table recognition model extracts table structure, and a text recognition model handles OCR for scanned content. Marker's Markdown output is well-suited for RAG chunking because the structural information (headings, table formatting, list nesting) is preserved as Markdown syntax rather than discarded. On a benchmark of 500 mixed native/scanned financial PDFs, Marker produced indexable table Markdown in 87% of cases vs 63% for naive text extraction.
VLM fallback for complex or scanned tables
When structured extraction produces inconsistent output — variable column counts between rows, more than 20% empty cells, or cells with concatenated values from adjacent rows — fall back to VLM extraction. Send the table as a cropped image with a precise prompt:
TABLE_EXTRACTION_PROMPT = """Extract this table to Markdown format.
Rules:
- Use | as column separator
- First row must be the header row
- Reproduce all cell values exactly — do not round numbers or abbreviate text
- If cells are merged, repeat the value in each affected cell
- If the header spans multiple rows, flatten to a single header row
- Output only the Markdown table, nothing else"""
GPT-4o and Claude 3.5 Sonnet both achieve over 90% cell-level accuracy on standard tables and above 75% on tables with merged cells in our internal benchmarks. This is meaningfully better than any structured extractor on visually complex tables. The tradeoff is $0.01-0.05 per table image processed, which is acceptable for high-value documents but not economical for bulk ingestion of simple tables.
Lesson learned
The pattern that works in production: run pdfplumber first and validate the output with a simple heuristic (check that column count is consistent across all rows and that no more than 15% of cells are empty). If validation fails, route to Docling. If Docling also fails the same validation, route to VLM extraction. This three-tier fallback handles 95%+ of enterprise PDF tables without manual triage, and the VLM tier — the expensive one — fires for only the 5-10% of tables that actually need it.
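A minimal sketch of that validation gate, written against the raw cell grid pdfplumber returns (adapt the input type for Docling's table objects); both thresholds should be tuned on your corpus:

def extraction_is_valid(table: list[list], max_empty_ratio: float = 0.15) -> bool:
    """Check column-count consistency and the share of empty cells."""
    if not table or not table[0]:
        return False
    n_cols = len(table[0])
    if any(len(row) != n_cols for row in table):
        return False  # ragged rows: structure was not recovered cleanly
    cells = [cell for row in table for cell in row]
    empty = sum(1 for cell in cells if cell is None or not str(cell).strip())
    return empty / len(cells) <= max_empty_ratio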
Image embeddings: CLIP, SigLIP, OpenCLIP
Visual embedding models allow you to encode images into the same vector space as text queries, enabling semantic retrieval across modalities. The landscape in 2026 has matured considerably since the original CLIP, and the right choice depends on what kind of visual content you are indexing.
Original CLIP (OpenAI, 2021) remains widely used but is showing its age. It was trained on 400M internet image-text pairs, which means it has broad coverage of natural images but relatively weak understanding of document-specific content like tables, dense text on page backgrounds, and technical diagrams. For natural image retrieval (product photos, site photographs, equipment images), it is still adequate. For document page retrieval, it underperforms modern alternatives.
SigLIP (Google, 2023) replaces CLIP's softmax contrastive loss with a sigmoid loss, enabling training with larger batch sizes and producing better-calibrated similarity scores. SigLIP-So400m/patch14-384 is the strongest public checkpoint for general visual understanding tasks and significantly outperforms CLIP ViT-L/14 on retrieval benchmarks that mix document and natural image content.
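A minimal cross-modal similarity sketch with that checkpoint via Hugging Face transformers; the file name and query text are illustrative, and note that SigLIP expects padding="max_length":

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-so400m-patch14-384"
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

image = Image.open("pump_seal_photo.png")
inputs = processor(text=["a worn pump seal"], images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# cosine similarity between the image embedding and the text embedding
img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarity = (img @ txt.T).item()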
OpenCLIP ViT-H/14 (LAION) trained on the LAION-5B dataset achieves state-of-the-art performance on several zero-shot retrieval benchmarks and is worth evaluating for knowledge bases that mix high-resolution natural images with document content. The ViT-H/14 variant produces 1024-dimensional embeddings, versus 768 for ViT-L/14, which increases storage cost but improves retrieval precision on fine-grained queries.
For document-specific page retrieval, none of these are the right choice — ColPali and ColQwen2 are purpose-built for this and outperform them substantially on ViDoRe. Where CLIP and SigLIP are appropriate is in hybrid knowledge bases that contain both photographs and document pages, or in use cases where the query describes visual content ("show me photos where the seal is worn") rather than document content ("find the table showing pressure tolerances").
A practical recommendation: if your knowledge base contains more than 20% natural images (as opposed to document pages), build a two-tower retrieval approach. Use SigLIP or OpenCLIP for the image retrieval tower and your standard text embedding model for the text tower. Fuse results with reciprocal rank fusion before the reranker, the same pattern described in our article on hybrid search and reranking. If your knowledge base is purely document pages, skip CLIP/SigLIP and evaluate ColPali directly.
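The fusion step is small enough to inline. A sketch of reciprocal rank fusion over ranked ID lists from the two towers (k=60 is the conventional default):

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. image tower ranked [pageA, pageC], text tower ranked [pageC, pageB]
fused = reciprocal_rank_fusion([["pageA", "pageC"], ["pageC", "pageB"]])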
Production architecture: multi-index retrieval and query routing
Here is the production architecture we build for enterprise multimodal RAG. It is not the simplest possible design — but simplicity that produces wrong answers on half the corpus is not actually simple.
Ingestion pipeline
Every incoming document goes through a classifier before any indexing occurs. The classifier is a simple rule-based system, not a neural model:
- If text extraction returns more than 100 characters per page on average: text-native document. Extract text, run structured table extraction, chunk, embed with your standard text model.
- If text extraction returns fewer than 100 characters per page: scanned document. Run OCR (Azure Document Intelligence or Google Document AI for highest accuracy) and route complex pages through VLM ingestion.
- If the document contains figure-heavy pages (image area greater than 40% of page area, detectable via PyMuPDF as sketched below): augment with VLM descriptions for those pages.
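A sketch of the figure-area check with PyMuPDF; the 40% threshold comes from the rule above, and overlapping images can push the computed fraction past 1.0:

import fitz  # PyMuPDF

def image_area_fraction(page: fitz.Page) -> float:
    """Fraction of the page covered by embedded raster images."""
    page_area = page.rect.width * page.rect.height
    if not page_area:
        return 0.0
    image_area = 0.0
    for info in page.get_image_info():
        x0, y0, x1, y1 = info["bbox"]
        image_area += max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return image_area / page_area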
Each document produces multiple types of indexable chunks:
- Text chunks: standard paragraph-level chunks with contextual headers (document title, section heading, page number prepended to each chunk).
- Table chunks: each extracted table as Markdown, with a header that names the table and its section context.
- Visual description chunks: VLM-generated descriptions of figure and diagram pages, tagged with the source page number and document.
- ColPali embeddings (optional): page-level visual embeddings for visually complex documents, stored in a separate Qdrant multi-vector collection.
Retrieval and query routing
At query time, a lightweight query classifier determines the retrieval strategy; a minimal routing sketch follows the list:
- Text queries (most queries): hybrid BM25 + dense retrieval over the text and table index, reranked with a cross-encoder. This is the standard path and handles the majority of production traffic.
- Visual queries (queries containing spatial or figure references: "the diagram on", "the chart showing", "the drawing labeled"): route to the ColPali index first, merge with text retrieval results using RRF.
- Table queries (queries with numerical precision requirements, explicit table references): retrieve from the table-specific index with both BM25 and dense retrieval, then send both the table Markdown chunk and the source page image to the generation model.
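A first pass at the router can be pure keyword matching; a minimal sketch whose cue lists are illustrative and should be tuned against real query logs:

import re

VISUAL_CUES = re.compile(
    r"\b(diagram|drawing|figure|chart|graph|schematic|photo)\b", re.IGNORECASE)
TABLE_CUES = re.compile(
    r"\b(table|row|column|total|subtotal)\b", re.IGNORECASE)

def route_query(query: str) -> str:
    """Pick a retrieval strategy; pair with a confidence check on the
    text index's top results for queries the keywords miss."""
    if VISUAL_CUES.search(query):
        return "visual"  # ColPali first, merge with text results via RRF
    if TABLE_CUES.search(query):
        return "table"   # table index, plus source page image at generation
    return "text"        # standard hybrid BM25 + dense path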
For generation, route requests that retrieved visual content to a multimodal LLM (GPT-4o or Claude 3.5 Sonnet). Pass both the text chunks and the source page images. For text-only retrievals, a standard LLM is sufficient and cheaper. This modality-aware routing reduces the fraction of queries that consume image tokens at generation time from 100% to roughly 15-30% in typical enterprise corpora, which has a significant impact on per-query cost.
Observability requirements
Multimodal pipelines are harder to debug than text-only ones. At minimum, each query trace should record: which retrieval index was used, the retrieval scores from each modality, whether the generation step received images, the VLM model used for generation, and the final token counts. Without this, when a user reports a wrong answer on a visual query, you have no way to know whether the failure was in routing, retrieval, table extraction quality, or generation. The principle is the same as what we describe for text RAG observability — applied to a more complex pipeline.
Approach comparison
| Criterion | OCR + chunking | VLM ingestion | ColPali / ColQwen2 |
|---|---|---|---|
| Production maturity | High | Good | Emerging |
| Visual understanding | Weak | Excellent | Good |
| Table accuracy | Fragile | High | Variable |
| Ingestion cost per page | Low ($0.001) | High ($0.01-0.05) | Medium (compute) |
| Query latency | Low (~200ms) | High (2-5s) | Medium (500ms-1s) |
| Vector index size | Small | Small (text output) | 10-50x larger |
| Integration complexity | Low | Medium | High |
| Best for | Simple scanned text, incremental addition to existing pipeline | Complex tables, diagrams, high-accuracy document understanding | Visually rich corpora, slides, reports, no OCR tolerance |
Cost and latency reality check
The cost differential between text-only and multimodal RAG is real and worth calculating before you commit to an architecture. The 3-8x figure cited frequently in documentation is a reasonable approximation, but the actual multiplier depends on your specific document mix and operational choices.
Ingestion cost. Text extraction from a 10,000-page corpus with a standard embedding model costs roughly $5-20 in embedding API calls. The same corpus with VLM-based ingestion for all pages costs $100-500 depending on the model and resolution. With the selective routing approach described above — only routing visually complex pages through the VLM — the cost typically falls to $30-100 for a corpus with 30% visual content, which is the more realistic enterprise baseline.
Per-query cost. A text-only RAG query against a well-optimized pipeline costs roughly $0.002-0.005 in LLM tokens (assuming a 2,000-token context window with Claude Haiku or GPT-4o-mini for non-complex queries). A query that routes to a multimodal model and passes two page images costs $0.01-0.03 depending on image size and model tier. At 10,000 daily queries with 25% hitting visual content, that is an additional $15-60/day — meaningful but rarely a blocker if the visual query answers are worth getting right.
Latency budget. For text-only retrieval, a well-instrumented pipeline returns responses in 800ms-1.5s at P95 including LLM generation. VLM-based generation with image inputs adds 1.5-3s of latency due to image token processing. This rules out multimodal generation for real-time conversational applications with sub-2s SLAs. For knowledge retrieval and document search use cases — where 3-5s response times are acceptable — it is fine. Design your SLAs around the actual use case, not an abstract performance goal.
Storage. ColPali embeddings for a 10,000-page corpus require approximately 1.3GB of vector storage with int8 quantization (see the storage note above). Most managed vector databases (Qdrant Cloud, Pinecone, Weaviate) price storage at $0.025-0.10/GB/month. The ColPali index costs roughly $0.03-0.13/month additional at this scale — negligible. At 1M pages it starts to matter: 130GB of vector storage at $0.10/GB/month is $13/month for the index alone, before compute costs for MaxSim scoring.
When not to build multimodal RAG
Most teams should not build multimodal RAG as their first RAG project, and many teams with existing text-only RAG do not need it either. Before investing in multimodal infrastructure, you should be able to answer yes to at least two of the following:
- Your document audit shows that more than 25% of documents contain visual elements that carry information critical to answering real user queries.
- You have specific, measurable user queries that fail today due to visual content — not hypothetical ones, actual production failures.
- Your text-only RAG pipeline is already performing well (faithfulness above 0.80 on your eval set, low hallucination rate on text-native queries). You should not add multimodal complexity on top of a text pipeline that is not yet solid.
- You have the engineering bandwidth to maintain a more complex pipeline — separate ingestion paths, multi-index retrieval, query routing logic, and more expensive observability requirements.
Lesson learned
We have declined to build multimodal RAG on three projects in the past year where teams wanted it primarily because they had seen ColPali benchmarks and found them impressive. In each case, a document audit showed that fewer than 15% of documents had significant visual content, and the failing queries were due to chunking strategy and lack of hybrid search — not visual blindness. Fixing those two problems was cheaper, faster, and produced a larger quality improvement than multimodal would have. Do the audit first.
If you are at the evaluation stage, a focused 4-6 week proof of concept is the right structure: select 100-300 representative documents, build 30-50 ground truth question-answer pairs that require visual understanding, measure baseline text-only retrieval accuracy on those pairs, then add the simplest viable multimodal layer (usually VLM-based ingestion, not ColPali) and measure the delta. The delta tells you whether the investment is justified before you architect a production system around it.
Conclusion
Multimodal RAG is not a research curiosity in 2026. The tools are production-ready: Marker and Docling for document parsing, VLMs for table extraction and visual description, ColPali and ColQwen2 for native visual retrieval, Qdrant for multi-vector indexing. The pattern of combining these with a routing layer over a hybrid text index is deployable today and works in production.
But it is not a drop-in upgrade. It requires a more complex ingestion pipeline, more expensive infrastructure, and more disciplined observability than text-only RAG. The correct sequencing is: ship text-only RAG with solid evaluation infrastructure first, audit your document corpus to quantify visual content, identify specific failing queries that require visual understanding, and then add multimodal capability incrementally where the evidence justifies it.
If your current RAG system fails on scanned pages or complex tables, that is not a model problem. It is a pipeline design problem. And multimodal architecture, applied selectively, is how you fix it.
Frequently asked questions
What is multimodal RAG?
Multimodal RAG extends standard Retrieval-Augmented Generation to handle non-text content: scanned PDFs, tables, diagrams, photos, and annotated figures. Instead of indexing only extracted text, it uses OCR, vision language models, or visual embedding models to make image and layout content searchable. The retriever returns both text chunks and visual page images, and a multimodal LLM generates answers grounded in both modalities.
Why does text-only RAG fail on enterprise documents?
Text-only RAG fails when critical information lives in the visual layer: scanned contracts and invoices (no text layer), complex tables with merged cells (structure destroyed by naive extraction), technical schematics and engineering drawings (meaning is spatial), and annotated images like site photos. In enterprise knowledge bases, 30-60% of documents typically contain at least one such element.
What is ColPali and how does it work?
ColPali encodes full document page images directly into multi-vector embeddings using a PaliGemma backbone — no OCR required. Each page produces a 1030-token embedding matrix, enabling fine-grained late-interaction scoring against query embeddings (similar to ColBERT for text). On the ViDoRe benchmark, ColPali significantly outperforms text extraction baselines on slides, financial reports, and technical diagrams. ColQwen2 extends this with stronger multilingual and table understanding.
What are the best tools for extracting tables from PDFs?
For native PDFs: pdfplumber and Camelot for simple tables, Docling (IBM Research) and Unstructured.io for complex layout-aware extraction. Marker (open-source) converts PDFs to structured Markdown with high table accuracy. For scanned or visually complex tables, fall back to VLM extraction with GPT-4o or Claude 3.5 Sonnet. The recommended production pattern: pdfplumber first, Docling as second tier, VLM as final fallback for the 5-10% of tables that defeat structured extractors.
Should I use CLIP, SigLIP, or ColPali for visual retrieval?
For document page retrieval, use ColPali or ColQwen2 — CLIP and SigLIP are trained on natural images and underperform on dense text pages. For knowledge bases mixing photographs with documents, SigLIP-So400m/patch14-384 and OpenCLIP ViT-H/14 both outperform original CLIP. The practical recommendation: use a two-tower approach with SigLIP for natural image chunks and your standard text embedder for document chunks, fusing results with RRF before reranking.
How much more does multimodal RAG cost than text-only RAG?
Roughly 3-8x more at ingestion time, 2-4x more per query for queries that use visual generation. VLM-based ingestion costs $0.01-0.05 per page vs $0.001 for text extraction. On 10,000 pages: $100-500 vs $5-20. With selective routing (VLM only for visually complex pages), total ingestion cost typically drops to $30-100 for a corpus with 30% visual content. The ROI justification: the cost is fixed at ingestion; the benefit accrues on every query that would otherwise return a wrong or incomplete answer.
Further reading
- RAG: a technical guide — The foundations: how RAG works, chunking strategies, vector stores, and where text-only RAG fits in your architecture.
- Production RAG: 5 failure modes we keep seeing — Document type diversity is one of the most common sources of silent failure in production RAG systems.
- Hybrid search and reranking — The retrieval layer that multimodal results merge into: BM25, RRF, and cross-encoder reranking explained.
- Embedding models in 2026 — How to evaluate and select the text embedding model for the text retrieval side of your multimodal pipeline.
- Agentic RAG — When multimodal retrieval is not enough and you need an agent to plan multi-step document queries.
- Vector database comparison — Includes coverage of multi-vector storage for ColPali embeddings: Qdrant, Weaviate, and Pinecone compared.
- ViDoRe benchmark — The standard evaluation set for visually rich document retrieval. Useful for comparing ColPali, ColQwen2, and text-extraction baselines on your document types.
- Marker (GitHub) — Open-source PDF to Markdown converter with neural layout analysis and accurate table rendering.
Multimodal RAG on your document corpus?
We audit your documents, design the ingestion pipeline, and validate quality before you commit to the infrastructure. 30 minutes to scope it.