Manual data entry from PDFs is still one of the most time-consuming and error-prone tasks in most SMBs and mid-market companies. Thirty supplier purchase orders in thirty different formats, paper invoices scanned to a shared drive, 80-page contracts to review before signing, technical reports to archive in the ERP. Each document processed manually costs between 3 and 15 minutes of qualified staff time, multiplied by tens of thousands of pages per year.
AI-based PDF data extraction is not a new concept, but its reliable industrialization is. In 2026, the available stacks have matured substantially: native parsers, layout-aware OCR, multimodal foundation models, and structured-output frameworks now make it possible to build pipelines that are reliable, measurable, and maintainable. The challenge is no longer purely technical. It is architectural: choosing the right component for the right case, and instrumenting the system to catch errors before they reach production.
This article covers the two-branch architecture that applies to any PDF extraction project (native vs. scanned), the criteria for selecting a stack based on actual volume, honest evaluation metrics, and the failure modes that derail projects weeks after launch. Whether you are a CTO, IT lead, or senior engineer at an SMB or mid-market company, this guide helps you scope your project correctly before committing to it.
Key takeaways
- Every PDF extraction pipeline starts with a fork: native PDF (direct parser) vs. scanned PDF (OCR required). Processing both through the same path generates silent errors.
- GPT-4o vision covers low-volume, high-variety use cases. Azure Document Intelligence becomes the right call at 5,000+ pages per month or for recurring, typed forms.
- OCR CER must be measured independently before optimizing extraction. A CER above 3% degrades everything downstream, regardless of LLM quality.
- A per-field confidence threshold with HITL is the only way to guarantee acceptable data quality in production without 100% supervision.
- The most common trap: underestimating format variability. 30 suppliers can mean 30 purchase order variants, each requiring its own calibration.
1. Why PDF extraction is still a live problem in 2026
PDF is the dominant format for business document exchange. Invoices, contracts, purchase orders, audit reports, product data sheets, medical records, legal deeds. In an SMB processing 1,000 to 50,000 pages per year, a significant share of those documents still enters through a human operator: manual entry into the ERP, copy-paste between applications, a consolidation spreadsheet maintained by someone who has other things to do.
The problem is not a lack of solutions. It is the quality and reliability of production deployments. Dozens of extraction tools exist, but the majority of projects fail for identifiable reasons: incorrect document-type detection, uninstrumented OCR, absent confidence thresholds, silent drift when supplier formats change.
In 2026, the technical conditions for building robust pipelines are firmly in place. Layout-aware OCR models reach industrial-grade accuracy. Multimodal LLMs handle complex documents zero-shot. Structured-output frameworks (instructor, Pydantic v2) lock down output format. What is still missing from most implementations is the decision architecture and the metric instrumentation.
That is exactly what this article covers. Before getting into stacks, let us map the actual problem space.
2. Problem landscape: native, scanned, structured, semi-structured
There is no single PDF extraction problem. There are four, depending on the combination of two dimensions: how the document was generated, and how structured its content is.
Native PDF vs. scanned PDF
A native PDF is produced by software (Word, LibreOffice, an ERP, an invoicing tool). It contains a digital text layer: pdfplumber or PyMuPDF extract its content directly, with near-perfect text accuracy and reasonable preservation of simple tables.
A scanned PDF is an image of a paper document. There is no text layer: all content is encoded in pixels. Extraction requires an OCR (Optical Character Recognition) step before any semantic analysis. The quality of that OCR determines everything that follows.
The distinction is not always obvious at a glance. A PDF can appear digital but actually be an embedded image. Automatic detection relies on a heuristic: if the density of selectable text (via pdfminer) falls below a threshold, the document is routed to the scan branch.
Structured vs. semi-structured
A structured document has a stable, predictable layout: an invoice produced by the same accounting software from the same supplier for three years always places the invoice number in the same position. A layout-aware model fine-tuned on that format can extract it with 98%+ accuracy.
A semi-structured document has recognizable logic but variable layout: a purchase order whose structure differs by supplier, a contract whose clauses are always present but in a different order, a technical report with recurring sections but different templates by firm. These documents require an LLM capable of semantic reasoning, not just spatial position recognition.
Decision matrix
| | Structured | Semi-structured |
|---|---|---|
| Native PDF | pdfplumber + dedicated model or business rules | pdfplumber + LLM structured output |
| Scanned PDF | OCR + Azure DI Custom Model or LayoutLM | OCR + LLM structured output (with reinforced HITL) |
3. The two-branch architecture
Below is the reference architecture we deploy at Tensoria for PDF extraction projects. It is organized around a central fork: automatic document-type detection determines which processing path is taken.
Source: manual upload, email attachment, FTP supplier feed, partner API, physical scan
|
Ingestion + type detection
(text density heuristic: native if > threshold, scanned otherwise)
|
+-----------------------------+--------------------------------------+
| Branch A -- Native PDF | Branch B -- Scanned PDF |
| | |
| pdfplumber / PyMuPDF | Image pre-processing |
| -> text + structure | (deskew, binarization, 300 dpi) |
| (tables, headings) | -> layout-aware OCR |
| | (Tesseract 5, Azure DI, |
| | Google Document AI) |
| | -> text + spatial coordinates |
+-----------------------------+--------------------------------------+
|
Extraction model (selected by volume and document type)
- Varied, low-frequency documents: LLM + JSON Schema
- Recurring forms (low volume): GPT-4o vision direct
- Recurring forms (medium volume): Azure DI Custom Model
- Recurring forms (high volume): LayoutLM v3 / Donut self-hosted
|
Post-processing
- Business validation (amounts, dates, formats)
- Per-field confidence score
- Flagging of ambiguous fields -> HITL queue
|
Output: structured JSON -> ERP, SQL database, DMS, RAG
Document-type detection: a critical step that is routinely skipped
Most naive implementations skip this step entirely. They send all PDFs through the same pipeline. The result: scanned PDFs pass through the text parser and come out empty or garbled, often without triggering an explicit error. Missing data arrives silently in the ERP.
Correct detection relies on two combined criteria: the density of selectable text (characters per page) and the presence of full-page image objects. A threshold of 100 characters per page is a solid starting point. Below it, the pipeline switches to the OCR branch.
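The routing rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: `PageStats`, `route_page`, `route_document`, and the 90% image-coverage ratio are hypothetical names and values, and the way the two criteria are combined (either one sends the page to OCR) is an assumption to calibrate on your own corpus. The per-page measurements would come from your parser; with pdfplumber, for instance, `len(page.chars)` gives the character count.

```python
from dataclasses import dataclass

MIN_CHARS_PER_PAGE = 100       # starting point given in the text
FULL_PAGE_IMAGE_RATIO = 0.9    # assumption: image covering >= 90% of the page


@dataclass
class PageStats:
    char_count: int              # selectable characters on the page
    largest_image_ratio: float   # largest image area / page area, 0.0 to 1.0


def route_page(page: PageStats) -> str:
    """Return "native" or "scanned" for a single page."""
    if page.char_count < MIN_CHARS_PER_PAGE:
        return "scanned"
    # Assumption: text on top of a full-page image is treated as an embedded
    # OCR layer of unknown quality, so the page is re-OCRed to be safe.
    if page.largest_image_ratio >= FULL_PAGE_IMAGE_RATIO:
        return "scanned"
    return "native"


def route_document(pages: list[PageStats]) -> str:
    """Return "native", "scanned", or "mixed" (pages then need per-page routing)."""
    if not pages:
        raise ValueError("document has no pages")
    routes = {route_page(p) for p in pages}
    return routes.pop() if len(routes) == 1 else "mixed"
```

Returning "mixed" explicitly, instead of forcing a document-level choice, avoids the silent failure described above when a native PDF contains a few scanned annexes.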
Image pre-processing before OCR
On the scan branch, image quality determines everything. Systematic pre-processing before OCR materially improves results:
- Deskewing: correcting tilt (a document scanned at 2 degrees of skew measurably degrades CER)
- Adaptive binarization: converting to black-and-white with local thresholding for documents with non-uniform backgrounds
- Noise suppression: filtering scanning artifacts (dots, streaks)
- Resolution: upscaling when the source resolution is below 150 dpi (bicubic interpolation or super-resolution)
OpenCV provides most of these operations. For cloud pipelines, Azure Document Intelligence integrates this pre-processing natively.
4. Choosing a stack based on actual volume
Monthly page volume is the most objective selection criterion. It determines both the project economics and the level of investment in annotation and fine-tuning that is justifiable.
Low volume: GPT-4o vision direct
For varied documents at low volume (under 5,000 pages per month), GPT-4o vision in direct mode is the fastest path to production. Convert the PDF to images (one image per page), send the images with a structured prompt, receive a JSON output validated by Pydantic.
Advantages: zero annotation, zero fine-tuning, deployment in a few days, strong handling of varied formats. Limitations: high cost at scale (0.02 to 0.10 euros per document depending on size), degraded accuracy on dense tables or complex layouts, latency of 5 to 30 seconds per document.
This approach is documented in the official Azure OpenAI documentation for invoice extraction from PDFs.
Medium volume: Azure Document Intelligence
Azure Document Intelligence (formerly Form Recognizer) is the standard choice for intermediate volumes (5,000 to 200,000 pages per month) with recurring, typed forms. Its Layout model extracts the full document structure (tables, paragraphs, headings, checkboxes) as structured Markdown. Its Custom Model fine-tunes on 5 to 10 labeled examples to recognize the specific fields in your documents.
Advantages: natively layout-aware, per-field confidence scores included, guaranteed SLA, Azure compliance with EU-region hosting, simple REST API. Cost is 0.01 euros per page for the Layout model, 0.015 euros for Custom Models. At 50,000 pages per month, the API cost is 500 to 750 euros per month, far below the cost of equivalent manual data entry.
The main limitation: Azure DI extracts structure but does not understand semantics. For fields whose meaning is contextual (a contractual clause, a warranty condition), a downstream LLM pass is necessary.
High volume or sensitive data: LayoutLM v3 or Donut self-hosted
Beyond 200,000 pages per month, or for highly sensitive data requiring full on-premise processing, fine-tuned open-source models become the economically rational option.
LayoutLM v3 (Microsoft Research) is a transformer that jointly encodes text, spatial position, and visual features. It excels on forms with a stable layout and labeled fields. It requires an annotated dataset of 500 to 2,000 documents per form type, and fine-tuning on GPU (A100 or H100, 4 to 12 hours depending on dataset size).
Donut (Document Understanding Transformer) processes the document directly as an image without a prior OCR step. It is more robust on visually complex documents or medium-quality scans, because there is no separate OCR stage whose errors could propagate. The trade-off is a higher training data requirement (1,000 to 3,000 annotated examples for complex types).
Marginal cost at scale is near zero once infrastructure is in place. An A10G instance on AWS or Scaleway costs 1.50 to 2.50 euros per hour, or 1,100 to 1,800 euros per month for continuous availability.
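To make the volume argument concrete, here is a rough break-even calculation using only the prices quoted in this section. The function names are illustrative, and the calculation deliberately ignores annotation, fine-tuning, and maintenance effort, so it is a lower bound on the real break-even volume.

```python
def api_cost_per_month(pages: int, price_per_page: float) -> float:
    """Monthly bill for a per-page API."""
    return pages * price_per_page


def break_even_pages(gpu_cost_per_month: float, price_per_page: float) -> float:
    """Monthly page volume at which a fixed GPU bill equals the per-page API bill."""
    return gpu_cost_per_month / price_per_page


# Azure DI bill at 50,000 pages/month (Layout, then Custom Model pricing).
layout_bill = api_cost_per_month(50_000, 0.01)
custom_bill = api_cost_per_month(50_000, 0.015)

# Break-even range: cheapest GPU vs. priciest API, then the reverse.
break_even_low = break_even_pages(1_100, 0.015)
break_even_high = break_even_pages(1_800, 0.01)

print(f"API bill: {layout_bill:.0f} to {custom_bill:.0f} euros/month")
print(f"Break-even: {break_even_low:,.0f} to {break_even_high:,.0f} pages/month")
```

On inference cost alone, the crossover sits between roughly 73,000 and 180,000 pages per month. Once annotation and engineering time are added, that is consistent with reserving self-hosting for volumes beyond 200,000 pages per month.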
AI PDF extraction stack comparison
| Stack | Target volume | Annotation required | Cost/page | Data privacy |
|---|---|---|---|---|
| GPT-4o vision | < 5,000 pages/month | None | 0.02 to 0.10 euros (per document) | Cloud (enterprise plan available) |
| Azure DI (Layout / Custom Model) | 5,000 to 200,000 pages/month | None for Layout, 5 to 10 examples for Custom Model | 0.01 to 0.015 euros | Azure EU configurable |
| LayoutLM v3 fine-tuned | > 200,000 pages/month | 500 to 2,000 docs | < 0.001 euros at scale | On-premise possible |
| Donut self-hosted | > 200,000 pages/month | 1,000 to 3,000 docs | < 0.001 euros at scale | On-premise, no third-party OCR |
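The selection logic in this section can be written down as a small function. This is one possible encoding, assuming the volume boundaries above (5,000 and 200,000 pages per month); `select_stack` and its parameters are hypothetical names, and the branch for varied documents at medium volume is our reading of the Azure DI limitation (Layout plus a downstream LLM pass), not a rule stated elsewhere.

```python
def select_stack(pages_per_month: int, recurring_forms: bool,
                 on_premise_required: bool = False) -> str:
    """Map monthly volume and document type to a candidate stack."""
    # Sensitive data or very high volume: self-hosted open-source models.
    if on_premise_required or pages_per_month > 200_000:
        return "LayoutLM v3 / Donut self-hosted"
    # Medium volume: Azure DI, with an LLM pass when documents vary.
    if pages_per_month >= 5_000:
        if recurring_forms:
            return "Azure DI Custom Model"
        return "Azure DI Layout + LLM structured output"
    # Low volume: no annotation, no fine-tuning.
    if recurring_forms:
        return "GPT-4o vision direct"
    return "LLM + JSON Schema"


print(select_stack(2_000, recurring_forms=False))
print(select_stack(50_000, recurring_forms=True))
```

Keeping this decision in code, rather than in a slide, makes it easy to revisit when volumes or prices change.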
Marker, Unstructured.io, and LlamaParse
Marker is an open-source PDF-to-Markdown converter that performs well on well-formed native PDFs. It handles tables, formulas, and complex layouts with better quality than pdfplumber on typographically rich documents. It includes no semantic layer: it is used as a parser before an LLM.
Unstructured.io is a document ingestion framework that unifies processing paths across formats (PDF, Word, PowerPoint, HTML, email) into a common API. Useful when the pipeline needs to handle heterogeneous formats, though the abstraction layer can mask OCR quality issues.
LlamaParse is a cloud parser optimized for RAG pipelines, with solid table and structured-element handling. It is relevant when the goal is indexing for an AI assistant rather than field extraction to an ERP. Our article on multimodal RAG with images, PDFs, and tables covers these architectures in detail.
5. Structured output: Pydantic, instructor, and JSON Schema
Unstructured extraction produces text. Structured extraction produces data. The difference is critical the moment you target automatic integration with an ERP or SQL database: a badly formatted date, an amount with a comma instead of a decimal point, a null field where the ERP expects an empty string. All of these generate production errors.
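The three failure cases just listed (date format, decimal comma, null vs. empty string) are typically handled by a small normalization layer before the ERP call. The sketch below uses only the standard library; the function names and the day-first date formats are assumptions, and the amount parser treats the last separator as the decimal mark, which misreads a value like "1,234" written with a thousands separator and no decimals.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_amount(raw: str) -> Decimal:
    """Parse '1 234,56', '1.234,56' or '1,234.56' into Decimal('1234.56')."""
    cleaned = raw.strip().replace("\u00a0", "").replace(" ", "")
    # Whichever of ',' or '.' appears last is treated as the decimal separator.
    if cleaned.rfind(",") > cleaned.rfind("."):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"unparseable amount: {raw!r}") from exc


def normalize_date(raw: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")) -> date:
    """Try each accepted format in order; day-first is assumed for ambiguous dates."""
    for fmt in formats:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")


def erp_text(value: Optional[str]) -> str:
    """Some ERPs expect an empty string where extraction yields null."""
    return value or ""
```

Raising `ValueError` on anything unparseable, instead of guessing, lets the field fall into human review.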
The base pattern with JSON Schema
Modern LLMs (GPT-4o, Claude Sonnet, Mistral Large) accept a JSON Schema as an API parameter, and the model's output is then expected to conform to that schema. This is the first layer of safety.
But LLMs can still violate semantic constraints: producing a future date for an "issue date" field, or a negative amount for a pre-tax total. That is where the upper layers come in.
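To illustrate the gap between the two layers: the schema below constrains shape and types, while the function underneath catches the two semantic violations just mentioned. The schema and field names are hypothetical and not tied to any provider's API.

```python
from datetime import date

# Hypothetical schema: it fixes structure and types, nothing more.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_excl_tax": {"type": "number"},
    },
    "required": ["number", "issue_date", "total_excl_tax"],
    "additionalProperties": False,
}


def semantic_errors(payload: dict, today: date) -> list[str]:
    """Checks a JSON Schema cannot express on its own."""
    errors = []
    if date.fromisoformat(payload["issue_date"]) > today:
        errors.append("issue_date is in the future")
    if payload["total_excl_tax"] <= 0:
        errors.append("total_excl_tax must be positive")
    return errors


# Schema-valid, yet wrong on both counts.
bad = {"number": "F-1", "issue_date": "2031-01-15", "total_excl_tax": -120.0}
print(semantic_errors(bad, today=date(2026, 1, 1)))
```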
For a deep-dive into production patterns for structured LLM outputs, see our article on structured outputs in LLM production systems.
Instructor and Pydantic for output validation
Instructor is a Python library that wraps LLM calls with automatic Pydantic validation and retry logic. If the LLM produces invalid output, instructor relaunches the request with the validation error message in context, allowing the model to self-correct. In practice, 95 to 99% of outputs pass on the first attempt, and the remaining 1 to 5% pass on the second.
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, field_validator


class InvoiceLine(BaseModel):
    reference: str
    description: str
    quantity: float = Field(gt=0)
    unit_price_excl_tax: float = Field(gt=0)


class Invoice(BaseModel):
    number: str
    supplier: str
    issue_date: date
    # Default None: in Pydantic v2, Optional without a default is still required.
    due_date: Optional[date] = None
    lines: list[InvoiceLine]
    total_excl_tax: float = Field(gt=0)
    vat_rate: float = Field(ge=0, le=1)  # expressed as a fraction, e.g. 0.20
    total_incl_tax: float = Field(gt=0)
    global_confidence: float = Field(ge=0, le=1)
    fields_to_review: list[str] = Field(default_factory=list)

    @field_validator("total_incl_tax")
    @classmethod
    def verify_vat_consistency(cls, v, info):
        # info.data holds the fields validated so far; a field that failed
        # its own validation is absent, hence the membership checks.
        if "total_excl_tax" in info.data and "vat_rate" in info.data:
            expected = info.data["total_excl_tax"] * (1 + info.data["vat_rate"])
            if abs(v - expected) > 0.05:  # tolerance for rounding differences
                raise ValueError("Total incl. tax inconsistent with excl. tax and VAT rate")
        return v
The arithmetic validation (VAT consistency check) is a safety net that LLMs alone do not guarantee. It prevents silently incorrect extraction results from reaching the accounting system.
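The retry behavior described for instructor can be illustrated without the library. The sketch below is a generic "re-ask with the validation error" loop, not instructor's actual API: `extract_with_retry`, `call_model`, and `validate` are hypothetical names, and the model is a stand-in that fails once and then corrects itself.

```python
from typing import Callable


def extract_with_retry(call_model: Callable[[str], dict],
                       validate: Callable[[dict], object],
                       prompt: str, max_attempts: int = 2) -> dict:
    """Call the model, validate, and re-ask with the error message on failure."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        output = call_model(current_prompt)
        try:
            validate(output)
            return output
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the model can self-correct.
            current_prompt = (f"{prompt}\n\nYour previous answer was rejected: "
                              f"{exc}. Return a corrected answer.")
    raise ValueError(f"no valid output after {max_attempts} attempts: {last_error}")


# Demo with a stand-in model: wrong on the first call, right on the second.
_answers = iter([{"total_excl_tax": -120.0}, {"total_excl_tax": 120.0}])
seen_prompts: list[str] = []


def fake_model(prompt: str) -> dict:
    seen_prompts.append(prompt)
    return next(_answers)


def check_total(output: dict) -> None:
    if output["total_excl_tax"] <= 0:
        raise ValueError("total_excl_tax must be positive")


result = extract_with_retry(fake_model, check_total, "Extract the pre-tax total.")
```

Because Pydantic v2's `ValidationError` subclasses `ValueError`, a call such as `Invoice.model_validate` could be passed as `validate` in this sketch. Outputs that still fail after the last attempt should go to human review, not be retried indefinitely.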
6. Honest evaluation metrics
A pipeline without an instrumented evaluation set is a pipeline whose real quality is unknown. This is the situation most projects find themselves in: a few manual tests on sample documents, then deployment. Problems surface in production, often weeks later.
Precision and recall per field
Precision and recall must be calculated at the field level, not the document level. A document extracted "correctly" at 80% accuracy can have a systematically wrong amount field. Without per-field measurement, that problem remains invisible.
- Per-field precision: among the values extracted for this field, what proportion is correct?
- Per-field recall: among the values that should have been extracted, what proportion was?
- Straight-through rate: proportion of documents processed without human intervention
Realistic targets for a production pipeline: precision above 94% on key fields (amount, date, reference number), recall above 90%, straight-through rate between 80 and 90%.
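A minimal way to compute these per-field metrics on an evaluation set is sketched below. It assumes predictions and references are aligned lists of dicts, that `None` means "no value", and that recall counts only correctly extracted values; the function names are illustrative.

```python
def field_metrics(predictions: list[dict], references: list[dict],
                  field: str) -> tuple[float, float]:
    """Return (precision, recall) for one field over aligned documents."""
    extracted = expected = correct = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.get(field), ref.get(field)
        if p is not None:
            extracted += 1          # the pipeline produced a value
        if r is not None:
            expected += 1           # a value should have been produced
        if p is not None and p == r:
            correct += 1            # produced and exactly right
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


def straight_through_rate(reviewed_flags: list[bool]) -> float:
    """Share of documents that needed no human intervention."""
    return reviewed_flags.count(False) / len(reviewed_flags)


preds = [{"amount": 100.0}, {"amount": 250.0}, {"amount": None}, {"amount": 80.0}]
refs = [{"amount": 100.0}, {"amount": 205.0}, {"amount": 40.0}, {"amount": 80.0}]
precision, recall = field_metrics(preds, refs, "amount")
```

In this toy set, three values were extracted and two are correct, while four were expected: precision is 2/3 and recall is 1/2, even though a document-level view would hide which field is failing.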
OCR CER: the independent metric to measure first
The Character Error Rate measures OCR quality independently of everything downstream. It is calculated on a manually annotated document sample:
CER = (insertions + deletions + substitutions) / total reference characters
A CER below 3% is the production target. Above 5%, recognition errors propagate systematically into extraction: a "0" read as "O", a "1" read as "l", an amount turned into a string that the Pydantic validator rejects.
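The formula above is the Levenshtein edit distance divided by the length of the reference text. A minimal standard-library implementation, with illustrative function names:

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate of an OCR output against its manual transcription."""
    if not reference:
        raise ValueError("reference text is empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One "0" read as "O" in a four-character amount.
print(cer("1000", "1O00"))
```

In practice the CER is computed over the concatenated text of the whole annotated sample, so that short fields do not dominate the average.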
Target metrics for a production PDF extraction pipeline
| Metric | Production target |
|---|---|
| Per-field precision on key fields (amount, date, reference) | > 94% |
| Recall (no key field missed) | > 90% |
| Straight-through processing rate | 80 to 90% |
| Downstream error rate (ERP) | < 1% |
| OCR CER (scan branch) | < 3% |
| Latency per document (under 10 pages) | < 10 s |
| Cost per processed page | < 0.05 euros |
The evaluation set must be built before pipeline development, on a stratified sample that represents the real variability of your documents: different suppliers, different years, different scan qualities. This set must never be used for training.
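One simple way to build such a stratified set is to sample a fixed number of documents per stratum. The sketch below assumes each document is a dict carrying the stratification key (supplier here); `stratified_sample` is a hypothetical helper, and the fixed seed keeps the evaluation set reproducible.

```python
import random
from collections import defaultdict


def stratified_sample(documents: list[dict], key: str,
                      per_stratum: int, seed: int = 0) -> list[dict]:
    """Draw up to per_stratum documents from each distinct value of key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in documents:
        strata[doc[key]].append(doc)
    sample = []
    for name in sorted(strata):
        docs = strata[name]
        # Small strata are taken whole so that rare suppliers stay represented.
        sample.extend(rng.sample(docs, min(per_stratum, len(docs))))
    return sample


docs = ([{"supplier": "A", "id": i} for i in range(50)]
        + [{"supplier": "B", "id": i} for i in range(3)])
evaluation_set = stratified_sample(docs, key="supplier", per_stratum=5)
```

The same function can be applied again with year or scan quality as the key when those dimensions matter more than supplier.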
Cost per document as an operational metric
Beyond accuracy metrics, cost per processed document is the metric that justifies and calibrates the investment. It breaks down into API cost (OCR plus LLM), infrastructure cost, and human review cost (HITL). A cost below 0.05 euros per page is generally the threshold below which automation becomes economically obvious compared to manual entry.
7. HITL and confidence thresholds
Human in the Loop is not an admission that the automated system failed. It is the mechanism that keeps production data quality acceptable without total supervision. A pipeline without HITL forces a choice between two bad options: deploy data of insufficient quality, or maintain exhaustive human oversight that eliminates the automation gain.
How to set the confidence threshold
The confidence threshold applies per field, not per document. A document may have an overall score of 0.92 but a "contract number" field at 0.71: that field should be flagged for review, not the entire document.
Threshold calibration depends on the cost of a downstream error. For an amount field feeding an accounting system, a threshold of 0.90 is reasonable. For a comment field going into a CRM, 0.75 may be sufficient. The general rule: set the threshold to keep the human review rate between 10 and 20% of documents. Below that, you are taking too much risk. Above it, the economic gain erodes.
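A per-field threshold check can be as small as the sketch below. The two thresholds come from the examples above; the field names, the fallback value of 0.85, and the function names are assumptions for illustration.

```python
FIELD_THRESHOLDS = {"total_incl_tax": 0.90, "comment": 0.75}
DEFAULT_THRESHOLD = 0.85  # hypothetical fallback for fields not listed above


def fields_to_review(confidences: dict[str, float]) -> list[str]:
    """Names of the fields whose confidence is below their own threshold."""
    return [name for name, score in confidences.items()
            if score < FIELD_THRESHOLDS.get(name, DEFAULT_THRESHOLD)]


def review_rate(documents: list[dict[str, float]]) -> float:
    """Share of documents with at least one field flagged for review."""
    if not documents:
        raise ValueError("no documents")
    return sum(1 for doc in documents if fields_to_review(doc)) / len(documents)


doc = {"contract_number": 0.71, "total_incl_tax": 0.95, "comment": 0.78}
print(fields_to_review(doc))
```

Tracking `review_rate` over a rolling window shows whether the thresholds keep the human review share inside the 10 to 20% band.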
The review interface
HITL without an efficient review interface turns human review into a slow, painful task. A minimal interface must display the source document, highlight extracted zones, allow inline correction, and log corrections to feed future evaluations. Tools like Label Studio or custom interfaces built with FastAPI and htmx serve this purpose well.
Accumulated corrections are a valuable data source for improving models over time. That is the virtuous loop: HITL improves the data, which improves the model, which reduces the HITL rate.
8. Downstream integration: RAG or ERP
PDF extraction is not an end in itself. Extracted data has two primary destinations: integration into a transactional system (ERP, accounting, CRM) or indexing into a document search system (RAG).
ERP integration
Integration with an ERP (SAP, Sage, Odoo, NetSuite, Cegid) is the most common use case for supplier invoices and purchase orders. Technical prerequisites: a REST API or import mechanism (EDI file, native connector) on the ERP, and a mapping layer between extracted fields and the ERP data model.
SAP has a REST API from SAP S/4HANA onwards. Sage 100 and Sage X3 offer APIs and native XML/CSV imports. Odoo is the most permissive with a complete JSON-RPC API. Legacy on-premise ERPs without documented APIs are the most frequent source of project timeline overruns in practice.
Indexing for a RAG document system
When the goal is to make a PDF corpus queryable in natural language (contracts, reports, data sheets, archives), extraction feeds a RAG pipeline. In that case, the objective is not structured field extraction but preserving semantic structure for chunking and vector indexing.
The two use cases can coexist: structured fields (amount, date, reference) are stored in a relational database for transactional queries, and the full text is indexed in a vector store for semantic queries. Our article on multimodal RAG with images, PDFs, and tables describes this architecture in detail. For a broader view of RAG deployment costs and failure modes, see our guides on RAG project costs and TCO and production RAG failure modes.
9. Realistic costs, timelines, and TCO
The figures below come from projects we have delivered at Tensoria for SMBs and mid-market companies. They cover a typical scope: one to three document types, a volume of 10,000 to 100,000 pages per year, integration with an ERP or accounting system.
Proof of concept (4 to 6 weeks): 6,000 to 12,000 euros
The POC covers annotation of 200 to 500 representative documents, construction of the OCR and extraction pipeline, assembly of the evaluation set, and a minimal review interface. At the end of the POC, you have real metrics (CER, per-field precision, straight-through rate) and a reliable estimate for the MVP.
Production MVP (2 to 3 months): 15,000 to 30,000 euros
The MVP includes ERP or accounting integration, the human validation workflow (HITL), handling of multiple document types, and production monitoring. This is the first system actually deployed to operational teams.
Annual TCO at scale
The annual operating cost breaks down into three items:
- API costs (Azure DI plus LLM): 100 to 800 euros per month depending on volume
- Model maintenance and adding new document types: 3 to 5 days per year
- Residual human review (10 to 20% of documents): pooled with existing operations teams
Annual TCO falls between 10,000 and 25,000 euros. For an SMB that manually processes 20,000 documents per year at 5 minutes of data entry each and a fully-loaded staff cost of 30 euros per hour, the gross saving is 50,000 euros. ROI is typically reached in 6 to 12 months.
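The arithmetic behind these figures is sketched below. The payback calculation rests on an assumption not stated above: that the initial investment is the POC plus the MVP budget. It also ignores the residual human review time, which the breakdown above treats as pooled with existing teams.

```python
def manual_cost_per_year(documents: int, minutes_each: float,
                         hourly_cost: float) -> float:
    """Fully-loaded cost of manual entry: documents x minutes x rate / 60."""
    return documents * minutes_each * hourly_cost / 60


def payback_months(initial_investment: float, gross_saving: float,
                   annual_tco: float) -> float:
    """Months needed for the net annual saving to repay the initial investment."""
    net_saving = gross_saving - annual_tco
    if net_saving <= 0:
        raise ValueError("no net saving: the investment is never paid back")
    return initial_investment * 12 / net_saving


gross = manual_cost_per_year(20_000, 5, 30)

# Favorable end: cheapest POC + MVP, lowest TCO.
favorable = payback_months(6_000 + 15_000, gross, 10_000)
# Midpoints of the POC, MVP, and TCO ranges.
midpoint = payback_months(9_000 + 22_500, gross, 17_500)

print(f"gross saving: {gross:.0f} euros/year")
print(f"payback: {favorable:.1f} to {midpoint:.1f} months")
```

Under these assumptions, payback lands at roughly 6 months at the favorable end and just under 12 months at the midpoints. Combinations at the expensive end of every range take longer.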
What stretches the timeline
The most frequent sources of delay: format heterogeneity (30 suppliers, 30 purchase-order variants, each requiring its own calibration), poor quality of legacy scans requiring specific image pre-processing, integration with a legacy on-premise ERP without a documented REST API, and GDPR review for documents containing personal data (contracts, HR records, health data).
10. The pitfalls that derail production deployments
These pitfalls are drawn from real projects. They are not theoretical.
Underestimating document variability
The "standard" purchase order from your main supplier is not the same as those from your other 29 suppliers. Each new format can degrade metrics if the model has not been evaluated on it. The fix: build an evaluation set that covers real variability from the start, and systematically add every new format detected in production.
Not measuring OCR CER
The mistake here is assuming the OCR is correct without measuring CER on an annotated sample. A document scanned at 72 dpi with a folded corner generates OCR errors that propagate all the way to production. Without an independent CER measurement, you spend weeks optimizing the wrong component.
Running native and scanned PDFs through the same pipeline
The two document types require radically different approaches. Mixing them without automatic type detection generates silent errors: the text parser does not raise an error on an image-PDF, it simply returns empty text or gibberish.
Not versioning document types
Suppliers change their invoice formats, often without warning. Without format drift detection (via comparison of extracted field distributions) and alerting, extracted fields can be silently wrong for weeks before an accounting reconciliation reveals the problem.
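A first version of drift detection can compare a single statistic per field between a baseline window and a recent window. The sketch below uses the missing-value rate, which is only one of the distributions worth tracking; the function names and the 10-point alert threshold are assumptions.

```python
def missing_rate(documents: list[dict], field: str) -> float:
    """Share of documents where the field is absent, None, or empty."""
    if not documents:
        raise ValueError("empty window")
    missing = sum(1 for doc in documents if doc.get(field) in (None, ""))
    return missing / len(documents)


def drifted_fields(baseline: list[dict], recent: list[dict],
                   fields: list[str], max_increase: float = 0.10) -> list[str]:
    """Fields whose missing rate rose by more than max_increase between windows."""
    return [f for f in fields
            if missing_rate(recent, f) - missing_rate(baseline, f) > max_increase]


# A supplier moved the due date on its template: 4 of 10 recent invoices lose it.
baseline = [{"number": f"F-{i}", "due_date": "2026-01-31"} for i in range(10)]
recent = ([{"number": f"G-{i}", "due_date": "2026-02-28"} for i in range(6)]
          + [{"number": f"G-{i}", "due_date": None} for i in range(6, 10)])

print(drifted_fields(baseline, recent, ["number", "due_date"]))
```

Computing this per supplier, not across the whole flow, keeps one supplier's template change from being diluted by the others.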
Unbounded scope from day one
"All PDFs in the company" is an unmanageable scope. Start with one document type on one flow (for example, supplier invoices received by email) and iterate. Each document type is a distinct project with its own metrics, its own evaluation set, and its own validation workflow.
Pre-launch checklist for an AI PDF extraction project
- Inventory of document types and volumes by type (pages/month)
- Sample of 50 to 100 representative documents per type, including degraded cases
- Identification of the data destination (ERP, SQL database, RAG) and integration constraints
- GDPR review if documents contain personal data
- Definition of key fields to extract and per-field acceptance criteria
- Identification of the teams that will handle HITL review and the time they can commit
For companies whose PDFs contain personal data (customer contracts, HR records, health data), a Data Protection Impact Assessment is recommended. The ICO publishes a comprehensive DPIA guide covering evaluation criteria. For teams handling sensitive documents in the EU and needing guidance on EU AI Act obligations, our article on EU AI Act compliance covers the risk tiers and documentation requirements.
Applications by sector
AI-based PDF extraction applies across any sector with a structured document flow of sufficient volume. The most common cases we encounter:
- Accounting firms and audit practices: supplier and client invoices, extraction of tax fields (VAT, pre-tax amount, invoice date, invoice number) for integration into accounting journals
- B2B distributors and manufacturers: multi-supplier purchase orders in heterogeneous formats, standardized extraction for ERP integration
- Legal and notarial practices: deeds, contracts, diagnostics, extraction of clauses, amounts, party identities, and key dates
- Construction and engineering firms: specifications, bills of quantities, technical reports, extraction of specifications, quantities, and equipment references
- Insurance brokers: policy schedules, extraction of coverage, deductibles, and effective dates for CRM population
- E-commerce and logistics: supplier delivery notes, automatic reconciliation with orders in the WMS
Further reading
- Multimodal RAG: Images, PDFs, and Tables in an AI Assistant: architecture for making document corpora queryable in natural language, with table and image handling.
- Structured Outputs in LLM Production Systems: JSON Schema, instructor, and Pydantic patterns for reliable field extraction at scale.
- Production RAG Failure Modes: the retrieval and chunking errors that break document pipelines in production.
- RAG Project Costs and TCO: how to estimate the full cost of a document intelligence system before committing.
- Self-Hosted RAG Architecture: when on-premise deployment is required for data sensitivity or volume economics.
- EU AI Act Compliance Guide: obligations and documentation requirements for AI systems processing business documents in Europe.
- Enterprise RAG Use Cases and ROI: the business cases that generate the fastest return on document AI investments.
- Tensoria RAG systems service: end-to-end deployment of document intelligence pipelines, from OCR to ERP integration.
- AI audit: structured review of your document flows to identify where extraction automation creates the most value.
Running a PDF-heavy workflow?
At Tensoria, we scope document extraction projects from the POC stage: document-type inventory, stack benchmarking against your actual volume, honest metrics on your real documents. Results in 4 to 6 weeks.