Tensoria
AI Engineering By Anas R.

PDF Data Extraction with AI: Architecture, Stacks and Costs 2026

Manual data entry from PDFs is still one of the most time-consuming and error-prone tasks in most SMBs and mid-market companies. Thirty supplier purchase orders in thirty different formats, paper invoices scanned to a shared drive, 80-page contracts to review before signing, technical reports to archive in the ERP. Each document processed manually costs between 3 and 15 minutes of qualified staff time, multiplied by tens of thousands of pages per year.

AI-based PDF data extraction is not a new concept, but its reliable industrialization is. In 2026, the available stacks have matured substantially: native parsers, layout-aware OCR, multimodal foundation models, and structured-output frameworks now make it possible to build pipelines that are reliable, measurable, and maintainable. The challenge is no longer purely technical. It is architectural: choosing the right component for the right case, and instrumenting the system to catch errors before they reach production.

This article covers the two-branch architecture that applies to any PDF extraction project (native vs. scanned), the criteria for selecting a stack based on actual volume, honest evaluation metrics, and the failure modes that derail projects weeks after launch. Whether you are a CTO, IT lead, or senior engineer at an SMB or mid-market company, this guide helps you scope your project correctly before committing to it.

Key takeaways

  • Every PDF extraction pipeline starts with a fork: native PDF (direct parser) vs. scanned PDF (OCR required). Processing both through the same path generates silent errors.
  • GPT-4o vision covers low-volume, high-variety use cases. Azure Document Intelligence becomes the right call at 10,000+ pages per month or for recurring, typed forms.
  • OCR CER must be measured independently before optimizing extraction. A CER above 3% degrades everything downstream, regardless of LLM quality.
  • A per-field confidence threshold with HITL is the only way to guarantee acceptable data quality in production without 100% supervision.
  • The most common trap: underestimating format variability. 30 suppliers can mean 30 purchase order variants, each requiring its own calibration.

1. Why PDF extraction is still a live problem in 2026

PDF is the dominant format for business document exchange. Invoices, contracts, purchase orders, audit reports, product data sheets, medical records, legal deeds. In an SMB processing 1,000 to 50,000 pages per year, a significant share of those documents still enters through a human operator: manual entry into the ERP, copy-paste between applications, a consolidation spreadsheet maintained by someone who has other things to do.

The problem is not a lack of solutions. It is the quality and reliability of production deployments. Dozens of extraction tools exist, but the majority of projects fail for identifiable reasons: incorrect document-type detection, uninstrumented OCR, absent confidence thresholds, silent drift when supplier formats change.

In 2026, the technical conditions for building robust pipelines are firmly in place. Layout-aware OCR models reach industrial-grade accuracy. Multimodal LLMs handle complex documents zero-shot. Structured-output frameworks (instructor, Pydantic v2) lock down output format. What is still missing from most implementations is the decision architecture and the metric instrumentation.

That is exactly what this article covers. Before getting into stacks, let us map the actual problem space.

2. Problem landscape: native, scanned, structured, semi-structured

There is no single PDF extraction problem. There are four, depending on the combination of two dimensions: how the document was generated, and how structured its content is.

Native PDF vs. scanned PDF

A native PDF is produced by software (Word, LibreOffice, an ERP, an invoicing tool). It contains a digital text layer: pdfplumber or PyMuPDF extract its content directly, with near-perfect text accuracy and reasonable preservation of simple tables.

A scanned PDF is an image of a paper document. There is no text layer: all content is encoded in pixels. Extraction requires an OCR (Optical Character Recognition) step before any semantic analysis. The quality of that OCR determines everything that follows.

The distinction is not always obvious at a glance. A PDF can appear digital but actually be an embedded image. Automatic detection relies on a heuristic: if the density of selectable text (via pdfminer) falls below a threshold, the document is routed to the scan branch.

Structured vs. semi-structured

A structured document has a stable, predictable layout: an invoice produced by the same accounting software from the same supplier for three years always places the invoice number in the same position. A layout-aware model fine-tuned on that format can extract it with 98%+ accuracy.

A semi-structured document has recognizable logic but variable layout: a purchase order whose structure differs by supplier, a contract whose clauses are always present but in a different order, a technical report with recurring sections but different templates by firm. These documents require an LLM capable of semantic reasoning, not just spatial position recognition.

Decision matrix

Structured Semi-structured
Native PDF pdfplumber + dedicated model or business rules pdfplumber + LLM structured output
Scanned PDF OCR + Azure DI Custom Model or LayoutLM OCR + LLM structured output (with reinforced HITL)

3. The two-branch architecture

Below is the reference architecture we deploy at Tensoria for PDF extraction projects. It is organized around a central fork: automatic document-type detection determines which processing path is taken.

Source: manual upload, email attachment, FTP supplier feed, partner API, physical scan
  |
Ingestion + type detection
  (text density heuristic: native if > threshold, scanned otherwise)
  |
+-----------------------------+--------------------------------------+
|  Branch A -- Native PDF     |  Branch B -- Scanned PDF             |
|                             |                                      |
|  pdfplumber / PyMuPDF       |  Image pre-processing                |
|  -> text + structure        |  (deskew, binarization, 300 dpi)     |
|    (tables, headings)       |  -> layout-aware OCR                 |
|                             |    (Tesseract 5, Azure DI,           |
|                             |     Google Document AI)              |
|                             |  -> text + spatial coordinates       |
+-----------------------------+--------------------------------------+
  |
Extraction model (selected by volume and document type)
  - Varied, low-frequency documents: LLM + JSON Schema
  - Recurring forms (low volume): GPT-4o vision direct
  - Recurring forms (medium volume): Azure DI Custom Model
  - Recurring forms (high volume): LayoutLM v3 / Donut self-hosted
  |
Post-processing
  - Business validation (amounts, dates, formats)
  - Per-field confidence score
  - Flagging of ambiguous fields -> HITL queue
  |
Output: structured JSON -> ERP, SQL database, DMS, RAG

Document-type detection: a critical step that is routinely skipped

Most naive implementations skip this step entirely. They send all PDFs through the same pipeline. The result: scanned PDFs pass through the text parser and come out empty or garbled, often without triggering an explicit error. Missing data arrives silently in the ERP.

Correct detection relies on two combined criteria: the density of selectable text (characters per page) and the presence of full-page image objects. A threshold of 100 characters per page is a solid starting point. Below it, the pipeline switches to the OCR branch.

Image pre-processing before OCR

On the scan branch, image quality determines everything. Systematic pre-processing before OCR materially improves results:

  • Deskewing: correcting tilt (a document scanned at 2 degrees of skew measurably degrades CER)
  • Adaptive binarization: converting to black-and-white with local thresholding for documents with non-uniform backgrounds
  • Noise suppression: filtering scanning artifacts (dots, streaks)
  • Resolution: upscaling when the source resolution is below 150 dpi (bicubic interpolation or super-resolution)

OpenCV provides most of these operations. For cloud pipelines, Azure Document Intelligence integrates this pre-processing natively.

4. Choosing a stack based on actual volume

Monthly page volume is the most objective selection criterion. It determines both the project economics and the level of investment in annotation and fine-tuning that is justifiable.

Low volume: GPT-4o vision direct

For varied documents at low volume (under 5,000 pages per month), GPT-4o vision in direct mode is the fastest path to production. Convert the PDF to images (one image per page), send the images with a structured prompt, receive a JSON output validated by Pydantic.

Advantages: zero annotation, zero fine-tuning, deployment in a few days, strong handling of varied formats. Limitations: high cost at scale (0.02 to 0.10 euros per document depending on size), degraded accuracy on dense tables or complex layouts, latency of 5 to 30 seconds per document.

This approach is documented in the official Azure OpenAI documentation for invoice extraction from PDFs.

Medium volume: Azure Document Intelligence

Azure Document Intelligence (formerly Form Recognizer) is the standard choice for intermediate volumes (5,000 to 200,000 pages per month) with recurring, typed forms. Its Layout model extracts the full document structure (tables, paragraphs, headings, checkboxes) as structured Markdown. Its Custom Model fine-tunes on 5 to 10 labeled examples to recognize the specific fields in your documents.

Advantages: natively layout-aware, per-field confidence scores included, guaranteed SLA, Azure compliance with EU-region hosting, simple REST API. Cost is 0.01 euros per page for the Layout model, 0.015 euros for Custom Models. At 50,000 pages per month, the API cost is 500 to 750 euros per month, far below the cost of equivalent manual data entry.

The main limitation: Azure DI extracts structure but does not understand semantics. For fields whose meaning is contextual (a contractual clause, a warranty condition), a downstream LLM pass is necessary.

High volume or sensitive data: LayoutLM v3 or Donut self-hosted

Beyond 200,000 pages per month, or for highly sensitive data requiring full on-premise processing, fine-tuned open-source models become the economically rational option.

LayoutLM v3 (Microsoft Research) is a transformer that jointly encodes text, spatial position, and visual features. It excels on forms with a stable layout and labeled fields. It requires an annotated dataset of 500 to 2,000 documents per form type, and fine-tuning on GPU (A100 or H100, 4 to 12 hours depending on dataset size).

Donut (Document Understanding Transformer) processes the document directly as an image without a prior OCR step. It is more robust on visually complex documents or medium-quality scans, because OCR errors cannot propagate. The trade-off is a higher training data requirement (1,000 to 3,000 annotated examples for complex types).

Marginal cost at scale is near zero once infrastructure is in place. An A10G instance on AWS or Scaleway costs 1.50 to 2.50 euros per hour, or 1,100 to 1,800 euros per month for continuous availability.

AI PDF extraction stack comparison

Stack Target volume Annotation required Cost/page Data privacy
GPT-4o vision < 5,000 pages/month None 0.02 to 0.10 euros Cloud (enterprise plan available)
Azure DI Layout 5,000 to 200,000 pages/month 5 to 10 examples 0.01 to 0.015 euros Azure EU configurable
LayoutLM v3 fine-tuned > 200,000 pages/month 500 to 2,000 docs < 0.001 euros at scale On-premise possible
Donut self-hosted > 200,000 pages/month 1,000 to 3,000 docs < 0.001 euros at scale On-premise, no third-party OCR

Marker, Unstructured.io, and LlamaParse

Marker is an open-source PDF-to-Markdown converter that performs well on well-formed native PDFs. It handles tables, formulas, and complex layouts with better quality than pdfplumber on typographically rich documents. It includes no semantic layer: it is used as a parser before an LLM.

Unstructured.io is a document ingestion framework that unifies processing paths across formats (PDF, Word, PowerPoint, HTML, email) into a common API. Useful when the pipeline needs to handle heterogeneous formats, though the abstraction layer can mask OCR quality issues.

LlamaParse is a cloud parser optimized for RAG pipelines, with solid table and structured-element handling. It is relevant when the goal is indexing for an AI assistant rather than field extraction to an ERP. Our article on multimodal RAG with images, PDFs, and tables covers these architectures in detail.

5. Structured output: Pydantic, instructor, and JSON Schema

Unstructured extraction produces text. Structured extraction produces data. The difference is critical the moment you target automatic integration with an ERP or SQL database: a badly formatted date, an amount with a comma instead of a decimal point, a null field where the ERP expects an empty string. All of these generate production errors.

The base pattern with JSON Schema

Modern LLMs (GPT-4o, Claude Sonnet, Mistral Large) accept a JSON Schema as an API parameter. The model commits to producing output that conforms to that schema. This is the first layer of safety.

But LLMs can still violate semantic constraints: producing a future date for an "issue date" field, or a negative amount for a pre-tax total. That is where the upper layers come in.

For a deep-dive into production patterns for structured LLM outputs, see our article on structured outputs in LLM production systems.

Instructor and Pydantic for output validation

Instructor is a Python library that wraps LLM calls with automatic Pydantic validation and retry logic. If the LLM produces invalid output, instructor relaunches the request with the validation error message in context, allowing the model to self-correct. In practice, 95 to 99% of outputs pass on the first attempt, and the remaining 1 to 5% pass on the second.

from pydantic import BaseModel, Field, field_validator
from datetime import date
from typing import Optional

class InvoiceLine(BaseModel):
    reference: str
    description: str
    quantity: float = Field(gt=0)
    unit_price_excl_tax: float = Field(gt=0)

class Invoice(BaseModel):
    number: str
    supplier: str
    issue_date: date
    due_date: Optional[date]
    lines: list[InvoiceLine]
    total_excl_tax: float = Field(gt=0)
    vat_rate: float = Field(ge=0, le=1)
    total_incl_tax: float = Field(gt=0)
    global_confidence: float = Field(ge=0, le=1)
    fields_to_review: list[str] = []

    @field_validator("total_incl_tax")
    def verify_vat_consistency(cls, v, info):
        if "total_excl_tax" in info.data and "vat_rate" in info.data:
            expected = info.data["total_excl_tax"] * (1 + info.data["vat_rate"])
            if abs(v - expected) > 0.05:
                raise ValueError("Total incl. tax inconsistent with excl. tax and VAT rate")
        return v

The arithmetic validation (VAT consistency check) is a safety net that LLMs alone do not guarantee. It prevents silently incorrect extraction results from reaching the accounting system.

6. Honest evaluation metrics

A pipeline without an instrumented evaluation set is a pipeline whose real quality is unknown. This is the situation most projects find themselves in: a few manual tests on sample documents, then deployment. Problems surface in production, often weeks later.

Precision and recall per field

Precision and recall must be calculated at the field level, not the document level. A document extracted "correctly" at 80% accuracy can have a systematically wrong amount field. Without per-field measurement, that problem remains invisible.

  • Per-field precision: among the values extracted for this field, what proportion is correct?
  • Per-field recall: among the values that should have been extracted, what proportion was?
  • Straight-through rate: proportion of documents processed without human intervention

Realistic targets for a production pipeline: precision above 94% on key fields (amount, date, reference number), recall above 90%, straight-through rate between 80 and 90%.

OCR CER: the independent metric to measure first

The Character Error Rate measures OCR quality independently of everything downstream. It is calculated on a manually annotated document sample:

CER = (insertions + deletions + substitutions) / total reference characters

A CER below 3% is the production target. Above 5%, recognition errors propagate systematically into extraction: a "0" read as "O", a "1" read as "l", an amount turned into a string that the Pydantic validator rejects.

Target metrics for a production PDF extraction pipeline

Metric Production target
Per-field precision on key fields (amount, date, reference)> 94%
Recall (no key field missed)> 90%
Straight-through processing rate80 to 90%
Downstream error rate (ERP)< 1%
OCR CER (scan branch)< 3%
Latency per document (under 10 pages)< 10 s
Cost per processed page< 0.05 euros

The evaluation set must be built before pipeline development, on a stratified sample that represents the real variability of your documents: different suppliers, different years, different scan qualities. This set must never be used for training.

Cost per document as an operational metric

Beyond accuracy metrics, cost per processed document is the metric that justifies and calibrates the investment. It breaks down into API cost (OCR plus LLM), infrastructure cost, and human review cost (HITL). A cost below 0.05 euros per page is generally the threshold below which automation becomes economically obvious compared to manual entry.

7. HITL and confidence thresholds

Human in the Loop is not an admission that the automated system failed. It is the mechanism that keeps production data quality acceptable without total supervision. A pipeline without HITL forces a choice between two bad options: deploy data of insufficient quality, or maintain exhaustive human oversight that eliminates the automation gain.

How to set the confidence threshold

The confidence threshold applies per field, not per document. A document may have an overall score of 0.92 but a "contract number" field at 0.71: that field should be flagged for review, not the entire document.

Threshold calibration depends on the cost of a downstream error. For an amount field feeding an accounting system, a threshold of 0.90 is reasonable. For a comment field going into a CRM, 0.75 may be sufficient. The general rule: set the threshold to keep the human review rate between 10 and 20% of documents. Below that, you are taking too much risk. Above it, the economic gain erodes.

The review interface

A HITL without an efficient review interface turns human review into a slow, painful task. A minimal interface must display the source document, highlight extracted zones, allow inline correction, and log corrections to feed future evaluations. Tools like Label Studio or custom interfaces built with FastAPI and htmx serve this purpose well.

Accumulated corrections are a valuable data source for improving models over time. That is the virtuous loop: HITL improves the data, which improves the model, which reduces the HITL rate.

8. Downstream integration: RAG or ERP

PDF extraction is not an end in itself. Extracted data has two primary destinations: integration into a transactional system (ERP, accounting, CRM) or indexing into a document search system (RAG).

ERP integration

Integration toward an ERP (SAP, Sage, Odoo, NetSuite, Cegid) is the most common use case for supplier invoices and purchase orders. Technical prerequisites: a REST API or import mechanism (EDI file, native connector) on the ERP, and a mapping layer between extracted fields and the ERP data model.

SAP has a REST API from SAP S/4HANA onwards. Sage 100 and Sage X3 offer APIs and native XML/CSV imports. Odoo is the most permissive with a complete JSON-RPC API. Legacy on-premise ERPs without documented APIs are the most frequent source of project timeline overruns in practice.

Indexing for a RAG document system

When the goal is to make a PDF corpus queryable in natural language (contracts, reports, data sheets, archives), extraction feeds a RAG pipeline. In that case, the objective is not structured field extraction but preserving semantic structure for chunking and vector indexing.

The two use cases can coexist: structured fields (amount, date, reference) are stored in a relational database for transactional queries, and the full text is indexed in a vector store for semantic queries. Our article on multimodal RAG with images, PDFs, and tables describes this architecture in detail. For a broader view of RAG deployment costs and failure modes, see our guides on RAG project costs and TCO and production RAG failure modes.

9. Realistic costs, timelines, and TCO

The figures below come from projects we have delivered at Tensoria for SMBs and mid-market companies. They cover a typical scope: one to three document types, a volume of 10,000 to 100,000 pages per year, integration with an ERP or accounting system.

Proof of concept (4 to 6 weeks): 6,000 to 12,000 euros

The POC covers annotation of 200 to 500 representative documents, construction of the OCR and extraction pipeline, constitution of the evaluation set, and a minimal review interface. At the end of the POC, you have real metrics (CER, per-field precision, straight-through rate) and a reliable estimate for the MVP.

Production MVP (2 to 3 months): 15,000 to 30,000 euros

The MVP includes ERP or accounting integration, the human validation workflow (HITL), handling of multiple document types, and production monitoring. This is the first system actually deployed to operational teams.

Annual TCO at scale

The annual operating cost breaks down into three items:

  • API costs (Azure DI plus LLM): 100 to 800 euros per month depending on volume
  • Model maintenance and adding new document types: 3 to 5 days per year
  • Residual human review (10 to 20% of documents): pooled with existing operations teams

Annual TCO falls between 10,000 and 25,000 euros. For an SMB that manually processes 20,000 documents per year at 5 minutes of data entry each and a fully-loaded staff cost of 30 euros per hour, the gross saving is 50,000 euros. ROI is typically reached in 6 to 12 months.

What stretches the timeline

The most frequent sources of delay: format heterogeneity (30 suppliers, 30 purchase-order variants, each requiring its own calibration), poor quality of legacy scans requiring specific image pre-processing, integration with a legacy on-premise ERP without a documented REST API, and GDPR review for documents containing personal data (contracts, HR records, health data).

10. The pitfalls that derail production deployments

These pitfalls are drawn from real projects. They are not theoretical.

Underestimating document variability

The "standard" purchase order from your main supplier is not the same as those from your other 29 suppliers. Each new format can degrade metrics if the model has not been evaluated on it. The fix: build an evaluation set that covers real variability from the start, and systematically add every new format detected in production.

Not measuring OCR CER

Assuming the OCR is correct without measuring CER on an annotated sample. A document scanned at 72 dpi with a folded corner generates OCR errors that propagate all the way to production. Without an independent CER measurement, you spend weeks optimizing the wrong component.

Running native and scanned PDFs through the same pipeline

The two document types require radically different approaches. Mixing them without automatic type detection generates silent errors: the text parser does not raise an error on an image-PDF, it simply returns empty text or gibberish.

Not versioning document types

Suppliers change their invoice formats, often without warning. Without format drift detection (via comparison of extracted field distributions) and alerting, extracted fields can be silently wrong for weeks before an accounting reconciliation reveals the problem.

Unbounded scope from day one

"All PDFs in the company" is an unmanageable scope. Start with one document type on one flow (for example, supplier invoices received by email) and iterate. Each document type is a distinct project with its own metrics, its own evaluation set, and its own validation workflow.

Pre-launch checklist for an AI PDF extraction project

  • ->Inventory of document types and volumes by type (pages/month)
  • ->Sample of 50 to 100 representative documents per type, including degraded cases
  • ->Identification of the data destination (ERP, SQL database, RAG) and integration constraints
  • ->GDPR review if documents contain personal data
  • ->Definition of key fields to extract and per-field acceptance criteria
  • ->Identification of the teams that will handle HITL review and the time they can commit

For companies whose PDFs contain personal data (customer contracts, HR records, health data), a Data Protection Impact Assessment is recommended. The ICO publishes a comprehensive DPIA guide covering evaluation criteria. For teams handling sensitive documents in the EU and needing guidance on EU AI Act obligations, our article on EU AI Act compliance covers the risk tiers and documentation requirements.

Applications by sector

AI-based PDF extraction applies across any sector with a structured document flow of sufficient volume. The most common cases we encounter:

  • Accounting firms and audit practices: supplier and client invoices, extraction of tax fields (VAT, pre-tax amount, invoice date, invoice number) for integration into accounting journals
  • B2B distributors and manufacturers: multi-supplier purchase orders in heterogeneous formats, standardized extraction for ERP integration
  • Legal and notarial practices: deeds, contracts, diagnostics, extraction of clauses, amounts, party identities, and key dates
  • Construction and engineering firms: specifications, bills of quantities, technical reports, extraction of specifications, quantities, and equipment references
  • Insurance brokers: policy schedules, extraction of coverage, deductibles, and effective dates for CRM population
  • E-commerce and logistics: supplier delivery notes, automatic reconciliation with orders in the WMS

FAQ: AI PDF data extraction

A native PDF is generated by software and contains a digital text layer that pdfplumber or PyMuPDF can access directly, yielding near-perfect text quality without OCR. A scanned PDF is an image: there is no text layer, so an OCR step is required before any semantic analysis. The quality of that OCR determines everything downstream. A degraded scan (below 150 dpi, stained or folded document) can produce a CER above 10%, making downstream extraction unreliable.
GPT-4o vision is the right choice for low volumes (under 5,000 pages per month) and varied documents whose layout changes frequently. It offers maximum flexibility with no prior annotation. Azure Document Intelligence becomes the standard option once volume exceeds 10,000 pages per month, or when documents are recurring and typed (purchase orders, invoices from the same supplier). Azure DI is natively layout-aware: it preserves tables and spatial structure more reliably than vision alone, at a lower per-page cost at scale.
CER (Character Error Rate) measures the character-by-character error rate between the OCR output and the ground-truth text. A CER of 3% means 3 characters in 100 are wrong. It must be measured on an annotated sample before optimizing any downstream extraction model, because OCR errors propagate silently: if the OCR turns "1,245.00" into "1,2A5.00", the amount field will be wrong regardless of LLM quality. The realistic production target is a CER below 3%.
HITL means defining a per-field confidence threshold (typically 0.85 to 0.90). When an extracted field's confidence falls below that threshold, the document is routed to a human review queue rather than sent directly to the ERP. This partial review keeps production data quality acceptable without requiring 100% supervision. In practice, 10 to 20% of documents go through the human review circuit, which dramatically reduces manual data-entry workload while avoiding silent errors.
LayoutLM v3 is better suited to forms with a stable, labeled layout, because it jointly exploits text, spatial position, and page structure. Donut processes the document directly as an image without an OCR step, making it more robust on visually complex documents or variable-quality scans. Both require an annotated dataset of 500 to 2,000 documents. The right criterion is the degree of visual structure in your target documents: labeled, positionally stable fields point to LayoutLM; visually complex or scan-heavy documents point to Donut.
A proof-of-concept covering one document type costs between 6,000 and 12,000 euros over 4 to 6 weeks: annotation of 200 to 500 documents, OCR and extraction pipeline, evaluation set, review interface. A production-ready MVP integrated with an ERP or accounting system costs between 15,000 and 30,000 euros over 2 to 3 months. The annual TCO at scale (Azure DI and LLM API costs, model maintenance, residual human review) runs between 10,000 and 25,000 euros per year.

Further reading

Running a PDF-heavy workflow?

At Tensoria, we scope document extraction projects from the POC stage: document-type inventory, stack benchmarking against your actual volume, honest metrics on your real documents. Results in 4 to 6 weeks.

Discuss your project
Anas Rabhi, data scientist specializing in generative AI
Anas Rabhi Data Scientist & Founder, Tensoria

I am a data scientist specializing in generative AI. I help engineering teams and technical leaders ship production-grade AI systems tailored to their domain. Process automation, internal knowledge assistants, intelligent document processing. I design systems that integrate into existing workflows and deliver measurable results.