Tensoria
LLM Engineering By Anas R.

Structured Outputs in Production: JSON Mode, Function Calling, and Constrained Decoding

Structured outputs from LLMs sound trivial until you're three months into production and your downstream pipeline has a 10% silent failure rate because the model decided to wrap its JSON in a markdown code block, nest an extra field it invented, or emit a trailing comma. At that point "just tell the model to respond in JSON" looks like what it always was: wishful thinking.

This article covers the full stack — from why naive JSON prompting breaks, through provider-level structured output mechanisms (OpenAI, Anthropic, Gemini), to library-level solutions (Instructor, Outlines, Guidance, xgrammar), schema design patterns, retry strategies, streaming considerations, and cost trade-offs. The goal is to give you a clear decision tree for your production system, not a feature tour. If you are building AI agents that depend on reliable tool use, or wiring LLM outputs into an existing LLM integration pipeline, this is the foundational piece you need to get right first.

My default recommendation for Python teams going into production today: Instructor + OpenAI strict json_schema. It is boring, well-maintained, and it works. Everything else in this article is context for why that recommendation holds and when you need something different.

Why naive JSON prompting fails in production

The naive approach looks like this:

# Do not do this in production
system_prompt = """
You are a data extraction assistant.
Always respond with valid JSON only. No other text.
"""

user_prompt = f"""
Extract the following from this invoice text:
- vendor name
- invoice date
- total amount
- line items

Invoice text: {invoice_text}
"""

This works in notebooks. It breaks in production for several well-documented reasons. The failure rate on naive JSON prompting sits between 5% and 15% depending on the model, the schema complexity, the input length, and the temperature setting. The failures are not random noise — they cluster around specific patterns:

  • Markdown wrapping: The model wraps its response in ```json ... ``` because training data correlates JSON content with markdown code blocks.
  • Trailing commas and comment artifacts: Models sometimes emit JavaScript-style comments or trailing commas that break json.loads().
  • Schema hallucination: The model invents fields that were not in your schema, or omits required fields when the input provides no signal for them.
  • Type coercion failures: You asked for an integer, the model returned "42" as a string. Pydantic will catch this if you're using it; raw JSON parsing will not.
  • Premature generation halt: Long JSON responses occasionally truncate mid-object when the model exhausts its max_tokens allocation, leaving you with unparseable partial output.
  • Escaped character bugs: Unescaped backslashes inside string values (common in file paths, regex patterns, LaTeX content) silently corrupt the JSON.
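Several of these failure modes are trivial to reproduce locally. A minimal demonstration of the markdown-wrapping and trailing-comma cases (the raw strings here are fabricated examples, not real model responses):

```python
import json

# Two fabricated outputs illustrating common failure patterns.
wrapped = '```json\n{"total_amount": 42}\n```'   # markdown code-block wrapping
trailing = '{"total_amount": 42,}'               # JavaScript-style trailing comma

results = {}
for name, raw in [("wrapped", wrapped), ("trailing", trailing)]:
    try:
        json.loads(raw)
        results[name] = "parsed"
    except json.JSONDecodeError as e:
        results[name] = f"failed: {e.msg}"

# Both attempts fail to parse, even though the payload "looks like" JSON.
```

Both strings are one character-class away from valid JSON, which is exactly why these failures slip through casual testing.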

The practical consequence is that your downstream pipeline — which assumed a valid, schema-conformant dict — fails silently or throws a raw exception. If you are not logging and monitoring every extraction call, you will not notice until a user or a downstream system surfaces the corruption. This is not a theoretical concern. We have audited pipelines where 8–12% of production extractions were returning empty results or default fallbacks because the JSON parse was silently failing and the code was swallowing the exception. (For extraction from PDFs with tables and figures rather than plain text, see multimodal RAG — the structured output principles apply but the ingestion side is fundamentally different.)

Lesson learned

The worst failures are silent ones. A pipeline that raises an exception at least tells you something broke. A pipeline that catches the JSON parse error, logs a warning, and returns an empty dict will silently corrupt your data for weeks before anyone notices. Before you fix the parsing, fix the observability. Every structured extraction call should emit a trace that records whether parsing succeeded, which fields were populated, and what the raw model output looked like.

Three concepts you should not conflate

Before we get into implementations, the terminology needs to be pinned down, because "JSON mode", "function calling", and "structured outputs" are used interchangeably in documentation and blog posts, and they are not the same thing.

JSON mode is a generation constraint that guarantees syntactically valid JSON. The model is free to produce any JSON structure it wants — any field names, any nesting, any types. It will not emit markdown wrappers or trailing commas. It will still hallucinate fields, omit required properties, and return wrong types. JSON mode is a syntax guarantee, not a schema guarantee.

Function calling (also called tool use) was originally designed as a mechanism for the model to indicate that it wants to invoke an external function, returning a structured argument payload that your code then uses to actually call the function. The model does not call the function — it returns a structured object representing the intended call. Teams quickly realized this mechanism is also excellent for structured data extraction: define a fake "function" whose parameters match the schema you want, and the model returns a schema-conformant object. It is a repurposing of tool use for extraction. OpenAI, Anthropic, and Google all support this pattern — though the implementation quality differs significantly (see our provider comparison for the specifics).

Structured outputs with schema enforcement is the modern term for provider-level constrained decoding against a specific JSON Schema. You pass a schema, the provider uses token-masking techniques to ensure the output matches that schema exactly — correct field names, correct types, all required fields present. This is categorically different from JSON mode. OpenAI introduced this as response_format with json_schema and strict: true in August 2024. Anthropic reached GA with their equivalent in early 2026.

Provider-level structured outputs

OpenAI: strict json_schema

OpenAI's structured outputs with strict: true is the most production-hardened option for API-hosted models. The failure rate drops to near zero for schemas that meet a set of documented requirements. The implementation looks like this:

from openai import OpenAI
import json

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string", "description": "Legal name of the vendor"},
        "invoice_date": {"type": "string", "description": "ISO 8601 date string"},
        "total_amount": {"type": "number", "description": "Total invoice amount"},
        "currency": {"type": "string", "enum": ["EUR", "USD", "GBP"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price": {"type": "number"}
                },
                "required": ["description", "quantity", "unit_price"],
                "additionalProperties": false
            }
        }
    },
    "required": ["vendor_name", "invoice_date", "total_amount", "currency", "line_items"],
    "additionalProperties": false
}

response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": "Extract structured invoice data from the provided text."},
        {"role": "user", "content": invoice_text}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "strict": True,
            "schema": schema
        }
    }
)

data = json.loads(response.choices[0].message.content)

Several constraints apply when using strict: true. Every object must have "additionalProperties": false. All properties listed in required must be in the schema. Optional fields must use a union of the actual type with "null" — not simply absent from required. These constraints feel restrictive at first, but they are actually good schema design discipline that you should be applying anyway. See OpenAI's structured outputs documentation for the complete list.
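For example, an optional notes field under strict mode stays listed in required but unions its type with "null". A schema fragment in the same style as the invoice schema above:

```python
# Strict-mode pattern: optional fields are required but nullable.
notes_schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "notes": {"type": ["string", "null"], "description": "Optional free-text notes"},
    },
    "required": ["vendor_name", "notes"],   # "notes" is still listed as required
    "additionalProperties": False,
}
```

The model must emit the field on every response; "no value" is expressed as an explicit null rather than an absent key, which keeps downstream code from having to branch on key existence.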

One important note on OpenAI schema caching: when you use the same schema across multiple requests, OpenAI caches the schema compilation server-side. The first request with a new schema has a small overhead; subsequent requests with the same schema are fast. This makes it cost-effective to use detailed, descriptive schemas in high-throughput settings.

Anthropic: tool use as structured extraction

Anthropic's Claude does not expose a response_format parameter in the same way. The idiomatic pattern is to define a tool with the desired schema and instruct the model to use it. This is not a workaround — Anthropic's documentation explicitly endorses this for structured extraction. The model returns a tool_use content block with the extracted data as the tool input. See Anthropic's tool use documentation for implementation details.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "extract_invoice",
        "description": "Extract structured invoice data from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "vendor_name": {"type": "string"},
                "invoice_date": {"type": "string"},
                "total_amount": {"type": "number"},
                "currency": {"type": "string", "enum": ["EUR", "USD", "GBP"]},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "integer"},
                            "unit_price": {"type": "number"}
                        },
                        "required": ["description", "quantity", "unit_price"]
                    }
                }
            },
            "required": ["vendor_name", "invoice_date", "total_amount", "currency", "line_items"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": invoice_text}]
)

# Extract the tool_use block
tool_use_block = next(b for b in response.content if b.type == "tool_use")
data = tool_use_block.input

The tool_choice: {"type": "tool", "name": "..."} parameter forces the model to always call the named tool rather than choosing whether to call it. This is essential for extraction use cases — without it, Claude might decide to respond in text. This pattern also integrates naturally with agentic architectures where the same model both extracts data and decides what to do with it.

Google Gemini: response_mime_type and response_schema

Gemini supports structured outputs via generation_config with response_mime_type: "application/json" and an optional response_schema. As of 2026, Gemini 1.5 Pro and 2.0 Flash both support schema-constrained generation. The schema syntax follows the OpenAPI subset of JSON Schema. For multi-modal extraction use cases — pulling structured data from documents with embedded images — Gemini's native vision capabilities combined with structured outputs make it a strong choice worth evaluating.

Lesson learned

Provider-level structured outputs do not eliminate all failure modes. They guarantee the output matches your schema's structural constraints. They do not guarantee semantic correctness — the model can still extract the wrong vendor name, misparse a date format, or set a required numeric field to zero because the source document was ambiguous. Schema conformance and factual accuracy are independent properties. Your eval pipeline needs to check both — see building custom LLM judges for the semantic side.

Library-level solutions: Instructor

The Instructor library (by jxnl/567-labs) is the de facto Python standard for structured LLM outputs. With over 3 million monthly downloads and 11k GitHub stars, it wraps the provider clients and adds three things that raw function calling or structured outputs do not give you: Pydantic-native model definitions, automatic retry with validation-error feedback, and a consistent API across 15+ providers.

The Pydantic integration is the key differentiator. Instead of writing raw JSON Schema by hand, you define your extraction target as a Pydantic model and Instructor handles the schema serialization:

from pydantic import BaseModel, Field
from typing import Optional
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class LineItem(BaseModel):
    description: str = Field(description="Product or service description")
    quantity: int = Field(ge=1, description="Quantity ordered")
    unit_price: float = Field(ge=0.0, description="Price per unit in document currency")

class Invoice(BaseModel):
    vendor_name: str = Field(description="Legal name of the issuing vendor")
    invoice_date: str = Field(description="Invoice date in ISO 8601 format (YYYY-MM-DD)")
    total_amount: float = Field(ge=0.0, description="Total amount due")
    currency: str = Field(pattern="^(EUR|USD|GBP)$")
    line_items: list[LineItem]
    notes: Optional[str] = Field(default=None, description="Any additional notes on the invoice")

invoice = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    response_model=Invoice,
    max_retries=3,
    messages=[
        {"role": "user", "content": f"Extract invoice data from:\n\n{invoice_text}"}
    ]
)

# invoice is a validated Invoice instance — not a dict, not JSON
print(invoice.vendor_name)
print(invoice.total_amount)

The max_retries=3 parameter is not a simple retry on network failure. When Pydantic validation fails on the model's response, Instructor constructs a new message containing the specific validation error and sends it back to the model, asking it to correct the output. The model sees exactly what was wrong — "the value '2024/01/15' does not match the ISO 8601 pattern" — and attempts to self-correct. This is fundamentally different from a blind retry. In our experience with invoice and contract extraction pipelines, this mechanism resolves the vast majority of residual failures that slip past schema-level constraints.

Instructor supports OpenAI, Anthropic, Google Gemini, Mistral, Cohere, and local models through LiteLLM, all via the same response_model interface. Switching providers for an extraction task is a one-line change. This matters for teams that run different models for different cost/performance tiers. See the Instructor models documentation for the provider-specific adapter details.

Local models and constrained decoding: Outlines, Guidance, xgrammar

For self-hosted or local model inference — vLLM, SGLang, llama.cpp, TensorRT-LLM — provider-level structured outputs are not available. You need to implement constrained decoding at the inference engine level. This is where the architecture changes fundamentally.

How constrained decoding works

When you pass a JSON Schema to a constrained decoding system, the schema is compiled into a finite state machine (FSM). This FSM represents every valid token path through the schema — every allowed sequence of tokens from the opening brace to the final closing brace. At each token generation step, the inference engine intersects the current FSM state with the model's full vocabulary and sets the logit of every token that would violate the schema to negative infinity before the sampling step. Invalid tokens have zero probability of selection. The model does not produce bad output and then get corrected — it literally cannot produce bad output.
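A toy illustration of that masking step. Real engines operate on tensor logits over the full vocabulary; this sketch uses a dict for clarity and is not how any production engine is implemented:

```python
import math

def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the logit of every schema-violating token to -inf before sampling."""
    return {tok: (l if tok in allowed else float("-inf")) for tok, l in logits.items()}

def softmax(logits: dict[str, float]) -> dict[str, float]:
    m = max(l for l in logits.values() if l != float("-inf"))
    exps = {t: (math.exp(l - m) if l != float("-inf") else 0.0) for t, l in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

# Suppose the FSM state after '{"currency": ' allows only an opening quote.
logits = {'"': 1.2, "4": 2.5, "tr": 0.3}
probs = softmax(mask_logits(logits, allowed={'"'}))
# Disallowed tokens end up with exactly zero probability, so sampling
# cannot select them no matter the temperature.
```

Note that the model's preferred token here ("4", with the highest raw logit) is simply unavailable; this is the mechanism behind both the reliability guarantee and the quality caveat discussed below.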

Outlines (by dottxt-ai) is the most widely used Python library for constrained generation. It supports JSON Schema, Pydantic models, regex patterns, and arbitrary context-free grammars. It integrates with vLLM and llama.cpp and exposes a clean API:

from pydantic import BaseModel
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.3")

class Invoice(BaseModel):
    vendor_name: str
    invoice_date: str
    total_amount: float
    currency: str

generator = outlines.generate.json(model, Invoice)
invoice = generator(f"Extract invoice data from: {invoice_text}")
# invoice is a validated Invoice instance, zero schema failures

Guidance (by Microsoft) takes a different approach: instead of schema-compiled FSMs, it gives you a template language that interleaves generation and constraint inline. It is more expressive for complex generation patterns but has a steeper learning curve. Useful when your extraction logic is conditional — different schemas for different document types within a single generation pass.

xgrammar is the current performance state-of-the-art for constrained decoding. As of early 2026, it is the default structured generation backend for vLLM, SGLang, and TensorRT-LLM. Through vocabulary partitioning and adaptive token mask caching, xgrammar achieves under 40 microseconds per token of overhead — effectively zero cost compared to the generation step itself. For high-throughput self-hosted pipelines, xgrammar means constrained decoding has no practical performance penalty.

The choice between these libraries is mostly driven by your inference stack. vLLM users get xgrammar automatically. If you are using Outlines standalone, be aware that complex schemas — deeply nested objects, large enums, recursive definitions — can produce large FSMs whose compilation takes several seconds on first use. Cache your compiled generators. Do not recompile the same schema per-request.

Lesson learned

Constrained decoding eliminates structural failures but can subtly degrade model quality on hard extractions. When the model's natural next token is invalid and the system masks it, the model is forced to choose from a constrained vocabulary. For highly ambiguous inputs, this can cause the model to fill required fields with plausible-sounding but incorrect values rather than leaving them empty. Schema design matters more with constrained decoding, not less — make required fields truly required and use Optional fields generously.

Schema design patterns that matter in production

The schema you define has a direct impact on extraction quality, not just on structural validity. Field descriptions become part of the prompt — they are sent to the model as part of the schema and directly influence what the model generates. Treat them like prompt engineering, not documentation. For the full system prompt + schema + few-shot stack used in production extraction, see advanced prompt engineering for production.

Flat vs nested schemas

Keep nesting to two levels maximum where possible. Deeply nested schemas — objects within arrays within objects within objects — increase the FSM complexity, slow schema compilation, and raise the rate of semantic errors on smaller models. If you find yourself going three levels deep, consider whether your schema design is encoding domain structure that should live in your application logic instead.

Flat schemas also make retry-with-feedback more effective. When a validation error points to line_items[2].unit_price, the model has to reason about nested array indexing to understand the correction. When the error points to total_amount, the correction is immediate.
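The difference is visible in the error locations Pydantic reports. A small demonstration, with models trimmed to the fields needed here:

```python
from pydantic import BaseModel, ValidationError

# Mirrors the earlier invoice models, trimmed down.
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    total_amount: float
    line_items: list[LineItem]

try:
    Invoice(
        total_amount="not a number",
        line_items=[{"description": "x", "quantity": 1, "unit_price": "oops"}],
    )
except ValidationError as e:
    locs = [".".join(str(p) for p in err["loc"]) for err in e.errors()]
    # Flat error location: 'total_amount'.
    # Nested error location: 'line_items.0.unit_price'.
```

The flat location is a direct instruction to the model; the nested one requires it to reason about array indexing before it can self-correct.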

Enums for constrained values

Always use enum for fields with a known value set — document types, status codes, currency codes, category labels. This is where structured outputs shine: the model cannot return a value outside your enum, so your application code does not need to handle unexpected strings. Enums also significantly reduce the effective vocabulary at those positions in the generation, which improves accuracy on smaller models.
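In Pydantic, a Literal type is enough to get the enum into the serialized schema, so a constrained decoder can only ever emit one of the listed values at that position:

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class Classification(BaseModel):
    currency: Literal["EUR", "USD", "GBP"]

# The generated JSON Schema carries the enum constraint.
schema = Classification.model_json_schema()
assert schema["properties"]["currency"]["enum"] == ["EUR", "USD", "GBP"]
```

Application-side validation rejects out-of-set values too, so even a provider without enum-aware constrained decoding cannot smuggle an unexpected string past this model.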

Optional fields and discriminated unions

For OpenAI strict mode, optional fields must be expressed as a union with null:

class Contract(BaseModel):
    party_a: str
    party_b: str
    effective_date: str
    expiry_date: Optional[str] = None        # Instructor handles the null union
    governing_law: Optional[str] = None
    contract_type: Literal["service", "nda", "employment", "partnership"]

Discriminated unions are powerful for pipelines that process heterogeneous document types. Rather than defining a single monolithic schema that tries to capture every document type, use a discriminator field to route to the appropriate schema:

from typing import Union, Literal, Annotated
from pydantic import BaseModel, Field

class InvoiceDocument(BaseModel):
    document_type: Literal["invoice"]
    vendor_name: str
    total_amount: float
    invoice_number: str

class ContractDocument(BaseModel):
    document_type: Literal["contract"]
    party_a: str
    party_b: str
    effective_date: str

class UnknownDocument(BaseModel):
    document_type: Literal["unknown"]
    raw_summary: str

DocumentExtraction = Annotated[
    Union[InvoiceDocument, ContractDocument, UnknownDocument],
    Field(discriminator="document_type")
]

This pattern keeps each sub-schema lean, makes downstream routing trivial (isinstance(result, InvoiceDocument)), and degrades gracefully on unrecognized documents through the UnknownDocument fallback type.
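Routing on the discriminator can be sketched with Pydantic's TypeAdapter. This is a condensed version of the union above, trimmed to two document types so it stands alone:

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class InvoiceDocument(BaseModel):
    document_type: Literal["invoice"]
    vendor_name: str

class UnknownDocument(BaseModel):
    document_type: Literal["unknown"]
    raw_summary: str

DocumentExtraction = Annotated[
    Union[InvoiceDocument, UnknownDocument],
    Field(discriminator="document_type"),
]

adapter = TypeAdapter(DocumentExtraction)

# The discriminator field selects the concrete model class.
doc = adapter.validate_python({"document_type": "invoice", "vendor_name": "Acme GmbH"})
other = adapter.validate_python({"document_type": "unknown", "raw_summary": "fax cover sheet"})
```

Downstream code branches on isinstance, and anything the model cannot classify lands in the fallback type instead of failing validation.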

Field order as reasoning order

Field order in your schema is generation order. Place fields that require reasoning — computed values, classifications, confidence scores — after the fields they depend on. A model generating a total_amount field after line_items can cross-check its sum. A model generating total_amount first has no context to verify against. This is a non-obvious but measurable quality improvement for complex extractions.
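Pydantic preserves field definition order in the serialized schema, so you can enforce this ordering directly in the model. A minimal sketch (ChecksumInvoice is an illustrative name, not from the earlier examples):

```python
from pydantic import BaseModel

class ChecksumInvoice(BaseModel):
    # line_items come first, so the model generates them before the total...
    line_items: list[float]
    # ...and has their values in context when it reaches total_amount.
    total_amount: float

# Definition order survives into the JSON Schema's properties.
order = list(ChecksumInvoice.model_json_schema()["properties"])
```

Because the schema's property order drives generation order, reordering the Python class is the entire intervention.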

Validation and retry strategies

Even with provider-level structured outputs, validation failures occur. Schema conformance is guaranteed; semantic correctness is not. Your retry strategy should distinguish between these two failure types.

Structural failures (wrong types, missing fields, extra fields) should not reach your application code at all. If you're using Instructor with max_retries, they won't. If you're using raw provider structured outputs, they shouldn't happen with strict mode. If they do, it's a sign your schema has a feature that the provider's strict mode doesn't support.

Semantic failures (wrong values, misidentified fields, hallucinated data) require application-level validation. Define Pydantic validators for domain constraints:

from pydantic import BaseModel, Field, field_validator
from datetime import date

# Reuses the LineItem model defined earlier in the article
class Invoice(BaseModel):
    vendor_name: str
    invoice_date: str
    total_amount: float = Field(ge=0.0)
    line_items: list[LineItem]
    computed_total: float = Field(ge=0.0)

    @field_validator("invoice_date")
    @classmethod
    def validate_date_format(cls, v: str) -> str:
        try:
            date.fromisoformat(v)
        except ValueError:
            raise ValueError(f"invoice_date must be ISO 8601, got: {v!r}")
        return v

    @field_validator("computed_total")
    @classmethod
    def validate_totals_match(cls, v: float, info) -> float:
        if "line_items" in info.data:
            expected = sum(item.quantity * item.unit_price for item in info.data["line_items"])
            if abs(v - expected) > 0.01:
                raise ValueError(
                    f"computed_total {v} does not match line_items sum {expected:.2f}"
                )
        return v

When Instructor's retry-with-feedback kicks in, it sends the full Pydantic ValidationError message back to the model. A validation error like "invoice_date must be ISO 8601, got: '15/01/2024'" is specific enough that the model corrects it on the next attempt in nearly all cases. A generic "validation failed" message is not.

Set a hard maximum on retries — typically 2 to 3. After exhausting retries, log the raw model output and the validation errors, route the document to a human review queue, and do not silently return a partial or default object. Partial extractions are worse than no extractions because downstream code assumes completeness.
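That exhausted-retries path can be sketched generically. Here extract_fn stands in for the Instructor call, and ExtractionFailed, review_queue, and extract_or_queue are hypothetical names for illustration:

```python
class ExtractionFailed(Exception):
    pass

review_queue: list[dict] = []   # stand-in for your real human-review queue

def extract_or_queue(extract_fn, raw_text: str, max_attempts: int = 3):
    """Retry a validating extractor; on exhaustion, queue for review and fail loudly."""
    errors = []
    for _ in range(max_attempts):
        try:
            return extract_fn(raw_text)
        except ValueError as e:          # e.g. a Pydantic ValidationError
            errors.append(str(e))
    # Never return a partial or default object here: record everything and raise.
    review_queue.append({"raw_input": raw_text, "errors": errors})
    raise ExtractionFailed(f"gave up after {max_attempts} attempts: {errors[-1]}")
```

The important property is the absence of a silent fallback branch: callers either get a validated object or an exception plus a queued review item.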

Streaming with structured outputs

Streaming and structured outputs interact in a non-obvious way. With unstructured streaming you receive tokens as they arrive and render them progressively. With structured outputs, the JSON is only syntactically complete — and therefore parseable — at the end of generation. Streaming mid-generation produces partial JSON that cannot be passed to json.loads().

Instructor addresses this through its create_partial method, which yields a Pydantic model with progressively populated fields as the JSON streams in:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

for partial_invoice in client.chat.completions.create_partial(
    model="gpt-4o-2024-11-20",
    response_model=Invoice,
    messages=[{"role": "user", "content": f"Extract: {invoice_text}"}],
):
    # partial_invoice is a valid Invoice instance
    # Fields populated so far have values; remaining fields are None
    if partial_invoice.vendor_name:
        update_ui_vendor(partial_invoice.vendor_name)
    if partial_invoice.total_amount is not None:
        update_ui_total(partial_invoice.total_amount)

The latency benefit of streaming structured outputs is real but context-dependent. If you are populating a UI where different fields appear in different parts of the interface, streaming lets you show the vendor name and date while the line items are still generating. If you are feeding the output into a downstream computation that needs the complete object, streaming adds implementation complexity without reducing time-to-first-complete-output.

One constraint worth noting: OpenAI's strict json_schema mode and streaming are compatible, but Anthropic's tool use streaming requires reassembling the tool input from streaming delta events before you can parse it. The Instructor adapter handles this transparently, which is another argument for using the abstraction layer rather than the raw provider SDK for anything involving streaming.

When not to use structured outputs

Structured outputs are not always the right tool. Knowing when to avoid them is as important as knowing how to use them.

Creative and freeform generation. If you are generating marketing copy, summarizing documents for human reading, or producing narratives, structured outputs add friction without benefit. The token masking that makes structured outputs reliable also constrains the model's generation space in ways that hurt creative fluency. Use structured outputs for extraction and classification; use free generation for synthesis.

Open-ended analysis where the schema is unknown. When you are exploring a domain and do not yet know the right schema — early-stage data science, exploratory document analysis, research summarization — forcing a schema prematurely locks you into a representation that may not fit the data. Start with free generation to understand what the data actually contains, then design a schema informed by that understanding.

When reasoning quality matters more than output format. Chain-of-thought and extended thinking produce better results when the model can reason freely before committing to an answer. One pattern that works well: let the model generate a free-text reasoning trace first, then in a second pass (or using Instructor's chain_of_thought pattern) extract the structured output from that trace. The first pass improves accuracy; the second pass enforces structure. For multi-agent pipelines, this often maps naturally to a planner-extractor architecture.

Very large output schemas. Schemas with hundreds of fields, deeply nested arrays, or many large enums consume a significant number of input tokens and can push smaller models into poor-quality extraction territory. If your schema has more than 40–50 fields, consider whether it should be decomposed into multiple targeted extraction calls rather than one monolithic pass.

Cost and latency trade-offs

The cost impact of structured outputs is real but often overstated. Let's be precise about where the overhead actually comes from.

Schema token overhead. Every structured output request sends your JSON Schema as part of the request payload. A moderate schema — 10 to 15 fields with descriptions — adds approximately 300 to 600 input tokens. At GPT-4o pricing (~$2.50 per million input tokens), this is $0.00075 to $0.0015 per request. At 1 million daily extractions, that is $750 to $1,500 per day in schema overhead. Not zero, but not the dominant cost driver unless your base extraction is very short. OpenAI's schema caching (applied automatically when you reuse schemas) significantly reduces this for high-throughput use cases.
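The arithmetic is worth sanity-checking against your own volumes, using the figures quoted above:

```python
# Schema overhead at the numbers above (approximate GPT-4o input pricing).
price_per_million_input = 2.50          # USD per 1M input tokens
schema_tokens_low, schema_tokens_high = 300, 600

cost_low = schema_tokens_low / 1_000_000 * price_per_million_input    # $0.00075/request
cost_high = schema_tokens_high / 1_000_000 * price_per_million_input  # $0.0015/request

daily_requests = 1_000_000
daily_low = cost_low * daily_requests    # $750/day
daily_high = cost_high * daily_requests  # $1,500/day
```

Swap in your own schema size, pricing tier, and request volume; the structure of the calculation is the point, not the specific dollar figures.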

Retry cost. With Instructor's max_retries=3, worst-case cost is 3x the base extraction cost. In practice, with strict mode enabled, retry rates are below 2% in well-designed pipelines — the retries add less than 2% to your average cost per extraction.

Model tier differences. Function-calling capable models are not uniformly priced. Verifying whether a specific model supports structured outputs requires checking the provider's documentation for that model version. Smaller models within the same provider family (GPT-4o-mini, Claude Haiku) support structured outputs at significantly lower cost per token. For high-volume extraction pipelines where the schema is simple and the documents are well-structured, a smaller model with structured outputs often outperforms a larger model with JSON mode — both in cost and in reliability.

Latency impact. Provider-level structured outputs add no measurable latency compared to standard generation for the same output length. The token masking is applied during the generation step that was already happening. For local models using xgrammar, the overhead is under 40 microseconds per token — imperceptible. The dominant latency factor is output token count, which is determined by your schema's complexity, not by whether you use structured outputs. A schema that generates 500 tokens of JSON takes 500 tokens of generation time regardless of the enforcement mechanism.

Lesson learned

The teams that optimize structured output costs most effectively are the ones that design their schemas for extraction precision rather than schema completeness. A schema with 8 well-chosen fields and detailed descriptions outperforms a schema with 30 fields at a third of the token cost. If you find yourself adding fields "just in case", you are paying for noise. Define your schema from the downstream use case backward — only extract what a downstream system will actually consume.

The decision tree in practice

Given everything above, here is how to make the architecture decision for a new extraction pipeline:

  • API-hosted model, Python, standard schema complexity: Instructor + OpenAI strict json_schema. This is the boring-but-correct default. Add Anthropic tool use as a fallback provider through the same Instructor interface if needed.
  • Self-hosted or local model (vLLM, llama.cpp, SGLang): xgrammar via vLLM's structured outputs API, or Outlines for standalone use. Do not implement constrained decoding yourself.
  • Complex conditional schemas or grammar-constrained generation: Guidance for fine-grained control, or Outlines with regex patterns for intermediate complexity.
  • Multi-modal document extraction (PDFs with images, scanned documents): Evaluate Gemini with response_schema for documents where vision and text are both needed.
  • High-volume, cost-sensitive pipelines: Profile GPT-4o-mini or Claude Haiku with strict mode before defaulting to frontier models. For many document types the smaller models are entirely sufficient.
  • Agent tool use in an agentic pipeline: Provider-native tool use (Anthropic tool_choice, OpenAI function calling) maps directly to the agent architecture. See our post on Agentic RAG for how this integrates with retrieval planning. Production AI agents at any meaningful scale depend on reliable structured outputs to coordinate between agent steps.
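The boring-but-correct default from the first bullet looks roughly like this. This is a hedged sketch: the model name, field set, and prompt are placeholders, and the client assumes an `OPENAI_API_KEY` in the environment; the Instructor calls (`instructor.from_openai`, `response_model`, `max_retries`) are the library's documented interface:

```python
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    vendor_name: str = Field(description="Vendor name from the invoice header")
    total_amount: float = Field(description="Grand total including tax")


def extract_invoice(invoice_text: str) -> Invoice:
    # Imports kept local so this module loads without the packages installed.
    import instructor
    from openai import OpenAI

    client = instructor.from_openai(OpenAI())  # patches the client to accept response_model
    return client.chat.completions.create(
        model="gpt-4o-mini",                  # placeholder; pick per the cost notes above
        response_model=Invoice,               # Instructor validates against this Pydantic model
        max_retries=2,                        # retry with validation feedback on failure
        messages=[{"role": "user", "content": f"Extract the invoice fields:\n{invoice_text}"}],
    )
```

The return value is a validated `Invoice` instance, not a string — downstream code never touches raw JSON.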

What does not fit anywhere in this decision tree: naive JSON prompting. There is no production scenario in 2026 where "just tell it to respond in JSON" is the right choice when better alternatives exist at the same latency and cost. The 5–15% failure rate is not acceptable, and the failure mode — silent bad data downstream — is particularly dangerous for pipelines that do not have end-to-end monitoring. If you are auditing an existing pipeline and find naked JSON prompting without schema enforcement, that is a high-priority fix. For the broader picture of where structured outputs fit in a full LLM integration stack, see our LLM integration service page.

Frequently asked questions

What is the difference between JSON mode and structured outputs?

JSON mode guarantees syntactically valid JSON but enforces no schema — field names, types, and required properties are not constrained. Structured outputs with a json_schema and strict: true use constrained decoding to guarantee both valid JSON and schema conformance. JSON mode still fails on wrong field names, missing required fields, or incorrect value types. Structured outputs with strict mode do not.

How does constrained decoding actually work?

The JSON Schema is compiled into a finite state machine (FSM) that represents every valid token path through the schema. At each generation step, the inference engine intersects the current FSM state with the model vocabulary and sets the logit of every invalid token to negative infinity before sampling. Invalid tokens have zero probability of selection. Libraries like xgrammar compile these FSMs at schema-compilation time, adding under 40 microseconds per token of overhead.
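The masking step can be illustrated with a toy example — a deliberately tiny vocabulary and a hand-picked FSM state, not any real engine's internals:

```python
import math

# Toy vocabulary and a fake logit vector from the model's forward pass.
vocab = ["{", "}", '"name"', ":", "true", "banana"]
logits = [1.2, 0.3, 2.5, 0.8, 0.1, 3.0]

# Pretend the FSM state after emitting "{" only allows an object key or a closing brace.
allowed = {'"name"', "}"}

# Constrained decoding: set every invalid token's logit to -inf before sampling.
masked = [l if tok in allowed else -math.inf for tok, l in zip(vocab, logits)]

# Sampling over the masked logits can only ever pick an allowed token,
# even though "banana" had the highest raw logit.
best = vocab[masked.index(max(masked))]
```

This is why strict-mode output physically cannot be malformed: invalid continuations are removed from the distribution before sampling, not filtered after the fact.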

Do I still need Instructor if the provider already enforces the schema?

For most Python teams, yes. Instructor adds automatic retry-with-validation-feedback, Pydantic-native model definitions, streaming support for partial objects, and multi-provider compatibility on top of raw function calling or structured outputs. It does not replace provider-level constrained decoding — it sits above it. The combination of Instructor with OpenAI strict json_schema gives you both schema-enforced generation and application-level validation with minimal boilerplate.
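The retry-with-validation-feedback loop is conceptually simple. This is a toy sketch of the pattern, not Instructor's implementation — `fake_llm` stands in for a model call that fails once and then self-corrects after seeing the validation error:

```python
from pydantic import BaseModel, ValidationError


class Person(BaseModel):
    name: str
    age: int


def fake_llm(prompt: str, attempt: int) -> str:
    # Stand-in for a model call: the first attempt returns a wrong type,
    # the retry (whose prompt includes the validation error) succeeds.
    return '{"name": "Ada", "age": "not a number"}' if attempt == 0 else '{"name": "Ada", "age": 36}'


def extract_with_retries(prompt: str, max_retries: int = 2) -> Person:
    for attempt in range(max_retries + 1):
        raw = fake_llm(prompt, attempt)
        try:
            return Person.model_validate_json(raw)
        except ValidationError as e:
            # Feed the validation error back so the model can self-correct.
            prompt += f"\nYour previous output failed validation:\n{e}\nReturn corrected JSON only."
    raise RuntimeError("validation failed after retries")
```

The key design choice is that the retry prompt carries the concrete Pydantic error, which is far more effective than blindly re-asking.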

When should I use Outlines or xgrammar instead of a provider API?

Use Outlines or xgrammar when you are running local or self-hosted models (vLLM, SGLang, llama.cpp) that do not have native provider-level structured output support. These libraries implement constrained decoding at the inference engine level, meaning the model physically cannot produce malformed output regardless of schema complexity. For API-hosted models from OpenAI, Anthropic, or Google, provider-level structured outputs plus Instructor is the simpler path.
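With vLLM's OpenAI-compatible server, constrained decoding is requested by attaching a JSON Schema to the request. The sketch below shows only the request body; the model name is a placeholder, and `guided_json` is vLLM's extension field for schema-guided generation (with the `openai` client you would pass it via `extra_body`):

```python
# JSON Schema the server compiles into a grammar (xgrammar-backed in recent vLLM).
schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["vendor_name", "total_amount"],
}

# Request body for vLLM's /v1/chat/completions endpoint.
request_body = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    "messages": [{"role": "user", "content": "Extract the invoice fields as JSON."}],
    "guided_json": schema,  # vLLM extension: constrained decoding against this schema
}
```

The point of the sketch: with self-hosted inference, schema enforcement is a server-side decoding concern, so no client-side retry logic is needed for malformed JSON.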

Do structured outputs cost more than plain generation?

Token cost per request is identical — you pay for the tokens you use regardless of mode. The cost impact comes from two sources: schema tokens added to your input (typically 200-800 tokens for moderate schemas), and retry-on-failure increasing total token spend per successful extraction. At scale, schema caching and lean schema design significantly reduce the cost premium. In well-designed pipelines with strict mode, retry rates fall below 2%.
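A back-of-envelope comparison makes the trade-off concrete. The figures below are the ones cited in this article (a 5-15% naive failure rate, roughly 200-800 schema tokens, sub-2% strict-mode retries); the input and output token counts are arbitrary illustrative values:

```python
def tokens_per_success(input_tokens: int, schema_tokens: int,
                       output_tokens: int, retry_rate: float) -> float:
    """Expected total tokens spent per successful extraction."""
    per_attempt = input_tokens + schema_tokens + output_tokens
    # Each failed attempt is a full re-spend; expected attempts = 1 / (1 - retry_rate).
    return per_attempt / (1 - retry_rate)


naive = tokens_per_success(1000, 0, 300, 0.10)     # no schema tokens, 10% failure
strict = tokens_per_success(1000, 400, 300, 0.02)  # 400 schema tokens, <2% retries
```

Run the numbers and the premium of strict mode is essentially the schema tokens — which is exactly what lean schema design attacks. And the naive figure is flattering: it counts a retry as a detected failure, while in practice many naive failures are silent bad data, which has no token cost and a much higher business cost.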

Can I stream structured outputs?

Yes. Instructor supports streaming via its create_partial method, which yields Pydantic models with progressively populated fields as they arrive while leaving unresolved fields as None. This enables field-by-field progressive rendering in UIs that can tolerate partial state. OpenAI's strict json_schema mode and streaming are compatible. Anthropic's tool use streaming requires reassembling the tool input from streaming delta events — the Instructor adapter handles this transparently.
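Conceptually, the partial objects behave like a Pydantic model with all fields optional, validated at each point in the stream. This toy sketch illustrates the shape of what create_partial yields; it is not Instructor's implementation, and the snapshots stand in for accumulated streaming deltas:

```python
from typing import Optional
from pydantic import BaseModel


class PartialInvoice(BaseModel):
    # All fields optional so the object validates at any point mid-stream,
    # mirroring the progressively populated models create_partial yields.
    vendor_name: Optional[str] = None
    total_amount: Optional[float] = None


# Snapshots of the accumulating JSON as deltas arrive from the stream.
snapshots = [
    {},
    {"vendor_name": "Acme Corp"},
    {"vendor_name": "Acme Corp", "total_amount": 1280.50},
]

# A UI would render each state as it arrives, filling fields one by one.
states = [PartialInvoice.model_validate(s) for s in snapshots]
```

The practical consequence: your rendering code must tolerate None in every field until the stream completes, so keep partial-state handling explicit rather than assuming a fully populated object.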

Further reading

  • Production RAG: 5 Failure Modes We Keep Seeing — How structured output failures compound with retrieval failures in production RAG systems. The evaluation and observability principles apply directly to extraction pipelines.
  • Agentic RAG — How tool use and structured outputs are the foundation of agentic retrieval. Every agent step that reads or writes state depends on reliable structured I/O.
  • Multi-Agent Orchestration Compared — LangGraph vs CrewAI vs AutoGen. Structured outputs are a prerequisite for any multi-agent system where agents pass typed data between steps.
  • LLM Integration service — How Tensoria designs and deploys production LLM pipelines, including extraction architecture and validation infrastructure.
  • AI Agents service — When extraction pipelines grow into full agentic systems, the structured output foundation carries forward.
  • OpenAI Structured Outputs documentation — The authoritative reference for strict json_schema, supported schema features, and limitations.
  • Anthropic tool use documentation — Implementation details for structured extraction via Anthropic's tool_choice mechanism.
  • Instructor documentation — Complete API reference, provider adapters, streaming patterns, and schema design guides.
  • Outlines on GitHub — The constrained generation library for local and self-hosted models.
  • Pydantic documentation — Field validators, discriminated unions, and the full model configuration API used in Instructor integrations.

LLM Integration

Building an extraction pipeline or migrating from naive JSON prompting? We design and deploy production-grade LLM integration architecture.

See the service
Anas Rabhi, Data Scientist & Founder, Tensoria

I am a data scientist specializing in LLM engineering and production AI systems. I help engineering teams ship reliable extraction pipelines, structured output architectures, and AI agents that integrate into existing workflows. LLM integration, RAG, fine-tuning, NLP — I work across the stack and focus on systems that deliver measurable results.