Most teams reach for fine-tuning when they should be writing a better system prompt. We have audited enough LLM projects to know this is one of the most expensive, time-consuming mistakes in applied AI engineering — not because fine-tuning is bad, but because it solves a fundamentally different problem than the one teams are usually trying to solve.
This article gives you a concrete decision framework: when prompting is enough (most cases), when to add RAG, when fine-tuning is actually warranted, and how to combine all three in production. Specific cost ranges, latency numbers, and data requirements are included throughout — not as rough estimates, but as working targets you can use for budgeting and architecture decisions. If you want the foundational RAG mechanics first, see our RAG technical guide.
The sequence is always the same: Prompt → RAG → Fine-tune. The mistake is skipping steps.
The Core Mental Model
Before the decision framework, you need one mental model that makes every subsequent decision cleaner.
RAG changes what the model can see right now. Fine-tuning changes how the model behaves every time.
That distinction is load-bearing. RAG injects knowledge at inference time — documents, database rows, API results — directly into the context window. The model reads them, reasons over them, and discards them when the session ends. Fine-tuning modifies the model's weights: its priors, its stylistic defaults, its output patterns. That change is permanent until you retrain.
Prompting does neither. It shapes the model's behavior within a single inference call using only the tokens already in its context. It is the cheapest, fastest, most reversible lever you have.
This maps cleanly to three types of problems:
- Knowledge problems (the model doesn't know X) → RAG
- Behavior problems (the model doesn't reliably do X) → Fine-tuning or better prompting
- Instruction problems (the model doesn't understand what I want) → Prompt engineering
Most "the model doesn't perform well enough" complaints are instruction problems that better prompting solves. The ones that are genuinely behavior problems are the minority. Knowledge problems should never be solved with fine-tuning — and yet that is what teams keep trying to do.
Step 1: Prompting Is Probably Enough
The default answer to "should I fine-tune?" is no. Not because fine-tuning is bad, but because most teams have not written a serious system prompt yet. A serious system prompt is not three sentences. It is a carefully engineered instruction set with:
- A precise role definition with explicit behavioral constraints
- Output format specification (schema, length, tone, structure)
- Concrete few-shot examples (3–10 is often sufficient)
- Explicit handling of edge cases and refusals
- Chain-of-thought scaffolding for tasks requiring reasoning
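As a concrete reference, here is a minimal sketch of that checklist as an actual prompt payload in the OpenAI chat-message format. The product, persona, schema, and examples are hypothetical placeholders, not a recommended prompt:

```python
# Minimal sketch of a "serious" system prompt following the checklist above.
# Everything product-specific (Acme, the schema, the examples) is invented.
SYSTEM_PROMPT = """\
You are the support assistant for Acme Analytics (hypothetical product).

Constraints:
- Answer only questions about Acme Analytics; politely refuse anything else.
- Never invent feature names or prices. If unsure, say so and point to the docs.

Output format, on every reply:
{"answer": "<plain-text answer>", "confidence": "high" | "medium" | "low"}

Edge cases:
- Billing disputes: do not attempt to resolve; add "handoff": "billing" to the JSON.
"""

FEW_SHOT = [  # 3-10 reviewed examples in production; two shown for brevity
    {"role": "user", "content": "Can I export dashboards to PDF?"},
    {"role": "assistant",
     "content": '{"answer": "Yes: Share > Export as PDF.", "confidence": "high"}'},
    {"role": "user", "content": "What stocks should I buy?"},
    {"role": "assistant",
     "content": '{"answer": "I can only help with Acme Analytics.", "confidence": "high"}'},
]

messages = [{"role": "system", "content": SYSTEM_PROMPT},
            *FEW_SHOT,
            {"role": "user", "content": "How do I rotate my API key?"}]
```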
A well-engineered prompt with 5 high-quality few-shot examples regularly matches fine-tuned performance on structured tasks. Models like GPT-4o, Claude 3.7 Sonnet, and Gemini 2.0 Flash are instruction-following machines with 200K+ token context windows. Before you spend three weeks curating a fine-tuning dataset, spend three days doing serious prompt engineering with evals — see advanced prompt engineering for production for the full playbook. If you're also still deciding which model to commit to, our provider comparison covers the trade-offs.
When prompting has genuine limits
Prompting breaks down in specific, diagnosable situations:
- Context window pressure at scale: If your system prompt + few-shot examples + retrieved context regularly exceeds 50K tokens and you are paying for every token on a frontier model, you are burning money on input cost. At GPT-4o pricing ($2.50/M input tokens), a 60K-token context at 10,000 daily queries costs ~$1,500/day — $45,000/month — in input tokens alone. Fine-tuning can compress stable instructions into weights and eliminate that overhead.
- Reliability ceiling on strict schema: If you need the model to output a specific JSON schema on 99.9% of calls with no variation, prompt engineering has a practical ceiling. Structured output APIs (OpenAI's `response_format`, Anthropic's tool use) raise that ceiling substantially — try those first (a sketch follows this list). If you still have a 1–5% schema failure rate that you cannot tolerate, fine-tuning is the right fix.
- Latency-sensitive consumer applications: A 4,000-token system prompt adds ~200ms to P50 latency and ~400ms to P95 on most inference APIs. For a real-time voice assistant or sub-500ms UI interaction, that overhead is architecturally significant. Compressing the prompt into weights via fine-tuning is a legitimate optimization.
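Here is a sketch of the structured-output route, using OpenAI's `response_format` with a JSON Schema in strict mode. The triage schema and its field names are invented for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical ticket-triage schema; strict mode rejects any deviation.
schema = {
    "name": "ticket_triage",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["billing", "bug", "how_to"]},
            "priority": {"type": "string", "enum": ["p1", "p2", "p3"]},
        },
        "required": ["category", "priority"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The export button crashes the app."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)  # conforms to the schema, or the call errors
```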
Lesson learned
We audited a SaaS team that had spent six weeks fine-tuning Llama 3 8B on 3,000 customer support examples because "the model's tone wasn't right." When we read the system prompt, it was two sentences. We rewrote the system prompt to 800 tokens with 6 few-shot examples and a clear persona definition. The tone problem disappeared. The fine-tuned model they had spent six weeks building was never deployed.
Step 2: Add RAG for Knowledge Problems
If you have exhausted prompt engineering and the remaining failures are knowledge failures — the model gets facts wrong, is not aware of recent events, or lacks domain-specific information — RAG is the right next step.
RAG's advantages in this context are structural:
- Updateable without retraining: Add a document to the vector index tonight; the model has access to it tomorrow. No training run, no deployment, no downtime.
- Traceable: Every answer cites a source chunk. This is non-negotiable in regulated industries (legal, finance, healthcare) where "the model said so" is not a valid justification.
- Correctable: When the model gets something wrong, you fix the source document or improve the retrieval, not re-run a training job.
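To make "injects knowledge at inference time" concrete, here is a deliberately minimal retrieve-and-generate loop. A real deployment swaps the in-memory corpus for a vector database with top-k retrieval and reranking; the model names are current OpenAI defaults, not endorsements:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Toy two-document corpus; in production this lives in a vector index.
docs = ["Refunds are processed within 14 days of the return request.",
        "Enterprise plans include SSO via SAML and SCIM provisioning."]
doc_vecs = embed(docs)

question = "How long do refunds take?"
q = embed([question])[0]
scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
context = docs[int(scores.argmax())]  # top-1 for brevity; use top-k in practice

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": f"Answer using only this context. Cite it.\n\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```

Updating the system's knowledge is now a data operation: change `docs`, re-embed, done. No training run is involved anywhere.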
RAG infrastructure cost in 2026
For a small-to-medium deployment (10,000–50,000 queries per day), the monthly cost of a production RAG stack breaks down as follows:
| Component | Managed (cloud) | Self-hosted |
|---|---|---|
| Vector database (Pinecone / Qdrant / Weaviate) | $70–500/mo | ~$30–80/mo (infra only) |
| Embedding API (text-embedding-3-small or equiv.) | $50–300/mo | ~$0 (open model on GPU) |
| LLM inference (GPT-4o mini / Mistral / Claude Haiku) | $100–1,500/mo | $150–600/mo (GPU rental) |
| Reranking (Cohere Rerank or cross-encoder) | $50–200/mo | ~$20–50/mo (GPU sharing) |
| Total | $270–2,500/mo | $200–730/mo |
These are operational costs. They do not include engineering time to build and maintain the pipeline. For the full picture of what can go wrong in a production RAG deployment, see our article on production RAG failure modes.
When RAG is not the right tool
RAG fails in predictable situations that teams consistently underestimate:
- The knowledge is too volatile to index reliably: Real-time pricing, live inventory, streaming sensor data. RAG with a vector index assumes a document corpus with bounded update frequency. If the data changes faster than you can re-embed and re-index it, you need a structured database with text-to-SQL, not a vector store (see the sketch after this list).
- The task has no retrievable external knowledge: If you want the model to write in a very specific stylistic voice, classify inputs into domain-specific categories, or follow a proprietary decision tree that does not exist as indexable text, RAG adds no value. This is a behavior problem, not a knowledge problem.
- Latency is truly critical: A full RAG pipeline — query embedding, ANN search, optional reranking, LLM generation — adds 300–1,200ms to baseline LLM latency at P50. For sub-500ms applications, this is often too much.
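For the volatile-data case flagged above, this sketch shows the text-to-SQL shape of the solution: the model writes a query and the freshest data is read at request time, so there is nothing to re-embed. The schema, prompt, and guardrail are illustrative, not hardened:

```python
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("inventory.db")  # hypothetical live operational table

SCHEMA = "inventory(sku TEXT, warehouse TEXT, qty INTEGER, updated_at TEXT)"

def answer_from_live_data(question: str) -> str:
    sql = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Translate the question into one read-only SQLite SELECT "
                        f"over {SCHEMA}. Return SQL only, no prose, no backticks."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content.strip()
    if not sql.lower().startswith("select"):            # minimal guardrail; use a
        raise ValueError(f"refusing non-SELECT: {sql}")  # real SQL validator in prod
    rows = db.execute(sql).fetchall()  # always current, never re-indexed
    return f"{sql}\n{rows}"

print(answer_from_live_data("How many units of SKU A-100 are in the Lyon warehouse?"))
```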
Step 3: Fine-tune for Behavior, Not Knowledge
Here is the most important misconception to kill before it costs you a training run: fine-tuning is not a knowledge injection mechanism. This is probably the single most common, most expensive mistake in LLM engineering.
When you fine-tune a model on factual data — product documentation, internal policies, recent events — several things happen that you do not want:
- The model gains statistical patterns from that vocabulary, not addressable factual memories.
- It will answer questions about that domain with more confidence, but the facts will be blended with the base model's priors. Hallucinations become more fluent and harder to detect.
- The next time your documentation changes, you have to retrain. The training data is now a moving target.
If your problem statement is "the model doesn't know about our product," the answer is RAG. Period.
The legitimate use cases for fine-tuning
Fine-tuning is the right tool in a narrow, well-defined set of situations:
- Consistent structured output: You need a model to reliably emit a specific JSON schema, a proprietary XML format, or a constrained output grammar on every single call. Structured output APIs should be your first attempt. If the failure rate is still unacceptable (>0.5–1%), fine-tuning the output behavior is justified.
- Style and tone that prompting cannot reliably produce: If your brand requires a very specific voice — unusually terse, a particular technical register, a domain-specific jargon set — and you have written extensive prompt instructions without reaching the required consistency, fine-tuning on 500–2,000 high-quality examples of the target style is the right lever.
- Function-calling patterns for domain-specific tool use: If you are building an agent that needs to invoke a complex set of proprietary tools, and the base model consistently mis-uses them despite careful prompting, fine-tuning on tool-use demonstrations substantially improves reliability. This is one of the cleanest fine-tuning use cases in enterprise AI.
- Compressing a large stable system prompt for cost or latency: If you have a 6,000-token system prompt that is stable across 100% of your calls and you are running at high volume, fine-tuning that behavior into the model weights is a legitimate cost optimization — not a quality improvement, a cost reduction.
- Domain-specific token vocabulary and abbreviations: Specialized fields (radiology, derivatives trading, industrial control systems) have abbreviation sets and term patterns that base models handle inconsistently. A small LoRA trained on domain text normalizes that vocabulary without touching factual knowledge.
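Whatever the use case, the unit of work in fine-tuning is the training example. This is OpenAI's chat fine-tuning JSONL format (the format is real; the ticket-triage content is a hypothetical structured-output example):

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "Emit triage JSON for every ticket."},
        {"role": "user", "content": "App crashes when I export a dashboard."},
        {"role": "assistant", "content": '{"category": "bug", "priority": "p1"}'},
    ]
}

# A few hundred reviewed records like this, one JSON object per line,
# is the entire dataset file you upload for a narrow structured-output task.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```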
Fine-tuning cost benchmarks for 2026
These are working numbers, not theoretical minimums. Real project costs are higher once you account for data curation, evaluation, and iteration.
| Approach | Model | Compute cost | Total project cost (incl. data) |
|---|---|---|---|
| LoRA / QLoRA (self-serve) | Llama 3 8B / Mistral 7B | $50–300 | $2,000–8,000 |
| LoRA (managed API) | Mistral 7B on Together AI | $500–2,000 | $3,000–10,000 |
| Full SFT (self-serve) | Llama 3 70B | $5,000–20,000 | $15,000–50,000+ |
| Fine-tuning API | GPT-4o (OpenAI) | $25/M training tokens | $5,000–30,000+ |
| Embedding fine-tuning | BGE-M3 / E5-large | $100–800 | $2,000–6,000 |
The "total project cost" column is the number that matters for budget decisions. Compute is a minority of the real cost. Data curation — collecting, cleaning, deduplicating, and reviewing training examples — is typically 2–5x the compute cost in engineering time. Evaluation infrastructure, iteration cycles, and deployment add another layer. A "quick LoRA" that cost $200 in GPU time often costs $15,000 in actual engineering effort when you account for the full lifecycle. For the practical engineering side of LoRA and QLoRA — hyperparameters, dataset format, training infra — see our LoRA and QLoRA practical guide.
Data requirements: the hard gate
Fine-tuning has a data prerequisite that teams systematically underestimate. These are minimum viable thresholds, not ideal targets:
- LoRA for a narrow, well-defined task: 500–2,000 high-quality, reviewed examples. Quality dominates — 500 expert-reviewed examples consistently outperform 10,000 scraped ones.
- Full supervised fine-tuning for behavioral change: 5,000–50,000+ examples across a broad task distribution. Below 5,000, you risk overfitting on the training distribution and degrading on edge cases.
- Embedding model fine-tuning (domain retrieval): 1,000–10,000 query-document pairs labeled as relevant. These are hard to generate without domain experts or strong proxy signals from production logs.
- Instruction-following / structured output: 200–500 examples if the task is narrow and the base model is already capable. This is the most tractable fine-tuning scenario in practice.
If you cannot assemble the minimum dataset for your target approach, do not start a fine-tuning project. A model trained on insufficient data will be worse than the base model on out-of-distribution inputs — and detecting that regression requires an eval suite you probably do not have yet either.
Lesson learned
The most common fine-tuning failure pattern we see is teams building a dataset that mirrors production inputs but not production edge cases. The fine-tuned model improves on the 80% of queries it saw variants of, and regresses on the 20% it did not. You only discover this after deployment, because the eval set was sampled from the same distribution as the training set. Always hold out edge cases — adversarial, out-of-domain, malformed inputs — as a separate eval partition before you train anything.
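A minimal version of that discipline in code, assuming a hypothetical tag field on each example marking adversarial, out-of-domain, and malformed inputs:

```python
import json
import random

with open("examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

# Edge cases never enter training; they form their own eval partition.
EDGE_TAGS = {"adversarial", "out_of_domain", "malformed"}
edge_cases, in_dist = [], []
for ex in examples:
    (edge_cases if ex.get("tag") in EDGE_TAGS else in_dist).append(ex)

random.seed(13)
random.shuffle(in_dist)
cut = int(0.9 * len(in_dist))
train, eval_in_dist = in_dist[:cut], in_dist[cut:]

# Report the two eval partitions separately: a model can improve on
# eval_in_dist while regressing on edge_cases, which is the trap described above.
print(f"train={len(train)}  eval_in_dist={len(eval_in_dist)}  edge={len(edge_cases)}")
```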
The Decision Matrix
Here is the full decision matrix. Use the column that matches your primary failure signal, not your intuition about what sounds more sophisticated.
| Failure signal | Root cause | Right approach | Wrong approach |
|---|---|---|---|
| Model doesn't know recent events / internal data | Knowledge gap | RAG | Fine-tuning |
| Output format is inconsistent / schema breaks | Behavior inconsistency | Structured output API, then fine-tuning | RAG |
| Model answers questions incorrectly on my domain | Knowledge gap (usually) | RAG | Fine-tuning |
| Wrong tone / style despite prompt instructions | Behavior gap | Better prompting (few-shot), then fine-tuning | RAG |
| Model misuses custom tools / function calls | Tool-use behavior | Fine-tuning on tool demonstrations | RAG |
| High latency from large system prompt | Context overhead | Prompt caching, then fine-tuning | RAG (does not help) |
| Very high inference cost at scale | Model tier or context length | Smaller model + fine-tuning (distillation) | Fine-tuning a larger model |
| Retrieval returns wrong documents | Retrieval quality | Embedding fine-tuning + reranking | Generator fine-tuning |
Latency Reference: What Each Approach Costs in Milliseconds
Latency is not an afterthought — it is an architectural constraint that should influence your choice at design time, not after you ship. These are approximate P50 values for a typical enterprise production setup:
- Prompt engineering only (no retrieval): 400–900ms for a 2K-token prompt on GPT-4o mini; 600–1,500ms on Claude 3.7 Sonnet
- RAG pipeline (embed + search + generate): 800–2,000ms P50 (add 200–400ms for a reranking step)
- Fine-tuned model serving (self-hosted LoRA): 150–400ms P50 on a single H100, for a 7–8B parameter model at batch size 1
- Fine-tuned model API (OpenAI fine-tuned GPT-4o mini): 500–1,200ms, comparable to base model latency
- RAG + fine-tuned generator: 900–2,200ms (retrieval adds to base model latency)
The latency advantage of fine-tuning over RAG is real but conditional: it applies only if you are self-hosting the fine-tuned model. If you are using a managed fine-tuning API, the latency profile is similar to the base model, and retrieval adds on top of that if you combine them. The infra side of self-hosting — vLLM, batching, GPU selection, autoscaling — is covered in our deploying LLMs to production guide.
Lesson learned
When evaluating latency, measure P95, not P50. A RAG pipeline that averages 900ms has a P95 of 2,400ms if the reranker or vector search occasionally spikes. Users notice the outliers, not the averages. Instrument every stage of your pipeline from day one and set latency SLOs per stage, not just end-to-end. The reranker is almost always the latency surprise.
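A sketch of per-stage instrumentation with hypothetical stage names; the point is that P95 is computed per stage, not only end-to-end:

```python
import time
from collections import defaultdict
import numpy as np

timings = defaultdict(list)  # stage name -> list of latencies in ms

def timed(stage, fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[stage].append((time.perf_counter() - start) * 1000)
    return result

# Per request, wrap each pipeline stage (functions here are placeholders):
#   q_vec  = timed("embed", embed_query, question)
#   chunks = timed("search", vector_search, q_vec, k=20)
#   top    = timed("rerank", rerank, question, chunks)   # the usual P95 culprit
#   answer = timed("generate", generate, question, top)

for stage, ms in timings.items():
    print(f"{stage:>8}: p50={np.percentile(ms, 50):6.0f}ms  "
          f"p95={np.percentile(ms, 95):6.0f}ms  n={len(ms)}")
```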
Combining All Three: The Production-Grade Stack
The framing of "RAG vs fine-tuning" is a false dichotomy for serious production systems. The mature pattern is all three, layered to solve distinct problems.
Here is what the full stack looks like:
- Fine-tuned embedding model: Instead of using a generic embedding model like `text-embedding-3-small`, fine-tune an open-source embedding model (BGE-M3, E5-large-v2) on domain-specific query-document pairs. Retrieval recall on domain vocabulary improves by 15–30 percentage points. This is one of the highest-ROI fine-tuning investments available because it improves every downstream RAG query. The sentence-transformers library is the usual entry point, with HuggingFace PEFT covering adapter-based variants; a minimal training sketch follows this list.
- RAG pipeline for dynamic knowledge: Standard retrieval-augmented generation injecting current, traceable, updatable documents at inference time. Everything from your knowledge base, policies, and product documentation lives here — not in model weights.
- Fine-tuned generator for behavioral consistency: A LoRA adapter on the generator model enforces output schema, stylistic constraints, and function-calling patterns. The model does not know more facts — it behaves more predictably.
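Here is the promised embedding fine-tuning sketch, using sentence-transformers with in-batch negatives. The base model and hyperparameters are reasonable starting points rather than benchmarked choices, and the domain pair is invented:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")

# 1,000-10,000 labeled (query, relevant document) pairs from your domain;
# one invented radiology-flavored pair shown here.
pairs = [
    InputExample(texts=[
        "ILD with ground-glass opacities",
        "Report: bilateral GGO in a pattern consistent with interstitial disease.",
    ]),
    # ...
]

loader = DataLoader(pairs, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch docs act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-m3-domain")  # drop-in replacement in the retrieval pipeline
```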
This architecture is not theoretical. The LoRA paper and the subsequent QLoRA paper established that adapter-based fine-tuning preserves base model quality at a fraction of the cost of full fine-tuning, which is what makes layering adapters on top of retrieval practical. Published results from production teams and open-source communities consistently show that fine-tuned embeddings + fine-tuned generators + retrieval outperform any single approach in isolation on domain-specific benchmarks.
When the combined stack is justified
Not every system needs all three layers. The combined stack is justified when:
- You have a specialized domain with non-standard vocabulary that generic embeddings handle poorly (legal, medical, industrial)
- You need both current knowledge (RAG) and strict behavioral consistency (fine-tuning)
- You are operating at scale where retrieval quality directly translates to measurable business metrics
- You have the evaluation infrastructure to measure each layer's contribution independently
If you are in the early stages of a project, build a solid RAG pipeline first. Add fine-tuning when you have identified a specific, measurable behavioral gap that RAG and prompting cannot close. Add embedding fine-tuning when you have production retrieval data showing systematic domain vocabulary failures. See our guide on Agentic RAG for how this stack extends when you need multi-step reasoning on top of it.
Maintenance Burden: The Cost Nobody Talks About
The one-time training cost is not the real cost of fine-tuning. The real cost is the maintenance burden that begins the day you deploy a fine-tuned model and never ends.
Prompting has near-zero maintenance burden. Update the system prompt, run your eval suite, deploy. The model does not need to be retrained when your requirements change.
RAG requires ongoing maintenance of the document corpus, embedding pipeline, and retrieval quality. This is real work — a production RAG system is a living data pipeline — but it is standard software maintenance, not specialized ML work. When documentation changes, you update the index. No retraining.
Fine-tuning creates a maintenance dependency on your training data. When the behavior you trained needs to change, you need to: (1) update the training data, (2) re-run the fine-tuning job, (3) evaluate the new model on your full eval suite including regression testing, (4) deploy the new adapter or weights. For a LoRA adapter, this cycle takes 1–2 weeks of engineering time. For a full fine-tuned model, it can take 4–8 weeks. If your requirements change quarterly, you are running a continuous fine-tuning operation.
This is why fine-tuning is appropriate for stable behaviors — things that are unlikely to change frequently — and inappropriate for knowledge or policies that evolve over time. The OpenAI fine-tuning documentation and Anthropic's guidance on fine-tuning both frame this correctly: fine-tuning is for teaching the model a persistent skill, not for keeping it informed.
The "Distill" Layer: When You Have Both Budget and Scale
There is a fourth step that rarely comes up in the prompting-vs-RAG-vs-fine-tuning debate: distillation. It is worth naming because it is the right answer for a specific class of high-scale, cost-sensitive problems.
Distillation means using a large frontier model (GPT-4o, Claude 3.7 Sonnet) to generate a high-quality labeled dataset, then fine-tuning a much smaller model (Llama 3 8B, Mistral 7B) on that dataset. The goal is not to match the frontier model's general capability — it is to match its performance on your specific task at 10–20x lower inference cost.
The math when this makes sense: if you are running 500,000 queries/day on GPT-4o at $5/M tokens with a 2,000-token average context, you are spending ~$5,000/day, or roughly $150,000/month. A well-distilled 8B model on self-hosted H100s running at $2/GPU-hour can serve those queries for ~$200–400/day. Against a distillation investment of $20,000–80,000 (compute + data curation + eval), breakeven arrives in weeks at that volume, and within a few months even at a fraction of that traffic.
Distillation is not a shortcut for teams without eval infrastructure. It only works if you can measure quality rigorously — otherwise you will ship a smaller model that is confidently wrong in ways the frontier model never was.
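A sketch of the data-generation half of distillation: the frontier model labels real production inputs, and its outputs become the fine-tuning targets for the small model. File names and the teacher prompt are placeholders; the review step is not optional:

```python
import json
from openai import OpenAI

client = OpenAI()
TEACHER_PROMPT = "Classify the support ticket as billing, bug, or how_to. Label only."

with open("production_inputs.txt", encoding="utf-8") as src, \
     open("distill.jsonl", "w", encoding="utf-8") as out:
    for line in src:
        ticket = line.strip()
        label = client.chat.completions.create(
            model="gpt-4o",  # the teacher; the student is e.g. Llama 3 8B
            messages=[{"role": "system", "content": TEACHER_PROMPT},
                      {"role": "user", "content": ticket}],
        ).choices[0].message.content.strip()
        out.write(json.dumps({"messages": [
            {"role": "system", "content": TEACHER_PROMPT},
            {"role": "user", "content": ticket},
            {"role": "assistant", "content": label},  # teacher output = target
        ]}) + "\n")

# Hand-review a sample before training: teacher mistakes become student behavior.
```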
Practical Starting Point for Each Team Profile
Different teams are at different stages. Here is a concrete starting point based on where you are.
Starting from scratch
Start with a frontier model (GPT-4o mini, Claude 3.5 Haiku) and invest 2–3 weeks in serious prompt engineering with a golden eval set of 50–100 examples. If knowledge failures remain after that, add RAG. Fine-tuning should not appear in your roadmap until you have a working system, a production eval pipeline, and a specific, measurable behavioral gap that prompting and RAG cannot close.
If you already have a RAG pipeline with quality issues: Before considering fine-tuning, read our article on production RAG failure modes. Ninety percent of RAG quality issues are retrieval or evaluation problems, not model problems. Fix the retrieval before touching the model.
If you are running a fine-tuned model and quality is degrading: The first question is whether your training data distribution still matches your production query distribution. Run your fine-tuned model against a held-out set of recent production queries. If there is significant regression compared to a zero-shot frontier model, you have a data drift problem, not a model problem.
If you are deciding between an open-source and a closed model: For prompting and RAG, the frontier closed models (GPT-4o, Claude 3.7) almost always outperform open-source equivalents below 70B on complex reasoning tasks. For fine-tuning, the calculus shifts: a LoRA-adapted Llama 3 8B on a narrow task can match a frontier model at 20x lower inference cost. Our LLM integration service covers this tradeoff in depth for production deployments.
Further Reading
- RAG: A Technical Guide — Mechanics of Retrieval-Augmented Generation: how chunking, embedding, and retrieval actually work, and when RAG is and is not the right architecture.
- Agentic RAG — When you need more than single-shot retrieval: planning, multi-step reasoning, and tool-use on top of a RAG foundation.
- Production RAG failure modes — The 5 patterns that consistently break RAG in production. Read before assuming your quality problem is a model problem.
- Optimize a RAG system: 5 levers — The companion piece on what moves recall and faithfulness once RAG is your chosen path.
- Fine-tuning Mistral on enterprise data — Concrete hands-on if Mistral fine-tuning is your direction.
- RAG vs simple chatbot — Decision guide one level above, for teams not yet sure RAG is the right answer.
- LLM integration — Tensoria's service for teams deciding between model tiers, fine-tuning, and RAG in production systems.
- AI audit — A structured 2–4 week diagnostic for teams who have a working system that is not performing well enough and need to know why.
- RAG systems — End-to-end RAG deployment including embedding fine-tuning, eval infrastructure, and observability.
- HuggingFace PEFT documentation — The standard library for LoRA and QLoRA fine-tuning. Start here for self-serve adapter training.
- LoRA: Low-Rank Adaptation of Large Language Models — The original paper. Required reading before committing to adapter-based fine-tuning.
- QLoRA: Efficient Finetuning of Quantized LLMs — QLoRA made fine-tuning accessible on consumer hardware. Covers quantization-aware training and memory efficiency.
- OpenAI fine-tuning guide — Practical documentation for managed fine-tuning on GPT-4o and GPT-4o mini, including data format requirements and evaluation guidance.
Talk to an engineer
Not sure which approach fits your system? We run structured AI audits in 2–4 weeks.
The Bottom Line
Most teams should not be fine-tuning. They should be writing better prompts, building a real RAG pipeline, and shipping a production eval suite — in that order. Fine-tuning is not a quality upgrade; it is a surgical tool for a narrow class of behavioral problems that simpler approaches cannot solve.
The mental model that unlocks every downstream decision: RAG is for knowledge, fine-tuning is for behavior, prompting is for instructions. When the failure signal maps cleanly to one of those three categories, the right approach is obvious. When it does not map cleanly, the answer is almost always "you have not diagnosed the root cause yet" — and running an eval suite is the diagnostic.
If your team is stuck in the prompting-vs-RAG-vs-fine-tuning decision, a structured AI audit will give you a clear architecture recommendation with concrete data. We have audited enough LLM systems to know exactly what questions to ask and what failure patterns to look for. See our LLM integration service for what the implementation engagement looks like.