The model worked perfectly in development. In production, with 40 concurrent users, latency tripled, GPU memory saturated, and the inference bill came in at 8x the initial estimate. The model was not the problem. The infrastructure was. This is the gap between a demo and a production system, and it catches most teams off-guard.
Tutorials show you how to run a model in three lines of code. They do not show you how to run it reliably at scale, with real users, controlled costs, and 99% uptime, for months. This article covers what that actually requires: the self-host vs API decision, inference engine selection across vLLM, TGI, TensorRT-LLM, llama.cpp, and Triton, batching mechanics (continuous batching, PagedAttention), quantization at serving time (AWQ, GPTQ, INT8, FP8), GPU selection with real hardware numbers, Kubernetes deployment patterns, autoscaling for LLM workloads, observability, and actual cost benchmarks in dollars per million tokens.
This is not a tutorial. It is the engineering guide we wish had existed the first time we deployed an LLM for a production use case. For context on whether you should self-host at all versus defaulting to API, start with the build vs. buy decision framework before committing to infrastructure.
Why LLM serving is fundamentally different from serving a web API
If you have shipped standard API services before, your mental model needs updating before you touch an LLM deployment. The operational differences are not incremental — they are structural.
GPU is non-negotiable for interactive use. A 7B parameter model in FP16 occupies 14GB of VRAM. Running it on CPU gives you 2–10 tokens/second, which means a typical 200-token response takes anywhere from 20 seconds to well over a minute. That is not a latency budget any interactive application can tolerate. For async batch workloads (overnight document processing, batch classification), CPU is viable. For anything real-time, you need a GPU.
Latency is an order of magnitude higher than you expect. A well-optimized REST endpoint returns in 5–50ms. An LLM inference call for a 200-token response at 100 tokens/second takes 2 seconds before any network overhead. At 50 tokens/second, that is 4 seconds. Your P95 will be significantly worse under concurrent load. Plan for this in your UX — streaming output is almost always the right answer, because users tolerate waiting for text that appears to be typing rather than a blank screen followed by a sudden block of output.
Memory consumption is request-dependent, not constant. Every active request holds a KV cache in GPU VRAM proportional to its context length. A single user with a 32K token context can consume as much VRAM as 16 users with 2K token contexts. This makes capacity planning non-trivial: your GPU that handles 30 concurrent short requests may saturate on 5 concurrent long ones. The traditional "requests per second" metric is insufficient — you need tokens in context as your primary capacity metric.
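To make the capacity math concrete, here is a back-of-the-envelope KV cache calculation. The dimensions are assumed Llama-3-8B-like values (32 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache) used purely for illustration; substitute your model's actual configuration and measure before sizing hardware.

```python
# Rough KV cache sizing. The model dimensions below are illustrative
# assumptions (Llama-3-8B-like), not measured values.

def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """VRAM held by one request's KV cache (keys and values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens

GiB = 1024 ** 3
print(f"2K-token context:  {kv_cache_bytes(2_048) / GiB:.2f} GiB")   # ~0.25 GiB
print(f"32K-token context: {kv_cache_bytes(32_768) / GiB:.2f} GiB")  # ~4 GiB

# With ~40 GiB of VRAM left for cache after weights, the same GPU holds
# roughly 160 short contexts but only ~10 long ones.
kv_budget = 40 * GiB
print(f"Concurrent 2K contexts:  {kv_budget // kv_cache_bytes(2_048)}")
print(f"Concurrent 32K contexts: {kv_budget // kv_cache_bytes(32_768)}")
```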
You pay per token, not per request. A malformed prompt that sends 4,000 tokens of context when 400 would suffice multiplies your inference cost by 10x on that call. At scale, prompt engineering is cost engineering. This applies to API usage and, more subtly, to self-hosted deployments where inefficient prompting wastes GPU cycles that could serve other requests.
Horizontal scaling requires a GPU per replica. Adding a replica to a CPU-based microservice costs $20–50/month on most platforms. Adding an LLM serving replica costs the price of a GPU: $800–3,000/month for cloud GPU instances, or $8,000–40,000 in hardware capex for on-premise. This changes the economics of redundancy, geographic distribution, and failover in ways that matter at the architecture stage, not the ops stage.
Lesson learned
On a client deployment, 10 sales reps hitting the LLM endpoint simultaneously at 9am on Monday morning saturated an A100 that had seemed massively over-provisioned during development testing. Peak concurrent requests, not average throughput, is what sizes your GPU. Profile your request arrival pattern before you pick your instance class.
Self-host vs API: making the right decision for your workload
The decision is primarily financial and operational, not technical. Both options produce good results when executed correctly. The question is which one makes sense at your volume and with your team's capabilities.
When the API is the right answer
Proprietary APIs (OpenAI, Anthropic, Mistral, Google) are the right starting point for most teams. Zero infrastructure to manage, access to the highest-capability models available, automatic updates, and usage-based pricing that scales to zero when idle. The unit cost feels high per token but the total cost of ownership is low when you account for the absent engineering overhead.
The API makes sense when: you are in prototyping or early production, volume is below roughly 50M tokens/month, your data can legally transit the provider's infrastructure, and you do not need model customization. If your use case is primarily prompting or RAG rather than fine-tuning, the API handles both cleanly.
When self-hosting pays off
Self-hosting an open-weight model (Llama 3, Mistral, Qwen, Phi) becomes economically rational at roughly 50–200M tokens per month. The range is wide because it depends on which API you are replacing, which open-weight model covers your quality bar, and whether you are renting GPU or amortizing owned hardware.
The break-even math is straightforward. At $3/M tokens (a typical midrange API price for a capable model), 100M tokens/month costs $300. A single A100 80GB cloud instance runs $1,500–2,500/month depending on provider and commitment. At $3/M tokens you need 500–833M tokens/month to justify that GPU — but if you are using a $0.50/M token smaller model from a provider like Mistral's API, the break-even is much higher and self-hosting is harder to justify.
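The same arithmetic as a short script, using only the figures quoted above. It deliberately ignores the engineering overhead discussed in the cost section later, which pushes the real break-even further out.

```python
# Break-even sketch: fixed GPU cost vs pay-per-token API, using the figures
# quoted in the text (GPU at $1,500-2,500/month; API at $3.00/M or $0.50/M).

def break_even_m_tokens(gpu_monthly_usd: float, api_price_per_m_usd: float) -> float:
    """Monthly volume (in millions of tokens) where GPU and API cost the same."""
    return gpu_monthly_usd / api_price_per_m_usd

for gpu_monthly in (1_500, 2_500):
    for api_price in (3.00, 0.50):
        volume = break_even_m_tokens(gpu_monthly, api_price)
        print(f"GPU ${gpu_monthly}/mo vs API ${api_price:.2f}/M tokens: "
              f"break-even at ~{volume:,.0f}M tokens/month")

# GPU $1,500/mo vs $3.00/M  -> ~500M tokens/month
# GPU $2,500/mo vs $3.00/M  -> ~833M tokens/month
# GPU $1,500/mo vs $0.50/M  -> ~3,000M tokens/month (hard to justify self-hosting)
```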
Self-hosting makes unambiguous sense when: data sovereignty requirements prohibit sending prompts to US-based infrastructure, volume exceeds 200M tokens/month on a midrange model, you need fine-tuned model variants (see our LoRA/QLoRA fine-tuning guide for the serving implications of multi-adapter deployments), or latency requirements demand co-located inference.
The hybrid architecture
In practice, the most cost-efficient deployments we have built use a tiered approach: a self-hosted smaller model handles 80–90% of requests (routine queries, extraction, classification), and an API call to a frontier model handles the complex cases requiring deep reasoning. The smaller model runs 24/7 on dedicated GPU. The frontier API gets called on-demand. This architecture routinely cuts total inference cost by 60–70% compared to routing everything through a frontier API, while preserving quality for the cases that actually need it.
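A minimal sketch of the tiered routing idea, assuming both tiers are exposed through OpenAI-compatible endpoints (which vLLM provides, as discussed below). The model names, the internal endpoint, and the complexity heuristic are placeholders; in production the routing decision usually comes from task type or a lightweight classifier rather than a string check.

```python
# Tiered routing sketch: a self-hosted small model for routine requests, a
# frontier API for the hard ones. Names, endpoint, and heuristic are placeholders.
from openai import OpenAI

local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")
frontier = OpenAI()  # reads the provider API key from the environment

def needs_frontier(prompt: str) -> bool:
    # Placeholder heuristic: long or open-ended prompts go to the frontier model.
    return len(prompt) > 4_000 or "explain your reasoning" in prompt.lower()

def complete(prompt: str) -> str:
    client, model = (
        (frontier, "gpt-4o") if needs_frontier(prompt)
        else (local, "mistral-7b")  # the self-hosted vLLM deployment
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Measure what fraction of real traffic the small model handles acceptably before trusting the 80–90% split; the routing rule is the part worth iterating on.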
Inference engines compared: vLLM, TGI, TensorRT-LLM, Triton, llama.cpp, Ollama
The inference engine is the software layer that loads your model weights, manages GPU memory, handles concurrent requests, and exposes an API. Choosing the wrong engine for your workload is one of the most common and most expensive mistakes in LLM deployments. Here is how the main options compare.
LLM inference engine comparison, 2026
Based on production deployments and public benchmarks. Unless a row notes otherwise, throughput figures are for Llama 3 8B AWQ 4-bit on an A100 80GB with continuous batching.
| Engine | Throughput (tok/s) | Quantization | OpenAI API | Best for |
|---|---|---|---|---|
| vLLM | 1,500–2,500 | AWQ, GPTQ, FP8, INT8 | Native | High-concurrency production |
| TGI (HuggingFace) | 1,100–1,800 | GPTQ, AWQ, EETQ | Native | HuggingFace ecosystem |
| TensorRT-LLM | 2,500–4,000+ | FP8, INT8, INT4 | Via Triton | Max throughput, NVIDIA hardware |
| Triton Inference Server | Variable (backend-dependent) | FP8, INT8 (via backends) | No (native protocol) | Multi-model, ML pipelines |
| llama.cpp | 50–150 (4090, single-stream) | GGUF (2–8 bit) | Compatible | CPU/consumer GPU, edge |
| Ollama | Similar to llama.cpp | GGUF | Compatible | Local development only |
vLLM: the production default
vLLM is the engine we use on the majority of deployments. Its decisive advantage is PagedAttention — a KV-cache memory management system inspired by OS virtual memory paging. Instead of pre-allocating a contiguous VRAM block per request at maximum sequence length, PagedAttention allocates memory in fixed pages on demand, eliminating fragmentation and wasted reserved-but-unused cache.
The practical consequence is that vLLM can serve 2–4x more concurrent requests on the same GPU compared to naive implementations. On a single A100 80GB running Llama 3 8B AWQ 4-bit, you can sustain 30–50 concurrent requests comfortably. The same model on a naïve HuggingFace inference loop saturates at 5–8 concurrent requests. Combined with continuous batching (new requests join in-flight batches without waiting for current ones to finish), vLLM is the right choice for any deployment where concurrency matters.
vLLM exposes an OpenAI-compatible API, which means replacing an OpenAI API call with a self-hosted model call is a one-line config change. It supports tensor parallelism across multiple GPUs for models too large for a single card. For teams running fine-tuned LoRA adapters, vLLM's multi-LoRA serving feature allows switching between adapters at inference time without reloading the base model — covered in detail in our LoRA/QLoRA guide.
TensorRT-LLM: maximum throughput, maximum friction
NVIDIA's TensorRT-LLM consistently delivers the highest throughput numbers on NVIDIA hardware — 20–50% higher than vLLM on the same GPU in many benchmarks. It achieves this through deep kernel fusion, FP8 quantization optimized for H100 tensor cores, and ahead-of-time model compilation. The trade-off is significant deployment friction: you compile an engine file for each model-GPU-quantization combination. Engine compilation can take 30–90 minutes. Changing your deployment configuration means recompiling. There is no dynamic model loading.
TensorRT-LLM is the right choice when you have a fixed model, fixed hardware, and throughput requirements that vLLM cannot meet. At scale (tens of thousands of requests per hour on H100 clusters), the throughput gains justify the operational overhead. For most production deployments under that scale, the vLLM flexibility-to-throughput trade-off wins.
TGI: the HuggingFace path
Text Generation Inference from HuggingFace is the natural choice if your model workflow is already centered on the HuggingFace Hub. Deployment is a single Docker command and model loading is seamless. TGI implements continuous batching and flash attention and supports GPTQ and AWQ quantization. On throughput, it trails vLLM by 20–40% in most head-to-head comparisons, but the deployment simplicity is a real advantage for teams without dedicated MLOps capacity.
llama.cpp and Ollama: development tools, not production tools
llama.cpp is an impressive piece of engineering that runs quantized GGUF models efficiently on CPU and consumer GPUs. On a single RTX 4090, Llama 3 8B Q4_K_M delivers 50–150 tokens/second in single-stream mode — fast enough for an interactive internal tool with one or two concurrent users. The moment you hit three or more concurrent users, the absence of proper batching and the single-thread bottlenecks become visible in latency. Ollama wraps llama.cpp with a friendlier interface but inherits the same architectural constraints.
Use Ollama for local development. Use llama.cpp as the backend for edge deployments where GPU access is not available. Do not use either for multi-user production workloads.
Continuous batching and PagedAttention: how modern serving works
Understanding these two mechanisms is not optional for anyone making infrastructure decisions. They are the reason vLLM and TensorRT-LLM dramatically outperform naive serving implementations, and they explain constraints you will encounter in production.
Continuous batching
Static batching waits until a batch of requests is assembled, processes them together, and returns all results when the last one finishes. If one request generates 500 tokens and another generates 20, the short one waits idle for the long one to finish. GPU utilization is poor and P95 latency is dominated by the slowest request in the batch.
Continuous batching (also called iteration-level scheduling) processes token generation one iteration at a time. Each iteration, the scheduler can add new requests to the running batch or remove completed ones. A short request that finishes in 20 tokens is immediately replaced by a waiting request without waiting for longer generations to complete. GPU utilization increases substantially and P95 latency improves because short requests are not blocked by long ones. vLLM, TGI, and TensorRT-LLM all implement continuous batching — it is table stakes for production serving.
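A toy simulation makes the difference visible. This is not a benchmark: generation lengths are invented, and prefill cost, memory limits, and scheduling overhead are ignored. It only illustrates why iteration-level scheduling drains the same workload in far fewer GPU iterations than static batching.

```python
# Toy comparison of static vs continuous (iteration-level) batching.
# Generation lengths are invented; prefill and memory limits are ignored.
import random

random.seed(0)
lengths = [random.choice([20, 60, 120, 500]) for _ in range(64)]  # tokens to generate
SLOTS = 8  # concurrent sequence slots on the GPU

def static_batching(lengths, slots):
    # Each batch runs until its longest request finishes; short requests idle.
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching(lengths, slots):
    # A finished request is replaced by a waiting one at the next iteration.
    queue = list(lengths)
    active = [queue.pop() for _ in range(min(slots, len(queue)))]
    iterations = 0
    while active:
        iterations += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
        while queue and len(active) < slots:
            active.append(queue.pop())
    return iterations

print("static batching:    ", static_batching(lengths, SLOTS), "iterations")
print("continuous batching:", continuous_batching(lengths, SLOTS), "iterations")
```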
PagedAttention
The KV cache (key-value cache) stores the intermediate attention computation for each token in the context window. For autoregressive generation, you need to keep the KV cache for every token processed so far — the cache grows linearly with context length. In naive implementations, you pre-allocate VRAM for the maximum possible sequence length when a request starts, so a request that actually generates 128 tokens reserves the same VRAM as one that generates 1,024. This over-reservation and fragmentation wastes 30–60% of KV cache memory in typical workloads.
PagedAttention divides the KV cache into fixed-size blocks (pages), allocates them on demand as generation proceeds, and reclaims them immediately when a request completes. The result: near-zero waste, higher effective VRAM utilization, and the ability to serve significantly more concurrent requests. On workloads with highly variable response lengths — which is almost every production workload — PagedAttention is the primary driver of vLLM's throughput advantage over implementations without it.
Quantization at serving time: AWQ, GPTQ, INT8, FP8
Quantization compresses model weights from their native precision (BF16 or FP16) to lower bit-widths. The payoff is direct: lower VRAM requirement, faster memory-bandwidth-bound inference, and the ability to run larger models on cheaper hardware. The question is which format to use and where the quality cliff is.
The formats you need to know
AWQ (Activation-Aware Weight Quantization) is our default for production serving with vLLM. It quantizes weights to 4 bits while preserving salient weights identified by activation magnitude analysis. Quality degradation versus BF16 is below 2% on general benchmarks, and on narrow task-specific benchmarks (extraction, classification, structured generation), the gap is often undetectable. VRAM use drops 60–70% compared to BF16. Pre-quantized AWQ models for all major open-weight models are available on the HuggingFace Hub.
GPTQ is the original 4-bit quantization format, slightly inferior to AWQ in quality preservation but equally well supported. If you need to quantize your own fine-tuned model and do not want to run AWQ calibration, GPTQ is simpler to apply. Both vLLM and TGI handle it natively.
INT8 and FP8 are 8-bit formats that preserve more quality than 4-bit at the cost of less compression. FP8 is particularly relevant for H100 deployments — the H100's tensor cores have native FP8 support that makes FP8 TensorRT-LLM deployments faster than FP16 while retaining quality close to BF16. If you are on H100 hardware and running TensorRT-LLM, FP8 is the right precision choice.
Concrete numbers from production
For Llama 3 8B on a single A100 80GB:
- BF16 (no quantization): 16GB VRAM, ~40–60 tokens/second single-stream, ~1,500–2,000 tokens/second aggregate with vLLM batching
- AWQ 4-bit: 5.5GB VRAM, ~55–70 tokens/second single-stream, ~1,800–2,500 tokens/second aggregate (smaller model, same memory bandwidth)
- GGUF Q4_K_M (llama.cpp, RTX 4090): ~4.5GB VRAM, 50–150 tokens/second single-stream
For Llama 3 70B, the quantization impact is even more significant. BF16 requires 140GB of VRAM — two A100 80GB cards with tensor parallelism. AWQ 4-bit fits in 40GB, running on a single A100 80GB with room for a healthy KV cache. The hardware cost difference is roughly 2x. On H100 SXM5, Llama 3 70B AWQ 4-bit with vLLM delivers enough aggregate throughput to serve dozens of concurrent users comfortably.
Lesson learned
We have never encountered a production use case — extraction, classification, Q&A over documents, structured generation — where AWQ 4-bit quality was insufficient. The anxiety about quantization degradation is almost always disproportionate to the actual impact. Run your task-specific eval on AWQ 4-bit before defaulting to BF16. You will almost certainly keep the quantized version.
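A quick way to run that check is vLLM's offline Python API, which loads a quantized checkpoint directly without standing up a server. The model ID below is an example pre-quantized AWQ repository and the two prompts are placeholders; point the loop at your own golden test set.

```python
# Smoke-test an AWQ checkpoint with vLLM's offline API before deploying it.
# The model ID is an example pre-quantized repo; swap in the one you use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.0, max_tokens=200)
prompts = [  # placeholders: replace with your task-specific eval set
    "Extract the invoice number from: 'Invoice INV-2024-0042, due in 30 days.'",
    "Classify the sentiment of: 'The delivery was late and the box was damaged.'",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```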
GPU selection: H100 vs A100 vs L40S vs consumer hardware
GPU selection has a direct multiplier effect on your monthly bill. The right choice depends on your model size, concurrency requirements, and whether you are renting or buying.
The current GPU landscape for LLM inference
NVIDIA H100 SXM5 (80GB). The performance leader. 3.35 TB/s memory bandwidth (versus 2 TB/s for A100), native FP8 tensor cores, NVLink for multi-GPU configurations. Aggregate throughput for Llama 3 8B AWQ with vLLM: 3,000–5,000 tokens/second. Cloud cost: $3.50–5.50/hour. The right choice for large-scale deployments, 70B+ models, or latency-sensitive products where GPU amortization makes the premium worthwhile.
NVIDIA A100 80GB. The production workhorse. Mature ecosystem, excellent driver and framework support, 80GB VRAM fits most 70B models in 4-bit quantization. Aggregate throughput for Llama 3 8B AWQ: 1,500–2,500 tokens/second. Cloud cost: $1.80–3.00/hour. For the majority of production deployments, this is still the best balance of capability, cost, and availability.
NVIDIA L40S (48GB). The most interesting price-performance option in 2026 for inference workloads. 48GB VRAM comfortably handles most 7B–34B models in any quantization, and fits 70B AWQ 4-bit (~35GB), albeit with limited headroom for KV cache. Memory bandwidth is well below the A100's, which caps single-stream decode speed, but aggregate throughput with continuous batching is strong for the 7B–34B models it suits best. Cloud cost: $1.20–1.80/hour. Significantly cheaper per GPU than A100 while covering the vast majority of production use cases. Our default recommendation for new deployments that do not need 70B in FP16 or multi-GPU tensor parallelism.
Consumer GPUs (RTX 4090, RTX 3090). 24GB VRAM. Running Llama 3 8B AWQ 4-bit leaves ample KV cache room; running 70B requires 2-bit quantization with noticeable quality loss. The 4090 delivers 50–150 tokens/second single-stream with llama.cpp or 400–800 tokens/second aggregate with vLLM for well-fitting models. No ECC memory (silent data corruption risk), no multi-instance GPU (MIG) support, no enterprise support SLA. Viable for low-traffic internal tools where a single user or a handful of users is the ceiling. Not appropriate for anything facing more than a few simultaneous requests.
On-premise vs cloud GPU
On-premise hardware (a server with one or two A100s) has a payback period of 12–18 months versus cloud rental at comparable utilization rates. It makes sense when: you have the operational team to manage hardware, your workload is predictable enough to justify fixed capacity, and data sovereignty requires infrastructure you physically control. For most teams, cloud GPU is the right starting point — validate your usage patterns for 3–6 months, then evaluate on-premise if the math supports it.
Kubernetes deployment patterns for LLM workloads
Running LLM inference on Kubernetes is straightforward in theory and full of subtle issues in practice. The core abstractions — pods, deployments, services — apply, but several LLM-specific patterns are worth knowing before you burn time discovering them.
The vLLM Docker Compose baseline
Before Kubernetes, start here. A Docker Compose configuration gives you a reproducible single-node deployment that is easier to debug and iterate on. Here is the configuration we use as the starting point for new deployments:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --quantization awq
      --max-model-len 8192
      --max-num-seqs 64
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
      --served-model-name mistral-7b
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
```
Key parameters to understand: --max-model-len caps the maximum context window and directly bounds KV cache memory usage. --max-num-seqs limits concurrent sequences and prevents GPU OOM under burst load. --gpu-memory-utilization 0.90 leaves 10% VRAM headroom for the CUDA runtime and prevents fragmentation-related OOM. --enable-prefix-caching caches common prompt prefixes (system prompts, few-shot examples) across requests — a significant throughput improvement for workloads with shared prompt structure.
Kubernetes deployment considerations
When moving to Kubernetes, the main operational differences from standard workloads are:
Resource requests and limits. Always set nvidia.com/gpu: 1 in resource limits. Without explicit GPU resource claims, the Kubernetes scheduler can place multiple pods on the same node without accounting for GPU memory, leading to OOM failures at startup or under load.
Readiness probes need patience. A pod that passes readiness immediately after container start will start receiving traffic before the model has finished loading into VRAM. Model loading takes 2–8 minutes depending on model size and storage speed. Set your readiness probe initialDelaySeconds to at least 120–300 seconds, and probe the /health endpoint rather than just a TCP connection check.
Pod disruption budgets. Node maintenance events (kernel updates, driver updates) cause pod eviction. With LLM pods that take minutes to restart, an eviction during a traffic peak is a meaningful incident. Set a PodDisruptionBudget with minAvailable: 1 if you have multiple replicas, and plan maintenance windows accordingly.
Storage class for model weights. Pulling 14GB of model weights from the HuggingFace Hub on every pod start adds 5–10 minutes to cold start time (depending on network and storage). Use a persistent volume backed by fast block storage and pre-cache weights there, or use a ReadOnlyMany NFS volume shared across pods. Model weight caching is one of the highest-ROI cold-start optimizations available.
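One way to do the pre-caching is a one-off job that downloads the weights onto the shared volume, so serving pods read from local disk instead of the Hub. The local path below is an assumption; the repo ID matches the Compose example above.

```python
# One-off job: pre-download model weights onto the shared volume so serving
# pods skip the Hub download at startup. The local path is an assumption.
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="/models/mistral-7b-instruct-v0.3",  # PVC or NFS mount shared by pods
    allow_patterns=["*.safetensors", "*.json", "tokenizer*"],  # skip optional formats
    token=os.environ.get("HF_TOKEN"),  # same token the Compose file passes
)

# Then point the serving container at the local path instead of the Hub ID:
#   --model /models/mistral-7b-instruct-v0.3
```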
Autoscaling LLM workloads: why standard HPA fails you
Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU, memory, or custom metrics. For standard web services, this works well — new pods start in seconds and are ready to serve in under a minute. For LLM serving, HPA creates a false sense of security that will fail you when you need it most.
The cold start problem
A new LLM serving pod has to: start the container, initialize CUDA, load model weights into CPU RAM, transfer weights to GPU VRAM, and warm up the inference engine. End-to-end, this takes 5–15 minutes depending on model size, storage speed, and GPU initialization time. A traffic spike that triggers scale-out at 9:00am will not have additional capacity available until 9:10–9:15am. For most traffic spikes, this is too late to matter.
Standard HPA reaction time plus LLM cold start time makes reactive scaling essentially useless for handling sudden traffic increases. You need different patterns.
Pattern 1: predictive scaling
Analyze your traffic patterns over 2–4 weeks. If 80% of your request volume arrives between 8am and 6pm on weekdays with a peak at 9–11am, schedule scale-out ahead of that peak using a Kubernetes CronJob or KEDA's cron scaler. Pre-provision the second replica at 8:45am so it is ready by 9:00am. Scale down at 7pm. For predictable workloads, this eliminates most cold start pain with minimal wasted GPU hours.
Pattern 2: warm standby
Keep a minimum replica count of 2 at all times. The second replica is immediately available to absorb traffic spikes without any cold start. Cost: one continuously running GPU instance. This is the right choice when: traffic is unpredictable, you cannot afford degraded service during spikes, or the cost of the warm standby is small relative to the cost of service degradation.
Pattern 3: request queuing with API fallback
Implement a request queue (Redis Streams, RabbitMQ, or a simple async queue in your API gateway) in front of the LLM serving layer. When queue depth exceeds a threshold, overflow requests are routed to an API provider rather than rejected. This architecture is the most resilient: it handles sudden spikes without over-provisioning, prevents request drops, and keeps users served even during infrastructure events. The cost is the API tokens consumed during overflow, which is typically 5% or less of total traffic in a well-dimensioned deployment.
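A minimal sketch of the overflow decision at the gateway, assuming a Redis Stream as the request queue. The stream name, threshold, and the API-provider stub are placeholders; in practice the fallback is an OpenAI-compatible call to your chosen provider and the threshold is tuned against your queue-time SLA.

```python
# Overflow routing sketch: queue for the self-hosted GPU unless the backlog
# is too deep, in which case the request goes straight to a managed API.
import json
import redis

r = redis.Redis(host="redis", port=6379)
QUEUE = "llm:requests"          # placeholder stream name
OVERFLOW_THRESHOLD = 32         # placeholder: tune against your queue-time SLA

def call_api_provider(payload: dict) -> str:
    # Placeholder: forward the request to a managed, OpenAI-compatible API.
    return "served-by-api-fallback"

def route(payload: dict) -> str:
    if r.xlen(QUEUE) > OVERFLOW_THRESHOLD:           # backlog too deep for the GPU
        return call_api_provider(payload)
    r.xadd(QUEUE, {"payload": json.dumps(payload)})  # normal path: self-hosted worker
    return "queued-for-self-hosted"
```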
Lesson learned
KEDA (Kubernetes Event Driven Autoscaler) with a custom queue-depth metric is significantly more effective than standard HPA for LLM workloads. You scale based on the backlog in your request queue, not on current GPU utilization. This gives you a meaningful signal earlier and allows you to start new pods before the GPU is already overwhelmed. Pair it with a 5-minute scale-down cooldown to avoid thrashing.
Observability: what to instrument and why it matters
An LLM in production fails silently in ways that no traditional monitoring catches. There is no HTTP 500 when the model starts hallucinating. There is no CPU spike when response quality degrades. Without purpose-built observability, you discover production quality issues through user complaints weeks after they started.
Infrastructure metrics
The foundation: GPU utilization, VRAM used and free, memory bandwidth saturation, GPU temperature, and request queue depth. Use NVIDIA DCGM Exporter to expose these as Prometheus metrics. Alert on GPU temperature above 80°C (thermal throttling reduces throughput by 20–30%), VRAM utilization above 90% (OOM risk), and queue depth above your defined SLA threshold.
Inference metrics
vLLM exposes a Prometheus metrics endpoint at /metrics out of the box. The metrics you should be tracking:
- Time to First Token (TTFT): The most important latency metric for streaming applications. Users perceive TTFT as the application "thinking." Target TTFT under 500ms for interactive use. TTFT above 2 seconds causes observable frustration even when the subsequent generation is fast.
- Inter-Token Latency (ITL): The time between successive tokens in a streaming response. Consistent ITL means smooth streaming. Spikes in ITL indicate KV cache pressure or scheduler contention. (A client-side sketch for measuring TTFT and ITL follows this list.)
- Tokens per second (aggregate and per-request): Your throughput baseline. A dropping tokens/second trend is an early warning of approaching capacity limits.
- Latency P50, P95, P99: The median is not your user experience. P95 and P99 are where you find the requests that make users stop using your product. Track these separately by request type if you have meaningfully different workload classes.
- Request queue time: The time a request spends waiting before token generation starts. Rising queue time before VRAM or GPU utilization saturates indicates a scheduler configuration issue.
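vLLM's /metrics endpoint exposes TTFT and inter-token latency as histograms already; a client-side measurement is still worth having as an end-to-end check that includes gateway and network overhead. The sketch below assumes the Compose deployment shown earlier (same endpoint and served model name).

```python
# Client-side TTFT and inter-token latency check against the vLLM
# OpenAI-compatible endpoint (endpoint and model name from the Compose example).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
chunk_times = []

stream = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Summarize continuous batching in two sentences."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_times.append(time.perf_counter())

ttft = chunk_times[0] - start
# ITL measured per streamed chunk; vLLM streams roughly one token per chunk.
itl = (chunk_times[-1] - chunk_times[0]) / max(len(chunk_times) - 1, 1)
print(f"TTFT: {ttft * 1000:.0f} ms | mean inter-token latency: {itl * 1000:.1f} ms")
```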
Quality metrics
This is the layer that most teams skip, and it is the one that matters for your actual product. For structured outputs, track format compliance rate (does the model return valid JSON when asked). For RAG-augmented applications, track faithfulness scores using an LLM-as-judge pipeline — see our article on building custom LLM evaluators for how to do this efficiently. For any interactive application, instrument explicit user feedback (thumbs up/down on responses) and track the ratio.
Quality drift is real. Model serving software updates, changes to your prompt templates, or shifts in input distribution can all degrade output quality without triggering any infrastructure alert. A weekly automated eval run against a golden test set is the minimum viable quality monitoring for a production LLM. Langfuse (open source) and LangSmith are the tools we use for LLM observability — both provide request tracing, quality scoring, and dashboards with minimal integration overhead.
Cost metrics
For API usage, cost tracking is straightforward: tokens consumed times the provider's price. For self-hosted deployments, the equivalent signal is GPU utilization — idle GPU is wasted money. Track cost per request (GPU hour rate divided by requests served per hour) and set budget alerts. For workloads with mixed complexity, track cost separately by request type: a complex multi-turn conversation costs 10x a simple classification call, and cost anomalies at the request-type level often reveal prompt engineering problems before they appear on the GPU bill.
Real cost benchmarks: self-host vs API in 2026
The numbers that follow are based on actual deployments and current provider pricing as of May 2026. They assume typical production request distributions (average 500 input tokens, 200 output tokens per request). For the broader provider trade-offs at the model layer (not just $/1M tokens but tool use, structured outputs, context window, EU deployment), see Mistral vs OpenAI vs Anthropic. For the full RAG architecture built on top of a self-hosted stack, see self-hosted RAG architecture.
LLM inference cost comparison, May 2026
Assuming 500 input + 200 output tokens per request. Cloud GPU pricing at on-demand rates; reserved instances run 30–40% lower.
| Option | $/1M tokens (est.) | Monthly (10K req/day) | Break-even vs API |
|---|---|---|---|
| GPT-4o (OpenAI API) | $2.50–10.00 | $500–2,100 | N/A (reference) |
| Claude Sonnet (Anthropic API) | $3.00–15.00 | $630–3,150 | N/A |
| Mistral Large (Mistral API) | $2.00–6.00 | $420–1,260 | N/A |
| Llama 3 8B — vLLM on L40S (cloud) | $0.15–0.35 | $900–1,500 (GPU fixed) | ~30M–80M tokens/month |
| Llama 3 70B — vLLM on A100 (cloud) | $0.40–0.80 | $1,500–2,500 (GPU fixed) | ~100M–200M tokens/month |
| Llama 3 70B — vLLM on H100 (cloud) | $0.25–0.55 | $2,500–4,000 (GPU fixed) | ~150M–300M tokens/month |
| Llama 3 8B — on-premise RTX 4090 | ~$0.05 (electricity only) | $30–80 (electricity) | $4K–6K capex, ~6–12 months |
The numbers make the trade-offs concrete. Kept busy, an L40S running Llama 3 8B works out to $0.15–0.35/M tokens — roughly 10–20x cheaper per token than a frontier API. But the GPU cost of $900–1,500/month is fixed whether the card is busy or idle, so that effective rate only materializes at sustained high volume. At 1,000 requests per day, the API wins comfortably.
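Where the per-token figures come from, and why utilization dominates them: a short calculation using the L40S numbers quoted in this article (on-demand hourly rate, aggregate throughput). The output is an estimate under those assumptions, not a price quote.

```python
# Effective $/M tokens for a fixed-cost GPU, as a function of utilization.
# Inputs are the L40S figures used in this article: $1.20-1.80/hour on demand,
# 800-1,200 tokens/second aggregate throughput.

def cost_per_m_tokens(gpu_hourly_usd: float, tokens_per_second: float,
                      utilization: float) -> float:
    tokens_per_hour = tokens_per_second * 3_600 * utilization
    return gpu_hourly_usd / (tokens_per_hour / 1_000_000)

for utilization in (1.0, 0.5, 0.1):
    best = cost_per_m_tokens(1.20, 1_200, utilization)
    worst = cost_per_m_tokens(1.80, 800, utilization)
    print(f"{utilization:>4.0%} utilization: ${best:.2f}-${worst:.2f} per 1M tokens")

# At full utilization the effective rate is a few tens of cents per million
# tokens; at 10% utilization it lands in the same range as API pricing.
```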
One number the table omits: the engineering labor cost of operating a self-hosted deployment. A maintained vLLM deployment with proper monitoring requires 1–2 engineer-days per month for updates, incident response, and configuration tuning. At typical engineering rates, this adds $500–2,000/month to the true cost. Include this in your ROI calculation. For context on the full investment picture, see our notes on LLM integration engagements and what the ongoing operational commitment looks like in practice.
Lesson learned
The teams that are most surprised by their self-hosting costs are the ones who modeled GPU price but not engineering overhead. The GPU is the cost that scales with the capacity you provision. The engineering time is the fixed overhead that does not. Both belong in the numerator when you calculate cost per useful output.
Production architecture: putting it together
Here is the serving architecture we deploy for production workloads that need reliability, cost efficiency, and observability from day one:
- Model selection: smallest open-weight model that passes your task-specific quality eval. For most extraction, classification, and Q&A workloads, a 7B–14B model quantized to AWQ 4-bit is sufficient. For complex reasoning or long-form generation, 70B AWQ.
- Serving engine: vLLM with continuous batching, PagedAttention, and prefix caching enabled. Exposed as an OpenAI-compatible API so application code is inference-engine-agnostic.
- GPU hardware: L40S 48GB for 7B–34B workloads, A100 80GB for 70B or high-concurrency requirements. Reserved instances at cloud provider to reduce cost by 30–40%.
- Request layer: FastAPI or similar in front of vLLM handling authentication, rate limiting, prompt templating, and streaming. Redis queue for overflow management.
- Kubernetes deployment: 2 replicas minimum, KEDA autoscaling on queue depth, predictive scale-out for known peak windows. PodDisruptionBudget protecting availability during maintenance.
- Observability stack: Prometheus scraping vLLM and DCGM metrics, Grafana dashboards for infrastructure and inference KPIs, Langfuse for request tracing and quality scoring, PagerDuty alerts on TTFT P95 and error rate thresholds.
- API fallback: automatic overflow to Mistral API or similar when queue depth exceeds the SLA threshold. Transparent to the calling application.
This stack goes from zero to production in 2–4 weeks for a well-defined use case. If you are evaluating whether to build on RAG, pure prompting, or a fine-tuned model, resolve that architectural decision first — our fine-tuning vs RAG vs prompting comparison covers the trade-offs, and the structured outputs guide is relevant if your downstream system requires deterministic output formats.
FAQ: deploying LLMs to production
At what volume does self-hosting become cheaper than the API?
The break-even is roughly 50–200 million tokens per month, which translates to approximately 5,000–15,000 requests per day at typical prompt and completion lengths. Below that threshold, API pricing (even at $2–5/M tokens) beats the fixed cost of a cloud GPU. Above it, a single A100 80GB instance at $1,500–2,500/month amortizes quickly. The calculation also needs to include the engineering cost of operating the serving stack: budget 1–2 engineer-days per month for a maintained vLLM deployment.
What throughput can I expect from a self-hosted open-weight model?
For Llama 3 8B with vLLM and continuous batching, you can expect roughly 3,000–5,000 tokens/second aggregate throughput on an H100 80GB, 1,500–2,500 tokens/second on an A100 80GB, and 800–1,200 tokens/second on an L40S 48GB. For comparison, llama.cpp on a single RTX 4090 delivers 50–150 tokens/second in single-stream mode — fine for development, not for multi-user production. All figures assume AWQ 4-bit quantization; FP16 reduces throughput by roughly 30–40% for the same GPU.
How much quality do I lose with 4-bit quantization?
In practice, AWQ 4-bit quantization on models from 7B to 70B parameters produces quality degradation below 2–3% on standard benchmarks. For domain-specific tasks like extraction, classification, and structured generation, the difference is typically imperceptible. VRAM usage drops 60–75%, inference speed increases slightly, and a model that required two A100s now fits on one. Run your task-specific eval before assuming you need BF16 precision. You almost certainly do not.
Why doesn't standard autoscaling work for LLM serving?
Standard web services scale by spinning up additional stateless containers in 5–30 seconds. LLM serving pods take 5–15 minutes to become ready: container start, model weight loading (often 10–30GB), GPU VRAM transfer, and inference engine warmup. This makes reactive scale-out ineffective for traffic spikes. The correct patterns are predictive scaling based on traffic history, warm standby GPU instances, and request queuing with an API fallback during extreme peaks.
What is PagedAttention and why does it matter?
PagedAttention manages GPU KV-cache memory using virtual memory paging. Instead of pre-allocating a contiguous VRAM block for each request's maximum sequence length — which wastes memory when requests are shorter — PagedAttention allocates fixed-size pages on demand. The result is near-zero KV-cache waste, which translates to 2–4x more concurrent requests on the same GPU compared to naive implementations. For production workloads with variable-length inputs, it is the single most impactful optimization available at the serving layer.
Which GPU should I choose for LLM inference?
For 7B–34B models, the L40S 48GB is the best price-performance option at $1.20–1.80/hour cloud, and it can handle AWQ 4-bit Llama 3 70B. For 70B models in higher precision or aggressive throughput, the A100 80GB is the standard choice. The H100 SXM5 is the performance leader at roughly twice the A100's hourly cost — justified for large-scale deployments. Consumer GPUs (RTX 4090) work for single-stream internal tools but lack ECC memory and MIG support needed for reliable production use.
Further reading
- Fine-tuning vs RAG vs prompting — resolve the architectural decision before committing to inference infrastructure.
- LoRA and QLoRA: practical fine-tuning guide — if you are fine-tuning before deployment, the LoRA adapter serving section in vLLM is directly relevant.
- Production RAG failure modes — the 5 failure modes we keep seeing in RAG systems, with a strong focus on observability and cost.
- Structured outputs in production — constrained decoding, JSON mode, and reliability patterns for schema-compliant LLM output.
- Building custom LLM judges — the evaluation layer you need before you can trust your production quality metrics.
- Multi-agent orchestration compared — relevant when your LLM workload involves multi-step agent pipelines that change your serving architecture requirements.
- LLM integration service — Tensoria's end-to-end service for deploying production LLM systems, from model selection to serving infrastructure and monitoring.
- AI infrastructure audit — structured review of your current LLM deployment for cost, latency, and reliability gaps.
- vLLM official documentation — the primary reference for production serving configuration, quantization options, and multi-GPU setup.
- TensorRT-LLM on GitHub — NVIDIA's high-throughput inference library for maximum performance on NVIDIA hardware.
- Langfuse — open-source LLM observability for tracing, evaluation, and cost monitoring in production.
Talk to an engineer
Sizing your LLM infrastructure? We do this for production systems — 30 minutes to define the right architecture.
The decisions that actually matter
The model is the smallest part of a production LLM deployment. What determines whether the system runs reliably at reasonable cost is everything around it: the serving engine, quantization strategy, GPU selection, concurrency management, autoscaling architecture, and observability stack.
The practical checklist before you go to production:
- Do not default to the largest model. A 7B AWQ model that passes your task-specific eval is cheaper, faster, and easier to scale than a 70B model you chose because it felt safer.
- Use AWQ 4-bit quantization unless task-specific eval shows a material quality gap. You will be surprised how rarely there is one.
- Profile your peak concurrent load, not average load. Peak concurrent requests, not median throughput, determines whether your GPU saturates.
- Plan your cold start strategy before you ship. If you are on Kubernetes, reactive HPA will not protect you from burst traffic. Pick predictive scaling or warm standby before you go live.
- Instrument quality from day one. Tracking TTFT, P95 latency, and GPU utilization without any quality metric means you are flying blind on the dimension that matters most to your users.
If you are planning a production LLM deployment and want to pressure-test the architecture before committing to infrastructure, we run structured AI infrastructure audits that address exactly these decisions. See also our RAG systems service if your deployment includes a retrieval layer alongside the inference stack.