The default RAG stack in 2026 is still: OpenAI embeddings, a managed vector store, and GPT-4o or Claude for generation. That stack has low operational overhead, strong baseline performance, and zero GPU procurement headaches. For a large portion of use cases, it is the right call.
There is a different set of use cases where that default creates real problems: regulated industries where data residency is legally constrained, high-volume deployments where per-token API cost compounds to six figures annually, and IP-sensitive verticals where sending proprietary documents through a commercial inference endpoint is not a policy you can sign off on. For those situations, a fully self-hosted RAG architecture — open-weight LLM, self-managed vector store, on-premise inference — is not ideological but practical.
This guide covers the engineering decisions involved: which models to run, which inference stack to use, which vector databases hold up in production, how the cost model actually works, and where self-hosted deployments fail. If you want the foundations of RAG before going further, start with our RAG technical guide. If you already have a production RAG system and it is underperforming, read Production RAG: 5 failure modes we keep seeing first.
1. When self-hosting is actually justified
Self-hosting a RAG system means running your own inference servers, managing GPU capacity, operating your own vector database, and taking on the engineering overhead of a small internal ML platform team. That is a meaningful cost — not just in dollars, but in time, expertise, and operational risk. Do not do it unless you have a clear reason.
There are four situations where that trade-off holds up:
Compliance and data residency requirements
HIPAA. Sending PHI (protected health information) through a commercial LLM API requires a signed Business Associate Agreement. AWS, Azure, and GCP offer BAAs. Most LLM API providers do not, or their BAAs exclude model training data use in ways that create audit ambiguity. Healthcare organizations running clinical document retrieval — discharge summaries, clinical notes, lab results — typically cannot accept that ambiguity. On-premise inference eliminates the dependency.
GDPR and EU AI Act. Under GDPR, transferring personal data to US-based processors requires either an adequacy decision or Standard Contractual Clauses. The legal basis for transfers to US cloud providers has been challenged repeatedly. The EU AI Act, whose obligations phase in through 2026, adds transparency and traceability requirements that are structurally simpler to meet when you control the full inference chain. Organizations handling the personal data of EU residents in insurance, legal, or financial services increasingly require in-region processing.
Sovereign cloud customers. Defense primes, government contractors, and critical infrastructure operators often have contractual or regulatory requirements specifying that data processing must remain within a defined perimeter — on-premise, national cloud, or classified network. Commercial API endpoints are categorically out of scope for these environments.
IP and confidentiality sensitivity
A RAG system over your internal knowledge base is, by design, exposing your most sensitive documents to the inference pipeline. For most enterprises, the commercial API terms of major providers are clear enough: data is not used for training. But in practice, legal teams at IP-intensive companies — pharmaceutical R&D, semiconductor design, M&A advisory — may not be comfortable accepting those terms as sufficient protection. Self-hosting removes the question entirely.
Cost at scale
API pricing makes sense at low volume and for prototypes. It stops making sense once token throughput compounds. The break-even analysis is in the cost model section below. The short version: for most setups, self-hosting becomes cheaper than GPT-4o-class APIs somewhere between 500M and 1B tokens per month. For embedding specifically, the crossover with OpenAI's cheapest model happens far later, on the order of 20B tokens per month, because their pricing is already very low.
Provider independence
This one is less dramatic but operationally real. API providers can change pricing, deprecate model versions, introduce rate limits, or experience outages. If your product SLA depends on LLM inference, building on a single external provider creates a dependency you cannot fully mitigate with retries. Running your own inference stack gives you version pinning, predictable throughput, and the ability to roll back to a previous model checkpoint if a new one regresses your evaluation metrics.
Lesson learned
We audited a legal-tech platform that had been running on OpenAI's API for 18 months. Their monthly token bill hit $40K when they expanded to a second customer segment. The team had always assumed they would "migrate later when it made sense." The migration cost them four months of infrastructure work because they had not designed for provider swappability from the start. The lesson: even if you start with an API, architect for swappability — an OpenAI-compatible interface layer costs almost nothing upfront and saves you from a painful migration later.
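A minimal sketch of what that interface layer can look like, using the OpenAI Python SDK against any OpenAI-compatible endpoint (the environment variable names and fallback model are illustrative, not prescribed):

```python
import os

from openai import OpenAI  # pip install openai

# The same client talks to OpenAI or to any OpenAI-compatible server
# (vLLM, TGI): swapping providers becomes a config change, not a rewrite.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("LLM_API_KEY", "not-needed-for-local-vllm"),
)

def generate(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model=os.environ.get("LLM_MODEL", "gpt-4o"),
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
```

Business logic calls generate() and never imports a provider SDK directly; pointing LLM_BASE_URL at a local vLLM instance is the whole migration.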
2. Open-weight LLM selection
The open-weight LLM landscape has genuinely converged to frontier-class quality at 70B parameter scale. The gap between GPT-4o and a well-quantized Llama 3 70B on standard RAG workloads — Q&A over internal documents, summarization, structured extraction — is in the range of 3–7 points on faithfulness metrics. That gap matters for some use cases and is imperceptible for others. Here is the decision tree we use.
Llama 3 70B: the production default
Meta's Llama 3 70B (and its 3.1/3.3 variants) is the most battle-tested open-weight model for production RAG. It has a strong instruction-following profile, a 128K context window in the 3.1 version, and a large ecosystem of fine-tunes and GGUF quantizations. In fp8 via vLLM on 2x H100 80GB, you get roughly 700–900 tokens per second of generation throughput with batch inference, which translates to sub-2-second P95 latency at moderate concurrent load.
Hardware requirement: 2x H100 80GB (fp8), or 4x A100 80GB (fp16). At H100 spot pricing of $2–3/hour per GPU, expect $4–6/hour for the inference cluster. On reserved 1-year pricing, that drops to roughly $2–3/hour total.
Mistral Small 3 (24B): the single-GPU pragmatist
Mistral Small 3 (24B parameters, Apache 2.0) runs on a single L40S or A100 80GB in fp16. It covers the vast majority of RAG use cases — factual Q&A, document summarization, procedure lookups — with 1–3 second latency and good multilingual performance. If your workload is moderate-volume and your primary language is not English, Mistral Small is often the better choice than Llama 3 70B because its European-language quality is stronger relative to its size. The 256K context window in Mistral Small 4 is also genuinely useful for long-document RAG where you want to inject more retrieved chunks without truncation.
Qwen 2.5 72B: for multilingual and code-heavy corpora
Qwen 2.5 72B (Alibaba, Apache 2.0) outperforms Llama 3 70B on MMLU and several coding benchmarks, and has stronger multilingual coverage across East Asian languages. If your document corpus mixes English with Chinese, Japanese, or Korean, or if your retrieval pipeline over codebases is a primary use case, Qwen 2.5 72B is worth benchmarking. Hardware requirements are similar to Llama 3 70B.
DeepSeek-V3: strong on reasoning, heavier on memory
DeepSeek-V3 (671B parameters total, MoE architecture with ~37B active) benchmarks exceptionally well on reasoning-heavy tasks — multi-document synthesis, contract analysis with interdependent clauses, financial report cross-referencing. However, the full model requires 8x H100 at minimum, which roughly quadruples your GPU footprint compared to a dense 70B model on 2x H100. For most RAG workloads where retrieval quality matters more than generation reasoning depth, the cost premium is not justified. Reserve DeepSeek-V3 for use cases where you have measured a meaningful gap with 70B-class models on your actual eval set.
What to avoid
Avoid 8B models for production RAG systems handling complex documents. Llama 3 8B and Mistral 7B have meaningfully higher hallucination rates on multi-document reasoning tasks. The infrastructure savings are real but the quality degradation shows up in user-facing errors. Use 8B models for prototyping, for single-document classification tasks, or for latency-critical classification-only sub-components where you have verified quality parity on your eval set.
Lesson learned
A client chose Llama 3 8B "for cost reasons" without benchmarking against their actual document corpus. Their faithfulness score was 0.71 on their internal eval set. Moving to Llama 3 70B in fp8 doubled infrastructure cost but raised faithfulness to 0.88 — which was the threshold where users stopped reporting wrong answers. The 8B model was not saving money; it was creating support tickets.
3. Embedding model selection
Your embedding model determines retrieval quality more than your LLM in most RAG systems. A poorly embedded corpus will starve even the best generator of relevant context. For the full treatment of embedding selection, see our embedding models guide. Here is the self-hosting-specific view.
BGE-M3: the self-hosted default
BGE-M3 (BAAI, MIT license) is the strongest general-purpose self-hostable embedding model as of 2026. It supports dense retrieval, sparse retrieval (lexical), and multi-vector (ColBERT-style) retrieval within a single model checkpoint, which means you can run hybrid search without deploying a separate sparse encoder. MTEB average retrieval score of ~54–56 across benchmarks. Runs on CPU for batch indexing (slow but functional), and on a single A10G or T4 GPU for real-time embedding at reasonable latency.
Key parameter: BGE-M3 produces 1024-dimensional dense vectors. If memory is a constraint in your vector store, consider truncating to 512 dimensions with Matryoshka-aware truncation — the quality degradation is minimal for most corpora.
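For illustration, here is what indexing with BGE-M3 looks like through the FlagEmbedding package, including the dimension truncation discussed above (the example sentences are placeholders, and how well quality holds after truncation is corpus-dependent; verify on your own eval set before committing):

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel  # pip install FlagEmbedding

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

docs = [
    "Termination clauses survive contract expiry.",
    "Invoices are payable on NET-30 terms.",
]
out = model.encode(docs, return_dense=True, return_sparse=True)

dense = np.asarray(out["dense_vecs"])   # shape (n_docs, 1024)
sparse = out["lexical_weights"]         # per-token weights, usable for hybrid search

# Truncate to 512 dims and re-normalize so cosine similarity stays meaningful.
truncated = dense[:, :512]
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
```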
E5-Mistral-7B-Instruct: for instruction-tuned retrieval
E5-Mistral-7B-Instruct (Microsoft, MIT license) is a 7B-parameter embedding model that uses instruction prefixes to condition retrieval on the query type. It tops several MTEB retrieval subtasks and is a strong choice when your queries are complex and heterogeneous — mixing factual lookups with reasoning-heavy questions. The trade-off: it requires a GPU for any real-time use (too slow on CPU), and at 7B parameters it is considerably larger than BGE-M3, which translates to higher memory footprint and higher batching latency.
NV-Embed-v2 and Voyage open weights
NVIDIA's NV-Embed-v2 ranks at the top of the MTEB leaderboard as of mid-2026. It is available under a research license — check current terms before using in a commercial deployment. Voyage AI's models are API-only (no open weights available at the time of writing). For deployments where you need top-of-leaderboard retrieval quality and can accept the license constraints, NV-Embed-v2 is worth evaluating. For fully open commercial use, BGE-M3 and E5-Mistral remain the cleaner choice.
Cross-encoder rerankers
Regardless of your first-stage embedding model, adding a cross-encoder reranker as a second retrieval stage typically improves precision@3 by 8–15 points on complex queries. BGE-Reranker-v2-M3 (BAAI, open weight) is the self-hosted default. Be aware of latency: a cross-encoder runs a full transformer forward pass for every query-chunk pair, which adds 200–800ms per query depending on hardware and batch size. Use async execution or limit reranking to the top-20 candidates from first-stage retrieval. This is one of the most common sources of latency regressions we see in production.
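A minimal reranking sketch with FlagEmbedding's FlagReranker (the query and candidate chunks are stand-ins; in production the candidates come from your first-stage retrieval):

```python
from FlagEmbedding import FlagReranker  # pip install FlagEmbedding

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "What is the notice period for termination?"
candidates = [
    "Either party may terminate with 30 days written notice.",
    "Payment is due within 30 days of invoice receipt.",
]

# One forward pass per (query, chunk) pair; keep the candidate list small.
scores = reranker.compute_score([[query, chunk] for chunk in candidates])
top3 = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:3]
```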
4. Self-hosted vector database
All three major self-hostable vector databases are production-ready. The choice depends on your existing infrastructure and query patterns. For a deeper comparison including managed cloud options, see our vector database comparison guide.
Qdrant: the production default for new deployments
Written in Rust, Qdrant is optimized for high-throughput ANN search with low memory overhead via scalar and product quantization. It supports native hybrid search combining dense HNSW with sparse BM25 vectors in a single query — useful for technical domains where exact keyword matching on product codes, contract numbers, or regulation identifiers matters alongside semantic similarity. Qdrant's payload filtering (filter by metadata before or during ANN search) is more performant under high-cardinality filters than Weaviate or pgvector. Deploy as a single Docker container for development, or as a Kubernetes StatefulSet with persistent volume claims for production. Scales to 100M+ vectors on a single node with quantization enabled.
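A sketch of that setup with the qdrant-client Python package (the collection name, filter key, and localhost URL are illustrative; the query vector stand-in would come from your embedding model, and newer clients also offer query_points in place of search):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 1024-dim dense vectors (BGE-M3) with int8 scalar quantization to cut RAM.
client.create_collection(
    collection_name="contracts",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8, always_ram=True
        )
    ),
)

query_vec = [0.0] * 1024  # stand-in for a real query embedding

# Payload filter applied during ANN traversal, not as post-filtering.
hits = client.search(
    collection_name="contracts",
    query_vector=query_vec,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="doc_type", match=models.MatchValue(value="msa"))]
    ),
    limit=20,
)
```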
pgvector: when you already run PostgreSQL
If your application already has a PostgreSQL deployment, pgvector is the pragmatic choice for corpora under ~5M vectors. Adding the extension costs nothing, your vectors live in the same transactional boundary as your document metadata, and you can join vector search results directly with relational filters in a single query. The HNSW index in pgvector (added in 0.5.0) closes most of the performance gap with dedicated vector databases at this scale. Beyond 5M vectors, query latency starts to degrade relative to Qdrant unless you tune ef_search and m parameters carefully — and you lose the operational simplicity that justified choosing pgvector in the first place.
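For illustration, a pgvector query that joins ANN search with a relational filter in a single statement, via psycopg (the schema, column names, and DSN are hypothetical):

```python
import psycopg  # pip install "psycopg[binary]"

# Hypothetical schema: documents(id, title, team, embedding vector(1024)).
query_vec = [0.0] * 1024  # stand-in for a real query embedding
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

with psycopg.connect("postgresql://localhost/appdb") as conn:
    rows = conn.execute(
        """
        SELECT id, title, embedding <=> %s::vector AS distance
        FROM documents
        WHERE team = %s                   -- relational filter, same query
        ORDER BY embedding <=> %s::vector -- <=> is pgvector cosine distance
        LIMIT 5
        """,
        (vec_literal, "legal", vec_literal),
    ).fetchall()
```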
Weaviate: for multi-tenant and multi-modal filtering
Weaviate shines when your RAG system serves multiple tenants with strict data isolation requirements, or when your retrieval logic involves combining semantic search with complex GraphQL-style property filters. Its native multi-tenancy model (each tenant gets isolated storage) is cleaner than implementing row-level security on top of Qdrant. If neither multi-tenancy nor multi-modal retrieval is in your requirements, Weaviate adds operational complexity without a meaningful performance advantage over Qdrant.
Lesson learned
Do not skip the re-indexing requirement when you upgrade your embedding model. On a project where we upgraded from a 384-dimension model to BGE-M3 (1024 dimensions), the team forgot to re-index the existing vector collection. For three weeks, new documents embedded with the new model lived alongside old documents embedded with the old model — and Qdrant was computing cosine similarity across incompatible embedding spaces. The retrieval quality degradation was silent and gradual. Production monitoring that tracks retrieval score distributions, not just generation faithfulness, would have caught it in hours.
5. Inference engine
Choosing the wrong inference engine is one of the most consequential early decisions in a self-hosted RAG deployment. It determines throughput, latency, hardware utilization, and how painful future model upgrades will be. The deeper engineering treatment — vLLM vs TGI vs TensorRT-LLM, GPU selection, autoscaling — is covered in deploying LLMs to production.
vLLM: the production default
vLLM implements PagedAttention — a memory management scheme for the KV cache that drastically reduces GPU memory fragmentation under concurrent requests. In practice, this means you can serve significantly more concurrent users on the same hardware compared to naive Hugging Face pipeline serving. vLLM exposes an OpenAI-compatible API, so migrating from OpenAI to a self-hosted model requires only changing base_url and api_key in your LangChain or LlamaIndex configuration. It supports fp8, int8, and int4 quantization via bitsandbytes, AWQ, and GPTQ — fp8 on H100 is the sweet spot: minimal quality degradation, ~40% memory reduction vs fp16, high throughput. Production throughput for Llama 3 70B in fp8 on 2x H100: approximately 800–1200 tokens/second for generation with moderate concurrent load (16–32 concurrent requests).
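A minimal offline-inference sketch showing the relevant engine knobs; the same arguments apply when launching the OpenAI-compatible server with `vllm serve` (model name and values are illustrative):

```python
from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,        # split weights across 2x H100
    quantization="fp8",
    gpu_memory_utilization=0.90,   # fraction of VRAM for weights + KV cache
    max_model_len=16384,           # cap context to protect KV cache headroom
)

params = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(["Summarize the retrieved clauses: ..."], params)
print(outputs[0].outputs[0].text)
```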
TGI (Text Generation Inference)
Hugging Face's TGI is vLLM's closest competitor. It has slightly better model coverage for edge cases and integrates naturally with the Hugging Face Hub model registry. Performance is broadly comparable to vLLM for most workloads. If your team is already in the Hugging Face ecosystem for model management and fine-tuning, TGI is a reasonable alternative. For greenfield deployments, vLLM's larger community and faster development velocity currently give it the edge.
What not to use in production
Ollama is excellent for local development and low-traffic internal tools. It is not designed for concurrent multi-user load — it processes requests serially by default. Do not deploy Ollama in front of a team of more than 5 concurrent users and expect latency SLAs to hold. The plain Hugging Face transformers pipeline() API has the same problem: no request batching, no KV cache management, no production serving features. Using it in production is the fastest way to saturate your GPU with a single active user.
6. Orchestration layer
The orchestration layer handles query intake, retrieval, context assembly, prompt construction, and response post-processing. In a self-hosted RAG, it also needs to manage the internal service topology — calling your local vLLM endpoint rather than OpenAI, routing to your Qdrant instance, etc.
LangGraph for stateful and multi-step pipelines
LangGraph (LangChain's graph-based orchestration layer) is the right choice for RAG pipelines with conditional logic: query decomposition branches, multi-hop retrieval loops, confidence-gated fallbacks, or human-in-the-loop steps. Its graph model makes state management explicit — you can serialize and inspect the full pipeline state at any node, which is invaluable for debugging retrieval failures. The OpenAI-compatible interface of vLLM means you plug in your local endpoint with a single config change. For complex agentic RAG patterns, see our article on Agentic RAG.
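A skeletal confidence-gated LangGraph pipeline (the node bodies are stubs standing in for your real retrieval and generation calls, and the 0.6 threshold is arbitrary):

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph  # pip install langgraph

class RAGState(TypedDict):
    question: str
    chunks: list[str]
    top_score: float
    answer: str

def retrieve(state: RAGState) -> dict:
    # Stub: your vector store call goes here.
    return {"chunks": ["Notice period is 30 days."], "top_score": 0.82}

def generate(state: RAGState) -> dict:
    # Stub: your local vLLM call goes here.
    return {"answer": f"Based on {len(state['chunks'])} chunks: ..."}

def fallback(state: RAGState) -> dict:
    return {"answer": "No reliable source found for this question."}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_node("fallback", fallback)
graph.set_entry_point("retrieve")
# Confidence gate: only generate when first-stage retrieval looks trustworthy.
graph.add_conditional_edges(
    "retrieve",
    lambda s: "generate" if s["top_score"] >= 0.6 else "fallback",
)
graph.add_edge("generate", END)
graph.add_edge("fallback", END)
app = graph.compile()

result = app.invoke({"question": "What is the notice period?"})
```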
LlamaIndex for document-centric RAG
LlamaIndex has stronger out-of-the-box abstractions for document ingestion, chunking, and index management — it handles the full lifecycle from raw document to indexed vector store more elegantly than LangChain. Its VectorStoreIndex integrates natively with Qdrant, pgvector, and Weaviate. For RAG systems where the primary complexity is in the document pipeline rather than the query routing logic, LlamaIndex is often the cleaner choice.
Custom orchestration
For production systems at scale, we often end up stripping framework abstractions and writing a thin custom orchestration layer. Frameworks add convenience in development but introduce latency overhead and debugging complexity in production. A custom async Python service using httpx for async calls to vLLM and Qdrant, with structlog for structured logging and OpenTelemetry for tracing, is often more maintainable long-term than a deep LangChain dependency tree. The key is building your own abstraction boundary so you can swap components without rewriting business logic.
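A condensed sketch of that pattern (the service hostnames, collection name, model name, and the payload key "text" are assumptions specific to this example):

```python
import httpx  # pip install httpx

VLLM_URL = "http://vllm:8000/v1/chat/completions"                  # placeholder host
QDRANT_URL = "http://qdrant:6333/collections/docs/points/search"   # placeholder host

async def answer(question: str, query_vec: list[float]) -> str:
    async with httpx.AsyncClient(timeout=30.0) as http:
        # First stage: ANN search through Qdrant's REST API.
        search = await http.post(
            QDRANT_URL,
            json={"vector": query_vec, "limit": 5, "with_payload": True},
        )
        # Assumes chunk text was indexed under payload key "text".
        chunks = [p["payload"]["text"] for p in search.json()["result"]]

        # Second stage: generation through vLLM's OpenAI-compatible endpoint.
        gen = await http.post(
            VLLM_URL,
            json={
                "model": "meta-llama/Llama-3.1-70B-Instruct",
                "messages": [{
                    "role": "user",
                    "content": "Context:\n" + "\n".join(chunks)
                               + f"\n\nQuestion: {question}",
                }],
                "temperature": 0.1,
            },
        )
        return gen.json()["choices"][0]["message"]["content"]

# Run with: asyncio.run(answer("What is the notice period?", query_vec))
```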
7. Observability
Observability in a self-hosted RAG is more complex than in an API-based setup because you now own the full stack. You cannot rely on OpenAI's dashboard for token usage, or Pinecone's console for vector search latency. You instrument everything yourself.
The minimum production instrumentation set:
- Per-request traces covering each pipeline stage: query embedding latency, vector search latency (p50/p95/p99), reranker latency (if applicable), LLM generation latency, total end-to-end latency, token counts (input context + generated), retrieved chunk scores and document IDs. A minimal tracing sketch follows this list.
- GPU utilization and memory metrics from vLLM's Prometheus endpoint. Key metrics: vllm:gpu_cache_usage_perc (KV cache pressure), vllm:num_requests_running, and generation throughput. Set alerts on KV cache usage above 90% — that is where latency degrades sharply.
- Retrieval quality metrics: track the distribution of top-1 retrieval scores over time. A downward drift in average top-1 cosine similarity often precedes faithfulness degradation by days or weeks.
- LLM-as-judge eval on production samples: weekly sampling of 50 real queries through your RAGAS or DeepEval pipeline. This is the one signal that cannot be replaced by infrastructure metrics.
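A minimal OpenTelemetry sketch for the per-request stage traces in the first bullet (search and generate are stubs standing in for your retrieval and inference calls; the attribute names are a suggested convention, not a standard):

```python
from opentelemetry import trace  # pip install opentelemetry-api

tracer = trace.get_tracer("rag.pipeline")

def search(question: str):  # stub: your Qdrant call goes here
    return ["Notice period is 30 days."], [0.82]

def generate(question: str, chunks):  # stub: your vLLM call goes here
    return "30 days.", {"prompt_tokens": 900, "completion_tokens": 40}

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request"):
        with tracer.start_as_current_span("rag.vector_search") as span:
            chunks, scores = search(question)
            span.set_attribute("retrieval.top1_score", scores[0])
            span.set_attribute("retrieval.k", len(chunks))
        with tracer.start_as_current_span("rag.generation") as span:
            text, usage = generate(question, chunks)
            span.set_attribute("llm.input_tokens", usage["prompt_tokens"])
            span.set_attribute("llm.output_tokens", usage["completion_tokens"])
        return text
```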
Langfuse (open-source, self-hostable) is currently the best option for RAG-specific observability including cost tracking, trace visualization, and eval integration. Phoenix (Arize, open-source) is a strong alternative with better out-of-the-box embedding visualization. Both export to OpenTelemetry-compatible backends if you want to consolidate into an existing observability stack.
8. Cost model and break-even
This is the section people get wrong most often — usually because they compare the wrong things. You should not compare self-hosted GPU cost to API token cost in isolation. You need to account for: GPU reserved vs spot pricing, amortized engineering overhead, ops burden, and the fully-loaded cost of model evaluation and maintenance. The table below uses conservative estimates.
| Monthly token volume | GPT-4o API cost | Self-hosted cost (Llama 3 70B, 2x H100 reserved) | Delta |
|---|---|---|---|
| 10M tokens | ~$25 | ~$1,800 (GPU + ops) | API cheaper by ~72x |
| 100M tokens | ~$250 | ~$1,800 | API cheaper by ~7x |
| 500M tokens | ~$1,250 | ~$1,800 | API cheaper by ~1.4x |
| 1B tokens | ~$2,500 | ~$2,500 (extra capacity for peak load) | Break-even |
| 5B tokens | ~$12,500 | ~$3,500 | Self-hosted cheaper by ~3.5x |
Assumptions: GPT-4o at $2.50/1M input + $10/1M output. RAG traffic is heavily input-dominated (long retrieved context, short answers), so the blended rate lands near the input price, ~$2.50/1M. Self-hosted: 2x H100 80GB reserved at ~$2/hour for the pair (~$1,500/month), plus $300/month ops overhead (vector DB, object storage, monitoring). GPU costs vary significantly by cloud provider and region.
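The break-even arithmetic under those assumptions, as a quick check you can rerun with your own prices:

```python
# Quick check of the table's break-even under its stated assumptions.
api_rate = 2.50 / 1_000_000     # $/token, input-dominated GPT-4o blend
base_monthly = 1_800.0          # 2x H100 reserved + ops, $/month
peak_monthly = 2_500.0          # with extra capacity for peak load

print(f"compute break-even: {base_monthly / api_rate / 1e6:.0f}M tokens/month")  # ~720M
print(f"with peak capacity: {peak_monthly / api_rate / 1e6:.0f}M tokens/month")  # ~1000M
```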
A few things this table does not capture:
- Engineering time to set up and maintain the self-hosted stack: budget 2–4 weeks of initial setup and 1–2 days per month of ongoing maintenance. That is a real cost.
- Model quality gap: if self-hosted Llama 3 70B has 5 points lower faithfulness than GPT-4o on your workload, that gap has a dollar value in user-facing errors. Measure it on your actual corpus before assuming parity.
- The embedding cost is separate. BGE-M3 on a single A10G (~$0.60/hr) handles embedding comfortably up to high volume. OpenAI text-embedding-3-small at $0.02/1M tokens is so cheap that self-hosting embeddings is rarely cost-motivated — it is motivated by data residency, not cost.
The practical implication: if your primary driver is cost, self-hosting makes sense at 500M+ tokens per month, decisively at 1B+. If your primary driver is compliance or IP control, the break-even analysis is secondary — you are paying for removal of a constraint, not for cheaper tokens.
9. Trade-offs and failure modes
Self-hosting is not free of failure. Beyond the standard RAG failure modes documented in our production RAG article, there is a set of failure modes specific to self-hosted deployments.
GPU memory fragmentation under load
vLLM's PagedAttention is designed to minimize this, but you can still hit KV cache exhaustion under high concurrent load if your context windows are large. Tuning gpu_memory_utilization (default 0.9 — lower this to 0.85 if you see OOM errors) and max_num_batched_tokens (aligned to your hardware and typical context window size) resolves most of these issues. Monitor vllm:gpu_cache_usage_perc continuously.
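A small sketch of watching that metric directly from vLLM's /metrics endpoint (the URL is a placeholder, and metric names have shifted between vLLM releases, so confirm the name against your version):

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # placeholder host/port

def kv_cache_usage() -> float:
    """Parse vllm:gpu_cache_usage_perc (a 0-1 fraction) from Prometheus text."""
    body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    match = re.search(
        r"^vllm:gpu_cache_usage_perc(?:\{[^}]*\})?\s+([0-9.eE+-]+)", body, re.M
    )
    return float(match.group(1)) if match else 0.0

usage = kv_cache_usage()
if usage > 0.90:
    print(f"ALERT: KV cache at {usage:.0%}; latency will degrade, shed load or scale out")
```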
Model versioning and rollback
API providers handle model versioning for you. Self-hosted means you own this. Pin your model weights to a specific revision in your model registry (use Hugging Face Hub revision hashes or your own S3-based registry). When you upgrade the base model, run your full eval suite before promoting to production. Define a rollback procedure — a simple shell script that restarts vLLM with the previous model path — and test it before you need it.
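A sketch of revision pinning with huggingface_hub (the revision hash and local path are placeholders, not real values):

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Pin weights to an exact commit hash so "the model" is reproducible.
model_path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",
    revision="a1b2c3d",                        # hash from your eval sign-off
    local_dir="/models/llama-3.1-70b/a1b2c3d", # placeholder path
)
# Point vLLM at model_path; rollback = restart vLLM with the previous pinned dir.
```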
Embedding-generator mismatch after updates
If you update your embedding model and re-index without updating the generator's system prompt or retrieval configuration, or conversely update the generator without re-verifying retrieval quality, you will see silent quality degradation. Treat the embedding model and the vector index as a coupled versioned artifact. When either changes, re-run your full retrieval eval set before promoting.
The document parsing bottleneck
The weakest link in most self-hosted RAG deployments is not the LLM or the vector store — it is document parsing. PDF extraction quality determines the ceiling of your retrieval quality. Complex PDFs with multi-column layouts, embedded tables, and scanned pages require dedicated parsing infrastructure. Docling (IBM, Apache 2.0) and Unstructured (open-source tier) are the current best options for production-grade PDF parsing. Allocate real engineering time here — under-investing in parsing quality while over-engineering inference is the single most common mistake we see on new self-hosted RAG projects.
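A minimal Docling conversion sketch (the file path is a placeholder):

```python
from docling.document_converter import DocumentConverter  # pip install docling

converter = DocumentConverter()
result = converter.convert("contracts/msa_2025.pdf")

# Markdown export preserves heading and table structure, which downstream
# chunkers can split on instead of guessing boundaries in raw text.
markdown = result.document.export_to_markdown()
```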
Ops burden is not zero
GPU servers require monitoring, patching, and capacity planning. Model updates require evaluation pipelines. Vector indexes require maintenance (re-indexing, index optimization). This is somewhere between 0.5 and 1 FTE of ongoing infrastructure work, depending on the complexity of your deployment. If you do not have that capacity, a managed deployment on a compliant cloud provider — using open-weight models on AWS SageMaker, Azure ML, or OVHcloud AI Deploy — is a valid middle ground that addresses data residency without the full ops burden of bare-metal self-hosting.
10. Reference architecture
The following table summarizes the component choices for a production self-hosted RAG system across three scale tiers. These are starting points, not prescriptions — benchmark against your own corpus and query distribution before committing.
| Component | Small (internal team, <50 users) | Medium (product feature, 50–500 users) | Large (>500 users, >500M tokens/month) |
|---|---|---|---|
| LLM | Mistral Small 3 (24B) | Llama 3 70B fp8 | Llama 3 70B fp8 (multi-node) or DeepSeek-V3 for reasoning tasks |
| GPU | 1x L40S (48GB) | 2x H100 80GB | 4x H100 80GB per node, autoscaled |
| Inference engine | vLLM or TGI | vLLM | vLLM with tensor parallelism + load balancer |
| Embedding model | BGE-M3 (CPU batch) | BGE-M3 (A10G) | BGE-M3 or E5-Mistral-7B (dedicated GPU) |
| Reranker | Optional — BGE-Reranker-v2-M3 | BGE-Reranker-v2-M3 async | BGE-Reranker-v2-M3 batched async |
| Vector DB | pgvector (if PostgreSQL exists) or Qdrant single node | Qdrant single node | Qdrant cluster (Kubernetes StatefulSet) |
| Orchestration | LlamaIndex | LangGraph or custom async Python | Custom async Python + LangGraph for complex flows |
| Observability | Langfuse (self-hosted) | Langfuse + Prometheus/Grafana | Langfuse + Prometheus + OpenTelemetry + weekly eval pipeline |
| Infrastructure | Single VM or bare-metal | Docker Compose or lightweight K8s | Kubernetes with GPU node pools, HPA |
| Approx. infra cost/month | $600–900 | $2,000–3,000 | $6,000–15,000+ |
A note on Kubernetes: for the small tier, Kubernetes adds more operational overhead than value. Docker Compose or a simple systemd-managed vLLM process is easier to maintain and debug. Only move to Kubernetes when you need autoscaling, multi-node GPU scheduling, or you already run a K8s cluster for other workloads.
Evaluating a self-hosted RAG deployment?
We help engineering teams design, audit, and deploy production RAG systems — API-based or fully self-hosted. We do not recommend self-hosting unless the numbers and constraints justify it.
Frequently asked questions
At what token volume does self-hosting become cheaper than an API?

For GPT-4o at $2.50/1M input tokens, self-hosting a Llama 3 70B stack (2x H100 reserved at ~$2/hour for the pair, vLLM serving ~800 tokens/second) becomes cheaper at roughly 700M tokens per month in raw compute terms. Including engineering overhead, the real break-even is closer to 1B tokens/month. Below those volumes, compliance and control are the real reasons to self-host, not cost savings.
Which open-weight LLM should you run for production RAG?

Llama 3 70B (fp8 via vLLM) is the most battle-tested default: strong instruction following, 128K context window, runs on 2x H100. Mistral Small 3 (24B) is the single-GPU pragmatist for moderate-volume workloads and strong European-language performance. Qwen 2.5 72B outperforms on multilingual and coding-heavy corpora. Avoid 8B models for complex multi-document reasoning — the hallucination rate increase is measurable and user-facing.
Which vector database should you self-host?

Qdrant is the production default for new self-hosted deployments: Rust-native, supports HNSW with quantization, native hybrid search, scales to 100M+ vectors. pgvector is right when you already run PostgreSQL and stay under ~5M vectors — zero new infrastructure, transactional consistency with metadata. Weaviate adds value for multi-tenant or multi-modal filtering. Avoid Chroma in production — it lacks the durability guarantees of the other three.
Does GDPR or HIPAA require self-hosting?

Neither GDPR nor HIPAA mandates on-premise hosting per se, but both impose constraints that self-hosting makes structurally simpler to meet. Under GDPR, transferring personal data to US-based processors requires an adequacy decision or SCCs — and the legal basis remains challenged. Under HIPAA, commercial LLM APIs raise questions about BAA scope and training data use. Self-hosting removes these dependencies entirely. For HIPAA-covered entities processing PHI through an LLM, on-premise or private cloud deployment is typically the lowest-risk path.
Which inference engine should you use?

vLLM is the production default. It implements PagedAttention for high-throughput batched inference, supports fp8 and int4 quantization, and exposes an OpenAI-compatible API. TGI (Hugging Face) is a solid alternative for teams already in the HF ecosystem. Ollama is fine for local development and very low traffic, but has no batching for concurrent users. Never use the raw Hugging Face pipeline in production — no request batching, no KV cache management.
What are the most common self-hosted RAG failure modes?

Three failure modes appear most often: (1) GPU KV cache exhaustion under concurrent load — tune vLLM's gpu_memory_utilization and monitor the cache_usage_perc metric; (2) model versioning without a rollback strategy — pin weights to a specific registry revision and test rollback before you need it; (3) embedding-generator mismatch after updates — if you update your embedding model without re-indexing, retrieval quality degrades silently. Treat the embedding model and vector index as a coupled versioned artifact.
Further reading
- RAG: A Technical Guide — How RAG works end-to-end, chunking strategies, vector stores, and when to use RAG vs. fine-tuning.
- Production RAG: 5 failure modes we keep seeing — Retrieval-generation mismatch, eval gaps, latency and cost issues. Applies to self-hosted and API-based RAG alike.
- Vector database comparison — Pinecone vs Qdrant vs Weaviate vs pgvector. Includes managed cloud options and HNSW tuning guidance.
- Embedding models in 2026 — Full guide to embedding model selection, MTEB benchmarks, Matryoshka dimensions, and when fine-tuning pays off.
- Hybrid search and reranking — Dense + sparse retrieval and cross-encoder reranking. The highest-ROI retrieval improvement for most production systems.
- LoRA and QLoRA fine-tuning guide — When and how to fine-tune open-weight models. Relevant when self-hosted RAG quality on specialized corpora needs domain adaptation.
- Deploying LLMs to production — Infrastructure guide covering vLLM, TGI, quantization, autoscaling, and cost modeling in depth.
- Optimize a RAG system: 5 levers — What actually moves recall and faithfulness once your self-hosted stack is up.
- RAG project costs and TCO — Breakdown of capex vs opex when comparing managed vs self-hosted RAG.
- 3 enterprise RAG use cases with measured ROI — Concrete deployment patterns and the numbers behind them.
- Our RAG systems service — Tensoria's end-to-end RAG deployment service, from architecture design to production rollout and eval infrastructure.
- vLLM documentation — Reference documentation for the inference engine used throughout this guide.
- Qdrant documentation — Deployment, HNSW configuration, hybrid search, and quantization guides.
- BGE-M3 on Hugging Face — Model card and usage documentation for the recommended self-hosted embedding model.