The default RAG stack in 2026 is still: OpenAI embeddings, a managed vector store, and GPT-4o or Claude for generation. That stack has low operational overhead, strong baseline performance, and zero GPU procurement headaches. For a large portion of use cases, it is the right call.
There is a different set of use cases where that default creates real problems: regulated industries where data residency is legally constrained, high-volume deployments where per-token API cost compounds to six figures annually, and IP-sensitive verticals where sending proprietary documents through a commercial inference endpoint is not a policy you can sign off on. For those situations, a fully self-hosted RAG architecture — open-weight LLM, self-managed vector store, on-premise inference — is not ideological but practical.
This guide covers the engineering decisions involved: which models to run, which inference stack to use, which vector databases hold up in production, how the cost model actually works, and where self-hosted deployments fail. If you want the foundations of RAG before going further, start with our RAG technical guide. If you already have a production RAG system and it is underperforming, read Production RAG: 5 failure modes we keep seeing first.
1. When self-hosting is actually justified
Self-hosting a RAG system means running your own inference servers, managing GPU capacity, operating your own vector database, and taking on the engineering overhead of a small internal ML platform team. That is a meaningful cost — not just in dollars, but in time, expertise, and operational risk. Do not do it unless you have a clear reason.
There are four situations where that trade-off holds up:
Compliance and data residency requirements
HIPAA. Sending PHI (protected health information) through a commercial LLM API requires a signed Business Associate Agreement. AWS, Azure, and GCP offer BAAs. Most LLM API providers do not, or their BAAs exclude model training data use in ways that create audit ambiguity. Healthcare organizations running clinical document retrieval — discharge summaries, clinical notes, lab results — typically cannot accept that ambiguity. On-premise inference eliminates the dependency.
GDPR and EU AI Act. Under GDPR, transferring personal data to US-based processors requires either an adequacy decision or Standard Contractual Clauses. The legal basis for transfers to US cloud providers has been challenged repeatedly. The EU AI Act, whose obligations phase in through 2026, adds transparency and traceability requirements that are structurally simpler to meet when you control the full inference chain. Organizations handling the personal data of EU residents in insurance, legal, or financial services increasingly require in-region processing.
Sovereign cloud customers. Defense primes, government contractors, and critical infrastructure operators often have contractual or regulatory requirements specifying that data processing must remain within a defined perimeter — on-premise, national cloud, or classified network. Commercial API endpoints are categorically out of scope for these environments.
IP and confidentiality sensitivity
A RAG system over your internal knowledge base is, by design, exposing your most sensitive documents to the inference pipeline. For most enterprises, the commercial API terms of major providers are clear enough: data is not used for training. But in practice, legal teams at IP-intensive companies — pharmaceutical R&D, semiconductor design, M&A advisory — may not be comfortable accepting those terms as sufficient protection. Self-hosting removes the question entirely.
Cost at scale
API pricing makes sense at low volume and for prototypes. It stops making sense once token throughput compounds. The break-even analysis is in the cost model section below. The short version: for most setups, self-hosting becomes cheaper than GPT-4o-class APIs somewhere between 500M and 1B tokens per month. For embedding specifically, the crossover with OpenAI's cheapest model happens far later, on the order of 20B tokens per month, because their pricing is already very low.
Provider independence
This one is less dramatic but operationally real. API providers can change pricing, deprecate model versions, introduce rate limits, or experience outages. If your product SLA depends on LLM inference, building on a single external provider creates a dependency you cannot fully mitigate with retries. Running your own inference stack gives you version pinning, predictable throughput, and the ability to roll back to a previous model checkpoint if a new one regresses your evaluation metrics.
Lesson learned
We audited a legal-tech platform that had been running on OpenAI's API for 18 months. Their monthly token bill hit $40K when they expanded to a second customer segment. The team had always assumed they would "migrate later when it made sense." The migration cost them four months of infrastructure work because they had not designed for provider swappability from the start. The lesson: even if you start with an API, architect for swappability — an OpenAI-compatible interface layer costs almost nothing upfront and saves you from a painful migration later.
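A minimal sketch of what that interface layer can look like, using the OpenAI Python SDK against any OpenAI-compatible endpoint (the environment variable names and fallback model are illustrative, not prescribed):

```python
import os

from openai import OpenAI  # pip install openai

# The same client talks to OpenAI or to any OpenAI-compatible server
# (vLLM, TGI): swapping providers becomes a config change, not a rewrite.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ.get("LLM_API_KEY", "not-needed-for-local-vllm"),
)

def generate(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model=os.environ.get("LLM_MODEL", "gpt-4o"),
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
```

Business logic calls generate() and never imports a provider SDK directly; pointing LLM_BASE_URL at a local vLLM instance is the whole migration.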
2. Open-weight LLM selection
The open-weight LLM landscape has genuinely converged to frontier-class quality at 70B parameter scale. The gap between GPT-4o and a well-quantized Llama 3 70B on standard RAG workloads — Q&A over internal documents, summarization, structured extraction — is in the range of 3–7 points on faithfulness metrics. That gap matters for some use cases and is imperceptible for others. Here is the decision tree we use.
Llama 3 70B: the production default
Meta's Llama 3 70B (and its 3.1/3.3 variants) is the most battle-tested open-weight model for production RAG. It has a strong instruction-following profile, a 128K context window in the 3.1 version, and a large ecosystem of fine-tunes and GGUF quantizations. In fp8 via vLLM on 2x H100 80GB, you get roughly 700–900 tokens per second of generation throughput with batch inference, which translates to sub-2-second P95 latency at moderate concurrent load.
Hardware requirement: 2x H100 80GB (fp8), or 4x A100 80GB (fp16). At H100 spot pricing of $2–3/hour per GPU, expect $4–6/hour for the inference cluster. On reserved 1-year pricing, that drops to roughly $2–3/hour total.
Mistral Small 3 (24B): the single-GPU pragmatist
Mistral Small 3 (24B parameters, Apache 2.0) runs on a single L40S or A100 80GB in fp16. It covers the vast majority of RAG use cases — factual Q&A, document summarization, procedure lookups — with 1–3 second latency and good multilingual performance. If your workload is moderate-volume and your primary language is not English, Mistral Small is often the better choice than Llama 3 70B because its European-language quality is stronger relative to its size. The 256K context window in Mistral Small 4 is also genuinely useful for long-document RAG where you want to inject more retrieved chunks without truncation.
Qwen 2.5 72B: for multilingual and code-heavy corpora
Qwen 2.5 72B (Alibaba, Apache 2.0) outperforms Llama 3 70B on MMLU and several coding benchmarks, and has stronger multilingual coverage across East Asian languages. If your document corpus mixes English with Chinese, Japanese, or Korean, or if your retrieval pipeline over codebases is a primary use case, Qwen 2.5 72B is worth benchmarking. Hardware requirements are similar to Llama 3 70B.
DeepSeek-V3: strong on reasoning, heavier on memory
DeepSeek-V3 (671B parameters total, MoE architecture with ~37B active) benchmarks exceptionally well on reasoning-heavy tasks — multi-document synthesis, contract analysis with interdependent clauses, financial report cross-referencing. However, the full model requires 8x H100 at minimum, which roughly quadruples your GPU footprint compared to a dense 70B model on 2x H100. For most RAG workloads where retrieval quality matters more than generation reasoning depth, the cost premium is not justified. Reserve DeepSeek-V3 for use cases where you have measured a meaningful gap with 70B-class models on your actual eval set.
What to avoid
Avoid 8B models for production RAG systems handling complex documents. Llama 3 8B and Mistral 7B have meaningfully higher hallucination rates on multi-document reasoning tasks. The infrastructure savings are real but the quality degradation shows up in user-facing errors. Use 8B models for prototyping, for single-document classification tasks, or for latency-critical classification-only sub-components where you have verified quality parity on your eval set.
Lesson learned
A client chose Llama 3 8B "for cost reasons" without benchmarking against their actual document corpus. Their faithfulness score was 0.71 on their internal eval set. Moving to Llama 3 70B in fp8 doubled infrastructure cost but raised faithfulness to 0.88 — which was the threshold where users stopped reporting wrong answers. The 8B model was not saving money; it was creating support tickets.
3. Embedding model selection
Your embedding model determines retrieval quality more than your LLM in most RAG systems. A poorly embedded corpus will starve even the best generator of relevant context. For the full treatment of embedding selection, see our embedding models guide. Here is the self-hosting-specific view.
BGE-M3: the self-hosted default
BGE-M3 (BAAI, MIT license) is the strongest general-purpose self-hostable embedding model as of 2026. It supports dense retrieval, sparse retrieval (lexical), and multi-vector (ColBERT-style) retrieval within a single model checkpoint, which means you can run hybrid search without deploying a separate sparse encoder. MTEB average retrieval score of ~54–56 across benchmarks. Runs on CPU for batch indexing (slow but functional), and on a single A10G or T4 GPU for real-time embedding at reasonable latency.
Key parameter: BGE-M3 produces 1024-dimensional dense vectors. If memory is a constraint in your vector store, consider truncating to 512 dimensions with Matryoshka-aware truncation — the quality degradation is minimal for most corpora.
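For illustration, here is what indexing with BGE-M3 looks like through the FlagEmbedding package, including the dimension truncation discussed above (the example sentences are placeholders, and how well quality holds after truncation is corpus-dependent; verify on your own eval set before committing):

```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel  # pip install FlagEmbedding

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

docs = [
    "Termination clauses survive contract expiry.",
    "Invoices are payable on NET-30 terms.",
]
out = model.encode(docs, return_dense=True, return_sparse=True)

dense = np.asarray(out["dense_vecs"])   # shape (n_docs, 1024)
sparse = out["lexical_weights"]         # per-token weights, usable for hybrid search

# Truncate to 512 dims and re-normalize so cosine similarity stays meaningful.
truncated = dense[:, :512]
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
```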
E5-Mistral-7B-Instruct: for instruction-tuned retrieval
E5-Mistral-7B-Instruct (Microsoft, MIT license) is a 7B-parameter embedding model that uses instruction prefixes to condition retrieval on the query type. It tops several MTEB retrieval subtasks and is a strong choice when your queries are complex and heterogeneous — mixing factual lookups with reasoning-heavy questions. The trade-off: it requires a GPU for any real-time use (too slow on CPU), and at 7B parameters it is considerably larger than BGE-M3, which translates to higher memory footprint and higher batching latency.
NV-Embed-v2 and Voyage open weights
NVIDIA's NV-Embed-v2 ranks at the top of the MTEB leaderboard as of mid-2026. It is available under a research license — check current terms before using in a commercial deployment. Voyage AI's models are API-only (no open weights available at the time of writing). For deployments where you need top-of-leaderboard retrieval quality and can accept the license constraints, NV-Embed-v2 is worth evaluating. For fully open commercial use, BGE-M3 and E5-Mistral remain the cleaner choice.
Cross-encoder rerankers
Regardless of your first-stage embedding model, adding a cross-encoder reranker as a second retrieval stage typically improves precision@3 by 8–15 points on complex queries. BGE-Reranker-v2-M3 (BAAI, open weight) is the self-hosted default. Be aware of latency: a cross-encoder runs a full transformer forward pass for every query-chunk pair, which adds 200–800ms per query depending on hardware and batch size. Use async execution or limit reranking to the top-20 candidates from first-stage retrieval. This is one of the most common sources of latency regressions we see in production.
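A minimal reranking sketch with FlagEmbedding's FlagReranker (the query and candidate chunks are stand-ins; in production the candidates come from your first-stage retrieval):

```python
from FlagEmbedding import FlagReranker  # pip install FlagEmbedding

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "What is the notice period for termination?"
candidates = [
    "Either party may terminate with 30 days written notice.",
    "Payment is due within 30 days of invoice receipt.",
]

# One forward pass per (query, chunk) pair; keep the candidate list small.
scores = reranker.compute_score([[query, chunk] for chunk in candidates])
top3 = [c for _, c in sorted(zip(scores, candidates), reverse=True)][:3]
```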
4. Self-hosted vector database
All three major self-hostable vector databases are production-ready. The choice depends on your existing infrastructure and query patterns. For a deeper comparison including managed cloud options, see our vector database comparison guide.
Qdrant: the production default for new deployments
Written in Rust, Qdrant is optimized for high-throughput ANN search with low memory overhead via scalar and product quantization. It supports native hybrid search combining dense HNSW with sparse BM25 vectors in a single query — useful for technical domains where exact keyword matching on product codes, contract numbers, or regulation identifiers matters alongside semantic similarity. Qdrant's payload filtering (filter by metadata before or during ANN search) is more performant under high-cardinality filters than Weaviate or pgvector. Deploy as a single Docker container for development, or as a Kubernetes StatefulSet with persistent volume claims for production. Scales to 100M+ vectors on a single node with quantization enabled.
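A sketch of that setup with the qdrant-client Python package (the collection name, filter key, and localhost URL are illustrative; the query vector stand-in would come from your embedding model, and newer clients also offer query_points in place of search):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# 1024-dim dense vectors (BGE-M3) with int8 scalar quantization to cut RAM.
client.create_collection(
    collection_name="contracts",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8, always_ram=True
        )
    ),
)

query_vec = [0.0] * 1024  # stand-in for a real query embedding

# Payload filter applied during ANN traversal, not as post-filtering.
hits = client.search(
    collection_name="contracts",
    query_vector=query_vec,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="doc_type", match=models.MatchValue(value="msa"))]
    ),
    limit=20,
)
```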
pgvector: when you already run PostgreSQL
If your application already has a PostgreSQL deployment, pgvector is the pragmatic choice for corpora under ~5M vectors. Adding the extension costs nothing, your vectors live in the same transactional boundary as your document metadata, and you can join vector search results directly with relational filters in a single query. The HNSW index in pgvector (added in 0.5.0) closes most of the performance gap with dedicated vector databases at this scale. Beyond 5M vectors, query latency starts to degrade relative to Qdrant unless you tune ef_search and m parameters carefully — and you lose the operational simplicity that justified choosing pgvector in the first place.
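For illustration, a pgvector query that joins ANN search with a relational filter in a single statement, via psycopg (the schema, column names, and DSN are hypothetical):

```python
import psycopg  # pip install "psycopg[binary]"

# Hypothetical schema: documents(id, title, team, embedding vector(1024)).
query_vec = [0.0] * 1024  # stand-in for a real query embedding
vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"

with psycopg.connect("postgresql://localhost/appdb") as conn:
    rows = conn.execute(
        """
        SELECT id, title, embedding <=> %s::vector AS distance
        FROM documents
        WHERE team = %s                   -- relational filter, same query
        ORDER BY embedding <=> %s::vector -- <=> is pgvector cosine distance
        LIMIT 5
        """,
        (vec_literal, "legal", vec_literal),
    ).fetchall()
```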
Weaviate: for multi-tenant and multi-modal filtering
Weaviate shines when your RAG system serves multiple tenants with strict data isolation requirements, or when your retrieval logic involves combining semantic search with complex GraphQL-style property filters. Its native multi-tenancy model (each tenant gets isolated storage) is cleaner than implementing row-level security on top of Qdrant. If neither multi-tenancy nor multi-modal retrieval is in your requirements, Weaviate adds operational complexity without a meaningful performance advantage over Qdrant.
Lesson learned
Do not skip the re-indexing requirement when you upgrade your embedding model. On a project where we upgraded from a 384-dimension model to BGE-M3 (1024 dimensions), the team forgot to re-index the existing vector collection. For three weeks, new documents embedded with the new model lived alongside old documents embedded with the old model — and Qdrant was computing cosine similarity across incompatible embedding spaces. The retrieval quality degradation was silent and gradual. Production monitoring that tracks retrieval score distributions, not just generation faithfulness, would have caught it in hours.
5. Inference engine
Choosing the wrong inference engine is one of the most consequential early decisions in a self-hosted RAG deployment. It determines throughput, latency, hardware utilization, and how painful future model upgrades will be. The deeper engineering treatment — vLLM vs TGI vs TensorRT-LLM, GPU selection, autoscaling — is covered in deploying LLMs to production.
vLLM: the production default
vLLM implements PagedAttention — a memory management scheme for the KV cache that drastically reduces GPU memory fragmentation under concurrent requests. In practice, this means you can serve significantly more concurrent users on the same hardware compared to naive Hugging Face pipeline serving. vLLM exposes an OpenAI-compatible API, so migrating from OpenAI to a self-hosted model requires only changing base_url and api_key in your LangChain or LlamaIndex configuration. It supports fp8, int8, and int4 quantization via bitsandbytes, AWQ, and GPTQ — fp8 on H100 is the sweet spot: minimal quality degradation, ~40% memory reduction vs fp16, high throughput. Production throughput for Llama 3 70B in fp8 on 2x H100: approximately 800–1200 tokens/second for generation with moderate concurrent load (16–32 concurrent requests).
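A minimal offline-inference sketch showing the relevant engine knobs; the same arguments apply when launching the OpenAI-compatible server with `vllm serve` (model name and values are illustrative):

```python
from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=2,        # split weights across 2x H100
    quantization="fp8",
    gpu_memory_utilization=0.90,   # fraction of VRAM for weights + KV cache
    max_model_len=16384,           # cap context to protect KV cache headroom
)

params = SamplingParams(temperature=0.1, max_tokens=512)
outputs = llm.generate(["Summarize the retrieved clauses: ..."], params)
print(outputs[0].outputs[0].text)
```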
TGI (Text Generation Inference)
Hugging Face's TGI is vLLM's closest competitor. It has slightly better model coverage for edge cases and integrates naturally with the Hugging Face Hub model registry. Performance is broadly comparable to vLLM for most workloads. If your team is already in the Hugging Face ecosystem for model management and fine-tuning, TGI is a reasonable alternative. For greenfield deployments, vLLM's larger community and faster development velocity currently give it the edge.
What not to use in production
Ollama is excellent for local development and low-traffic internal tools. It is not designed for concurrent multi-user load — it processes requests serially by default. Do not deploy Ollama in front of a team of more than 5 concurrent users and expect latency SLAs to hold. The plain Hugging Face transformers pipeline() API has the same problem: no request batching, no KV cache management, no production serving features. Using it in production is the fastest way to saturate your GPU with a single active user.
6. Orchestration layer
The orchestration layer handles query intake, retrieval, context assembly, prompt construction, and response post-processing. In a self-hosted RAG, it also needs to manage the internal service topology — calling your local vLLM endpoint rather than OpenAI, routing to your Qdrant instance, etc.
LangGraph for stateful and multi-step pipelines
LangGraph (LangChain's graph-based orchestration layer) is the right choice for RAG pipelines with conditional logic: query decomposition branches, multi-hop retrieval loops, confidence-gated fallbacks, or human-in-the-loop steps. Its graph model makes state management explicit — you can serialize and inspect the full pipeline state at any node, which is invaluable for debugging retrieval failures. The OpenAI-compatible interface of vLLM means you plug in your local endpoint with a single config change. For complex agentic RAG patterns, see our article on Agentic RAG.
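A skeletal confidence-gated LangGraph pipeline (the node bodies are stubs standing in for your real retrieval and generation calls, and the 0.6 threshold is arbitrary):

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph  # pip install langgraph

class RAGState(TypedDict):
    question: str
    chunks: list[str]
    top_score: float
    answer: str

def retrieve(state: RAGState) -> dict:
    # Stub: your vector store call goes here.
    return {"chunks": ["Notice period is 30 days."], "top_score": 0.82}

def generate(state: RAGState) -> dict:
    # Stub: your local vLLM call goes here.
    return {"answer": f"Based on {len(state['chunks'])} chunks: ..."}

def fallback(state: RAGState) -> dict:
    return {"answer": "No reliable source found for this question."}

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.add_node("fallback", fallback)
graph.set_entry_point("retrieve")
# Confidence gate: only generate when first-stage retrieval looks trustworthy.
graph.add_conditional_edges(
    "retrieve",
    lambda s: "generate" if s["top_score"] >= 0.6 else "fallback",
)
graph.add_edge("generate", END)
graph.add_edge("fallback", END)
app = graph.compile()

result = app.invoke({"question": "What is the notice period?"})
```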
LlamaIndex for document-centric RAG
LlamaIndex has stronger out-of-the-box abstractions for document ingestion, chunking, and index management — it handles the full lifecycle from raw document to indexed vector store more elegantly than LangChain. Its VectorStoreIndex integrates natively with Qdrant, pgvector, and Weaviate. For RAG systems where the primary complexity is in the document pipeline rather than the query routing logic, LlamaIndex is often the cleaner choice.
Custom orchestration
For production systems at scale, we often end up stripping framework abstractions and writing a thin custom orchestration layer. Frameworks add convenience in development but introduce latency overhead and debugging complexity in production. A custom async Python service using httpx for async calls to vLLM and Qdrant, with structlog for structured logging and OpenTelemetry for tracing, is often more maintainable long-term than a deep LangChain dependency tree. The key is building your own abstraction boundary so you can swap components without rewriting business logic.
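A condensed sketch of that pattern (the service hostnames, collection name, model name, and the payload key "text" are assumptions specific to this example):

```python
import httpx  # pip install httpx

VLLM_URL = "http://vllm:8000/v1/chat/completions"                  # placeholder host
QDRANT_URL = "http://qdrant:6333/collections/docs/points/search"   # placeholder host

async def answer(question: str, query_vec: list[float]) -> str:
    async with httpx.AsyncClient(timeout=30.0) as http:
        # First stage: ANN search through Qdrant's REST API.
        search = await http.post(
            QDRANT_URL,
            json={"vector": query_vec, "limit": 5, "with_payload": True},
        )
        # Assumes chunk text was indexed under payload key "text".
        chunks = [p["payload"]["text"] for p in search.json()["result"]]

        # Second stage: generation through vLLM's OpenAI-compatible endpoint.
        gen = await http.post(
            VLLM_URL,
            json={
                "model": "meta-llama/Llama-3.1-70B-Instruct",
                "messages": [{
                    "role": "user",
                    "content": "Context:\n" + "\n".join(chunks)
                               + f"\n\nQuestion: {question}",
                }],
                "temperature": 0.1,
            },
        )
        return gen.json()["choices"][0]["message"]["content"]

# Run with: asyncio.run(answer("What is the notice period?", query_vec))
```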
7. Observability
Observability in a self-hosted RAG is more complex than in an API-based setup because you now own the full stack. You cannot rely on OpenAI's dashboard for token usage, or Pinecone's console for vector search latency. You instrument everything yourself.
The minimum production instrumentation set:
- Per-request traces covering each pipeline stage: query embedding latency, vector search latency (p50/p95/p99), reranker latency (if applicable), LLM generation latency, total end-to-end latency, token counts (input context + generated), retrieved chunk scores and document IDs. A minimal tracing sketch follows this list.
- GPU utilization and memory metrics from vLLM's Prometheus endpoint. Key metrics: vllm:gpu_cache_usage_perc (KV cache pressure), vllm:num_requests_running, and generation throughput. Set alerts on KV cache usage above 90% — that is where latency degrades sharply.
- Retrieval quality metrics: track the distribution of top-1 retrieval scores over time. A downward drift in average top-1 cosine similarity often precedes faithfulness degradation by days or weeks.
- LLM-as-judge eval on production samples: weekly sampling of 50 real queries through your RAGAS or DeepEval pipeline. This is the one signal that cannot be replaced by infrastructure metrics.
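A minimal OpenTelemetry sketch for the per-request stage traces in the first bullet (search and generate are stubs standing in for your retrieval and inference calls; the attribute names are a suggested convention, not a standard):

```python
from opentelemetry import trace  # pip install opentelemetry-api

tracer = trace.get_tracer("rag.pipeline")

def search(question: str):  # stub: your Qdrant call goes here
    return ["Notice period is 30 days."], [0.82]

def generate(question: str, chunks):  # stub: your vLLM call goes here
    return "30 days.", {"prompt_tokens": 900, "completion_tokens": 40}

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request"):
        with tracer.start_as_current_span("rag.vector_search") as span:
            chunks, scores = search(question)
            span.set_attribute("retrieval.top1_score", scores[0])
            span.set_attribute("retrieval.k", len(chunks))
        with tracer.start_as_current_span("rag.generation") as span:
            text, usage = generate(question, chunks)
            span.set_attribute("llm.input_tokens", usage["prompt_tokens"])
            span.set_attribute("llm.output_tokens", usage["completion_tokens"])
        return text
```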
Langfuse (open-source, self-hostable) is currently the best option for RAG-specific observability including cost tracking, trace visualization, and eval integration. Phoenix (Arize, open-source) is a strong alternative with better out-of-the-box embedding visualization. Both export to OpenTelemetry-compatible backends if you want to consolidate into an existing observability stack.
8. Cost model and break-even
This is the section people get wrong most often — usually because they compare the wrong things. You should not compare self-hosted GPU cost to API token cost in isolation. You need to account for: GPU reserved vs spot pricing, amortized engineering overhead, ops burden, and the fully-loaded cost of model evaluation and maintenance. The table below uses conservative estimates.
| Monthly token volume | GPT-4o API cost | Self-hosted cost (Llama 3 70B, 2x H100 reserved) | Delta |
|---|---|---|---|
| 10M tokens | ~$25 | ~$1,800 (GPU + ops) | API cheaper by ~72x |
| 100M tokens | ~$250 | ~$1,800 | API cheaper by ~7x |
| 500M tokens | ~$1,250 | ~$1,800 | API cheaper by ~1.4x |
| 1B tokens | ~$2,500 | ~$2,500 (extra capacity for peak load) | Break-even |
| 5B tokens | ~$12,500 | ~$3,500 | Self-hosted cheaper by ~3.5x |
Assumptions: GPT-4o at $2.50/1M input + $10/1M output. RAG traffic is heavily input-dominated (long retrieved context, short answers), so the blended rate lands near the input price, ~$2.50/1M. Self-hosted: 2x H100 80GB reserved at ~$2/hour for the pair (~$1,500/month), plus $300/month ops overhead (vector DB, object storage, monitoring). GPU costs vary significantly by cloud provider and region.
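The break-even arithmetic under those assumptions, as a quick check you can rerun with your own prices:

```python
# Quick check of the table's break-even under its stated assumptions.
api_rate = 2.50 / 1_000_000     # $/token, input-dominated GPT-4o blend
base_monthly = 1_800.0          # 2x H100 reserved + ops, $/month
peak_monthly = 2_500.0          # with extra capacity for peak load

print(f"compute break-even: {base_monthly / api_rate / 1e6:.0f}M tokens/month")  # ~720M
print(f"with peak capacity: {peak_monthly / api_rate / 1e6:.0f}M tokens/month")  # ~1000M
```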
A few things this table does not capture:
- Engineering time to set up and maintain the self-hosted stack: budget 2–4 weeks of initial setup and 1–2 days per month of ongoing maintenance. That is a real cost.
- Model quality gap: if self-hosted Llama 3 70B has 5 points lower faithfulness than GPT-4o on your workload, that gap has a dollar value in user-facing errors. Measure it on your actual corpus before assuming parity.
- The embedding cost is separate. BGE-M3 on a single A10G (~$0.60/hr) handles embedding comfortably up to high volume. OpenAI text-embedding-3-small at $0.02/1M tokens is so cheap that self-hosting embeddings is rarely cost-motivated — it is motivated by data residency, not cost.
The practical implication: if your primary driver is cost, self-hosting makes sense at 500M+ tokens per month, decisively at 1B+. If your primary driver is compliance or IP control, the break-even analysis is secondary — you are paying for removal of a constraint, not for cheaper tokens.
9. Trade-offs and failure modes
Self-hosting is not free of failure. Beyond the standard RAG failure modes documented in our production RAG article, there is a set of failure modes specific to self-hosted deployments.
GPU memory fragmentation under load
vLLM's PagedAttention is designed to minimize this, but you can still hit KV cache exhaustion under high concurrent load if your context windows are large. Tuning gpu_memory_utilization (default 0.9 — lower this to 0.85 if you see OOM errors) and max_num_batched_tokens (aligned to your hardware and typical context window size) resolves most of these issues. Monitor vllm:gpu_cache_usage_perc continuously.
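A small sketch of watching that metric directly from vLLM's /metrics endpoint (the URL is a placeholder, and metric names have shifted between vLLM releases, so confirm the name against your version):

```python
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # placeholder host/port

def kv_cache_usage() -> float:
    """Parse vllm:gpu_cache_usage_perc (a 0-1 fraction) from Prometheus text."""
    body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    match = re.search(
        r"^vllm:gpu_cache_usage_perc(?:\{[^}]*\})?\s+([0-9.eE+-]+)", body, re.M
    )
    return float(match.group(1)) if match else 0.0

usage = kv_cache_usage()
if usage > 0.90:
    print(f"ALERT: KV cache at {usage:.0%}; latency will degrade, shed load or scale out")
```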
Model versioning and rollback
API providers handle model versioning for you. Self-hosted means you own this. Pin your model weights to a specific revision in your model registry (use Hugging Face Hub revision hashes or your own S3-based registry). When you upgrade the base model, run your full eval suite before promoting to production. Define a rollback procedure — a simple shell script that restarts vLLM with the previous model path — and test it before you need it.
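A sketch of revision pinning with huggingface_hub (the revision hash and local path are placeholders, not real values):

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Pin weights to an exact commit hash so "the model" is reproducible.
model_path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-70B-Instruct",
    revision="a1b2c3d",                        # hash from your eval sign-off
    local_dir="/models/llama-3.1-70b/a1b2c3d", # placeholder path
)
# Point vLLM at model_path; rollback = restart vLLM with the previous pinned dir.
```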
Embedding-generator mismatch after updates
If you update your embedding model and re-index without updating the generator's system prompt or retrieval configuration, or conversely update the generator without re-verifying retrieval quality, you will see silent quality degradation. Treat the embedding model and the vector index as a coupled versioned artifact. When either changes, re-run your full retrieval eval set before promoting.
The document parsing bottleneck
The weakest link in most self-hosted RAG deployments is not the LLM or the vector store — it is document parsing. PDF extraction quality determines the ceiling of your retrieval quality. Complex PDFs with multi-column layouts, embedded tables, and scanned pages require dedicated parsing infrastructure. Docling (IBM, Apache 2.0) and Unstructured (open-source tier) are the current best options for production-grade PDF parsing. Allocate real engineering time here — under-investing in parsing quality while over-engineering inference is the single most common mistake we see on new self-hosted RAG projects.
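A minimal Docling conversion sketch (the file path is a placeholder):

```python
from docling.document_converter import DocumentConverter  # pip install docling

converter = DocumentConverter()
result = converter.convert("contracts/msa_2025.pdf")

# Markdown export preserves heading and table structure, which downstream
# chunkers can split on instead of guessing boundaries in raw text.
markdown = result.document.export_to_markdown()
```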
Ops burden is not zero
GPU servers require monitoring, patching, and capacity planning. Model updates require evaluation pipelines. Vector indexes require maintenance (re-indexing, index optimization). This is somewhere between 0.5 and 1 FTE of ongoing infrastructure work, depending on the complexity of your deployment. If you do not have that capacity, a managed deployment on a compliant cloud provider — using open-weight models on AWS SageMaker, Azure ML, or OVHcloud AI Deploy — is a valid middle ground that addresses data residency without the full ops burden of bare-metal self-hosting.
10. Reference architecture
The following table summarizes the component choices for a production self-hosted RAG system across three scale tiers. These are starting points, not prescriptions — benchmark against your own corpus and query distribution before committing.
| Component | Small (internal team, <50 users) | Medium (product feature, 50–500 users) | Large (>500 users, >500M tokens/month) |
|---|---|---|---|
| LLM | Mistral Small 3 (24B) | Llama 3 70B fp8 | Llama 3 70B fp8 (multi-node) or DeepSeek-V3 for reasoning tasks |
| GPU | 1x L40S (48GB) | 2x H100 80GB | 4x H100 80GB per node, autoscaled |
| Inference engine | vLLM or TGI | vLLM | vLLM with tensor parallelism + load balancer |
| Embedding model | BGE-M3 (CPU batch) | BGE-M3 (A10G) | BGE-M3 or E5-Mistral-7B (dedicated GPU) |
| Reranker | Optional — BGE-Reranker-v2-M3 | BGE-Reranker-v2-M3 async | BGE-Reranker-v2-M3 batched async |
| Vector DB | pgvector (if PostgreSQL exists) or Qdrant single node | Qdrant single node | Qdrant cluster (Kubernetes StatefulSet) |
| Orchestration | LlamaIndex | LangGraph or custom async Python | Custom async Python + LangGraph for complex flows |
| Observability | Langfuse (self-hosted) | Langfuse + Prometheus/Grafana | Langfuse + Prometheus + OpenTelemetry + weekly eval pipeline |
| Infrastructure | Single VM or bare-metal | Docker Compose or lightweight K8s | Kubernetes with GPU node pools, HPA |
| Approx. infra cost/month | $600–900 | $2,000–3,000 | $6,000–15,000+ |
A note on Kubernetes: for the small tier, Kubernetes adds more operational overhead than value. Docker Compose or a simple systemd-managed vLLM process is easier to maintain and debug. Only move to Kubernetes when you need autoscaling, multi-node GPU scheduling, or you already run a K8s cluster for other workloads.
Evaluating a self-hosted RAG deployment?
We help engineering teams design, audit, and deploy production RAG systems — API-based or fully self-hosted. We do not recommend self-hosting unless the numbers and constraints justify it.
Frequently asked questions
At what token volume does self-hosting become cheaper than an API?

For GPT-4o at $2.50/1M input tokens, self-hosting a Llama 3 70B stack (2x H100 reserved at ~$2/hour for the pair, vLLM serving ~800 tokens/second) becomes cheaper at roughly 700M tokens per month in raw compute terms. Including engineering overhead, the real break-even is closer to 1B tokens/month. Below those volumes, compliance and control are the real reasons to self-host, not cost savings.
Which open-weight LLM should you run for production RAG?

Llama 3 70B (fp8 via vLLM) is the most battle-tested default: strong instruction following, 128K context window, runs on 2x H100. Mistral Small 3 (24B) is the single-GPU pragmatist for moderate-volume workloads and strong European-language performance. Qwen 2.5 72B outperforms on multilingual and coding-heavy corpora. Avoid 8B models for complex multi-document reasoning — the hallucination rate increase is measurable and user-facing.
Which vector database should you self-host?

Qdrant is the production default for new self-hosted deployments: Rust-native, supports HNSW with quantization, native hybrid search, scales to 100M+ vectors. pgvector is right when you already run PostgreSQL and stay under ~5M vectors — zero new infrastructure, transactional consistency with metadata. Weaviate adds value for multi-tenant or multi-modal filtering. Avoid Chroma in production — it lacks the durability guarantees of the other three.
Does GDPR or HIPAA require self-hosting?

Neither GDPR nor HIPAA mandates on-premise hosting per se, but both impose constraints that self-hosting makes structurally simpler to meet. Under GDPR, transferring personal data to US-based processors requires an adequacy decision or SCCs — and the legal basis remains challenged. Under HIPAA, commercial LLM APIs raise questions about BAA scope and training data use. Self-hosting removes these dependencies entirely. For HIPAA-covered entities processing PHI through an LLM, on-premise or private cloud deployment is typically the lowest-risk path.
Which inference engine should you use?

vLLM is the production default. It implements PagedAttention for high-throughput batched inference, supports fp8 and int4 quantization, and exposes an OpenAI-compatible API. TGI (Hugging Face) is a solid alternative for teams already in the HF ecosystem. Ollama is fine for local development and very low traffic, but has no batching for concurrent users. Never use the raw Hugging Face pipeline in production — no request batching, no KV cache management.
What are the most common self-hosted RAG failure modes?

Three failure modes appear most often: (1) GPU KV cache exhaustion under concurrent load — tune vLLM's gpu_memory_utilization and monitor the cache_usage_perc metric; (2) model versioning without a rollback strategy — pin weights to a specific registry revision and test rollback before you need it; (3) embedding-generator mismatch after updates — if you update your embedding model without re-indexing, retrieval quality degrades silently. Treat the embedding model and vector index as a coupled versioned artifact.
Further reading
- RAG: A Technical Guide — How RAG works end-to-end, chunking strategies, vector stores, and when to use RAG vs. fine-tuning.
- Production RAG: 5 failure modes we keep seeing — Retrieval-generation mismatch, eval gaps, latency and cost issues. Applies to self-hosted and API-based RAG alike.
- Vector database comparison — Pinecone vs Qdrant vs Weaviate vs pgvector. Includes managed cloud options and HNSW tuning guidance.
- Embedding models in 2026 — Full guide to embedding model selection, MTEB benchmarks, Matryoshka dimensions, and when fine-tuning pays off.
- Hybrid search and reranking — Dense + sparse retrieval and cross-encoder reranking. The highest-ROI retrieval improvement for most production systems.
- LoRA and QLoRA fine-tuning guide — When and how to fine-tune open-weight models. Relevant when self-hosted RAG quality on specialized corpora needs domain adaptation.
- Deploying LLMs to production — Infrastructure guide covering vLLM, TGI, quantization, autoscaling, and cost modeling in depth.
- Optimize a RAG system: 5 levers — What actually moves recall and faithfulness once your self-hosted stack is up.
- RAG project costs and TCO — Breakdown of capex vs opex when comparing managed vs self-hosted RAG.
- 3 enterprise RAG use cases with measured ROI — Concrete deployment patterns and the numbers behind them.
- Our RAG systems service — Tensoria's end-to-end RAG deployment service, from architecture design to production rollout and eval infrastructure.
- vLLM documentation — Reference documentation for the inference engine used throughout this guide.
- Qdrant documentation — Deployment, HNSW configuration, hybrid search, and quantization guides.
- BGE-M3 on Hugging Face — Model card and usage documentation for the recommended self-hosted embedding model.