The embedding model is the most underexamined component in most RAG stacks. Teams spend weeks debating chunking strategies and reranker choices, then plug in text-embedding-ada-002 because it was in the tutorial. In 2026, that decision deserves more scrutiny. The landscape has shifted significantly: open-source models now match or exceed closed-source performance on retrieval tasks, Matryoshka training has made dimension trade-offs practical, and BGE-M3 has changed what "multilingual" actually means in production.
This is an engineering guide, not a benchmark summary. I will give you the mental model to evaluate models for your specific use case, a clear view of where the MTEB leaderboard misleads, and a practical fine-tuning recipe for the cases where a generic model genuinely is not good enough. If you are new to how embeddings fit into a retrieval pipeline, start with the RAG technical guide first. If you already have a system in production and are trying to diagnose why retrieval is underperforming, the production RAG failure modes article is the companion read.
My opinionated starting position: most teams should start with OpenAI text-embedding-3-small or BGE-large-en-v1.5, measure retrieval quality on a domain-specific eval set, and only invest in fine-tuning when the gap is measurable and the business case justifies the overhead. Everything below is the reasoning and evidence behind that position.
How to read MTEB without getting fooled
The MTEB leaderboard is the most cited benchmark in the embeddings space. It is also the most misread. Before you pick a model based on its MTEB headline score, understand what that number actually represents.
MTEB covers 56 datasets across 7 task categories: retrieval, clustering, classification, pair classification, reranking, semantic textual similarity, and summarization. The headline score is an unweighted average across all of these. If you are building a RAG system for document retrieval — which is the case for 90% of teams reading this article — you care about one category: Retrieval, which covers 15 datasets using nDCG@10 as the metric.
The practical implication: a model with a headline MTEB of 65 but a retrieval sub-score of 55 will lose badly to a model with a headline of 62 and a retrieval sub-score of 60. Always filter to the retrieval sub-leaderboard when comparing models for semantic search and RAG.
Three other traps to avoid:
- Eval contamination. MTEB datasets are public. Models trained after 2023 may have seen BEIR corpora (which back MTEB's retrieval tasks) during pretraining or post-training. The leaderboard does not flag this. A model scoring 72 on MS MARCO passage retrieval may have seen that corpus during instruction tuning. In practice, test every model you are seriously considering against a held-out slice of your own corpus before committing.
- Domain mismatch. MTEB's retrieval suite is dominated by general web-corpus data: MS MARCO, TREC, NFCorpus. If your knowledge base is legal contracts, scientific PDFs, or internal SaaS documentation, MTEB retrieval scores are a directional signal at best. The model that leads MTEB may rank third on your actual data.
- Sequence length cliff. Many MTEB retrieval datasets use short passages (under 256 tokens). Models with limited context windows or degraded quality beyond 512 tokens will score well on MTEB but fail on long-document retrieval. Check the maximum sequence length and verify performance doesn't degrade on your actual document length distribution.
Lesson learned
On a legal contract retrieval system we built for a mid-size law firm, the MTEB leaderboard top-3 models ranked 5th, 7th, and 2nd on our in-domain eval set of 200 queries. The model that won on our data — BGE-large-en-v1.5 — was ranked 11th on the MTEB leaderboard at the time. MTEB is a useful prior, not a decision oracle.
Closed-source vs open-source trade-offs
The closed vs open decision in 2026 is not primarily a performance decision — it is an operational and compliance decision. Open-source models now match closed-source on retrieval quality across most domains. The choice comes down to five factors.
Data residency and compliance. If you are building for a regulated industry (healthcare, finance, legal in the EU) or handling proprietary internal documents, sending document chunks to OpenAI or Cohere API endpoints may violate data residency requirements or internal security policy. Open-source self-hosted models eliminate this problem entirely. This is the single most common reason we recommend BGE over OpenAI in enterprise settings.
Operational overhead. Serving an embedding model in production requires GPU infrastructure, autoscaling, monitoring, and model version management. A team of two engineers should not be managing GPU autoscaling when they can call an API for $0.02/1M tokens. At that price, raw token cost alone almost never makes self-hosting cheaper: even against higher-priced providers ($0.10–0.13/1M), the pure-cost crossover only arrives in the multi-billion-tokens-per-month range, and it assumes you already have MLOps infrastructure. In practice the case for self-hosting rests on latency, rate limits, fine-tuning, or compliance; the cost analysis section at the end of this article walks through the arithmetic.
Latency. A well-provisioned self-hosted BGE-large on a T4 GPU handles 200–400 requests per second with 10–20ms latency per request. The OpenAI API at default concurrency is roughly 50–150ms per request depending on batch size and load. For real-time search serving at <100ms end-to-end, self-hosting has a latency advantage once you account for network round-trips.
Fine-tunability. You cannot fine-tune OpenAI's embedding models. Voyage AI and Cohere offer custom fine-tuning as an enterprise service, which costs more and has limited iteration speed. Open-source models are fully fine-tunable with a few hundred labeled examples using sentence-transformers — and if you want to adapt the generator at the same time, the engineering side of LoRA and QLoRA applies directly.
Model lock-in and embedding stability. OpenAI deprecated text-embedding-ada-002 with a migration window. If your vector store contains 50M embeddings, model deprecation is not a minor inconvenience. Plan for embedding version lifecycle from day one — either pin to a self-hosted open-source model or design your re-indexing pipeline before you need it.
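One lightweight mitigation for the lifecycle problem: stamp every vector with the model name and version at write time, so a future migration can re-embed incrementally and verifiably. A minimal sketch, assuming Qdrant as the store (any vector database with payload metadata works the same way); the collection name and version tag are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

EMBED_MODEL = "BAAI/bge-large-en-v1.5"  # pin the exact model id
EMBED_VERSION = "2026-01-v1"            # hypothetical internal version tag

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

chunk_text = "Third-party liability coverage is capped at EUR 1.5M per incident."
embedding = [0.0] * 1024  # stand-in; in practice, the model output for chunk_text

client.upsert(
    collection_name="docs",
    points=[PointStruct(
        id=1,
        vector=embedding,
        payload={
            "text": chunk_text,
            "embed_model": EMBED_MODEL,
            # Filter on this field during a rolling re-index after a model change
            "embed_version": EMBED_VERSION,
        },
    )],
)
```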
Model recommendations by use case
English-only knowledge base (general text, documentation, support tickets). Start with text-embedding-3-small. At 1536 dimensions and $0.02/1M tokens, it is the lowest-friction option for teams that want to move fast. If you need a marginal quality improvement and budget allows, text-embedding-3-large (3072 dim, $0.13/1M) gains roughly 2–4 nDCG points on retrieval benchmarks. For self-hosted English retrieval, BGE-large-en-v1.5 (335M parameters, 1024 dim) is the default choice — competitive with text-embedding-3-small at zero ongoing token cost once deployed.
Multilingual corpus. BGE-M3 is the clearest recommendation here. It covers 100+ languages with a single model, supports 8192-token context, and uniquely produces dense, sparse, and multi-vector ColBERT embeddings in a single forward pass. For API-based multilingual, Cohere Embed v4 (with its 256-language support and input type routing) is strong, and Voyage AI voyage-3 performs well on non-English retrieval.
Code retrieval. Voyage AI's voyage-code-3 is purpose-built for mixed natural language + code queries. It handles function signatures, docstrings, and code comment retrieval significantly better than general-purpose models. If you are building a codebase assistant or documentation search over a developer-facing product, this is a better starting point than adapting a text-only model.
Scientific and academic text. NV-Embed-v2 from NVIDIA (7B parameters) has shown strong performance on scientific retrieval tasks including BEIR's SciFact and NFCorpus datasets. It requires a GPU with at least 16GB VRAM and is not a practical choice for teams without GPU infrastructure, but for dedicated scientific search systems it is worth benchmarking. E5-mistral-7B-instruct is the other 7B option worth evaluating in this category.
Hybrid retrieval (dense + sparse). BGE-M3's hybrid mode is the most practical production path here. It replaces three separate models — a dense encoder, BM25, and a reranker — with a single model whose outputs feed directly into a hybrid retrieval pipeline. The alternative is running a dense model alongside a BM25 index and fusing scores via RRF or linear interpolation, which adds infrastructure complexity without necessarily improving quality.
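To make the single-forward-pass claim concrete, here is a minimal sketch using the FlagEmbedding package that ships BGE-M3; the output field names follow its documented interface, but verify against the version you install:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

out = model.encode(
    ["What is the coverage limit for third-party liability?"],
    return_dense=True,         # 1024-dim dense vector for the ANN index
    return_sparse=True,        # token -> weight map, a BM25-like lexical signal
    return_colbert_vecs=True,  # one vector per token for late-interaction scoring
)
dense_vecs = out["dense_vecs"]
sparse_weights = out["lexical_weights"]
colbert_vecs = out["colbert_vecs"]
```

All three representations come from one encoder call, which is what makes the hybrid pipeline operationally simple.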
Model comparison table
| Model | Provider | Params | Dim | MTEB Retrieval | Max seq | Cost |
|---|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | n/a | 1536 | ~55 | 8191 | $0.02/1M |
| text-embedding-3-large | OpenAI | n/a | 3072 | ~59 | 8191 | $0.13/1M |
| embed-v4 | Cohere | n/a | 1024 | ~57 | 512 | $0.10/1M |
| voyage-3 | Voyage AI | n/a | 1024 | ~62 | 32768 | $0.06/1M |
| BGE-large-en-v1.5 | BAAI (open) | 335M | 1024 | ~55 | 512 | Self-hosted |
| BGE-M3 | BAAI (open) | 568M | 1024 | ~58 | 8192 | Self-hosted |
| E5-mistral-7B-instruct | Microsoft (open) | 7B | 4096 | ~62 | 32768 | Self-hosted (A100) |
| NV-Embed-v2 | NVIDIA (open) | 7B | 4096 | ~64 | 32768 | Self-hosted (A100) |
| Stella-en-1.5B-v5 | NoInstruct (open) | 1.5B | 8192 | ~63 | 131072 | Self-hosted (T4) |
| GTE-Qwen2-7B-instruct | Alibaba (open) | 7B | 3584 | ~63 | 32768 | Self-hosted (A100) |
MTEB Retrieval scores are approximate nDCG@10 averages on the retrieval sub-leaderboard as of Q1 2026. Verify against the live leaderboard for your specific use case. Self-hosting costs vary by GPU type and cloud region.
Dimensions, Matryoshka, and storage trade-offs
The dimensionality of your embedding vector is a direct multiplier on three production costs: storage, retrieval latency, and memory footprint for approximate nearest neighbor (ANN) indices. The relationship is roughly linear for storage and memory, and sublinear for retrieval latency: ANN algorithms like HNSW visit a number of candidates that grows with the log of index size regardless of dimension, but each distance computation still costs time linear in the dimension count.
Concrete numbers for 10M vectors (a quick sanity-check script follows the list):
- 3072 dimensions (float32): ~117 GB raw; ~30–40 GB in memory with int8 scalar quantization plus an HNSW graph at ef_construction=200 (a float32 HNSW index is larger than the raw vectors, not smaller, so the indexed figures here and below assume quantization)
- 1536 dimensions: ~58 GB raw, ~15–20 GB quantized and indexed
- 768 dimensions: ~29 GB raw, ~8–12 GB quantized and indexed
- 256 dimensions: ~10 GB raw, ~3–4 GB quantized and indexed
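The raw figures above are plain arithmetic; this quick check reproduces them (binary GB, float32 at four bytes per dimension, int8 at one, before any graph overhead):

```python
N = 10_000_000  # vectors
for dim in (3072, 1536, 768, 256):
    raw_gb = N * dim * 4 / 2**30   # float32
    int8_gb = N * dim * 1 / 2**30  # int8 scalar quantization, pre-graph
    print(f"{dim:>5} dims: {raw_gb:6.1f} GB raw float32, {int8_gb:5.1f} GB at int8")
```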
The retrieval quality curve flattens quickly after 768 dimensions for most tasks. Going from 256 to 768 gives meaningful recall gains. Going from 1536 to 3072 gives marginal gains while doubling storage. For most RAG applications, 768 or 1024 dimensions is the practical sweet spot.
Matryoshka Representation Learning changes this calculus. The original paper (Kusupati et al., NeurIPS 2022) proposed training models to encode information at multiple granularities simultaneously. Instead of training a model to produce a single fixed-size embedding, MRL applies the loss function at multiple truncation points — 64, 128, 256, 512, 768, 1536 — during training. This incentivizes the model to front-load the most semantically important information into the first dimensions of the vector.
The practical result: you can truncate a 1536-dimensional MRL embedding to 256 dimensions and retain approximately 93–95% of the retrieval quality, at one-sixth the storage and significantly faster dot-product computation. Both text-embedding-3-small and text-embedding-3-large were trained with MRL. You can pass a dimensions parameter to the OpenAI API to get any size embedding you want up to the model maximum.
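In practice that gives you two equivalent paths to a 256-dim vector; the dimensions parameter is part of OpenAI's embeddings API, and the manual route below matters when you self-host an MRL-trained model. One detail that silently breaks retrieval: after truncating yourself, you must re-normalize.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
text = ["coverage limit for third-party liability"]

# Path 1: ask the API for the truncated width directly (vectors come back unit-norm)
resp = client.embeddings.create(
    model="text-embedding-3-small", input=text, dimensions=256
)
vec_api = np.array(resp.data[0].embedding)

# Path 2: truncate the full 1536-dim vector yourself, then re-normalize;
# skipping the re-normalization makes cosine and dot-product scores wrong
resp_full = client.embeddings.create(model="text-embedding-3-small", input=text)
vec_manual = np.array(resp_full.data[0].embedding)[:256]
vec_manual /= np.linalg.norm(vec_manual)
```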
The recommended two-stage retrieval pattern leveraging MRL: use 256-dim truncated vectors for the initial ANN pass across 100M documents, then re-score the top-500 candidates at full 1536 dimensions (against stored full-width vectors, or by re-embedding if you only keep the truncated ones) before reranking. This reduces ANN index memory by 6x while preserving final answer quality. We cover the full reranking architecture in the upcoming hybrid search and reranking guide.
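A sketch of that two-stage pattern, with a numpy brute-force scan standing in for the real ANN index and unit-normalized vectors assumed at both widths:

```python
import numpy as np

def truncate(v: np.ndarray, d: int) -> np.ndarray:
    v = v[..., :d]
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def two_stage_search(query, corpus, coarse_dim=256, coarse_k=500, final_k=10):
    # Stage 1: cheap pass over truncated vectors (what the ANN index would store)
    coarse_scores = truncate(corpus, coarse_dim) @ truncate(query, coarse_dim)
    candidates = np.argsort(-coarse_scores)[:coarse_k]
    # Stage 2: exact rescoring of the candidate set at full width
    fine_scores = corpus[candidates] @ query
    return candidates[np.argsort(-fine_scores)[:final_k]]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 1536)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = truncate(corpus[42] + 0.1 * rng.standard_normal(1536).astype(np.float32), 1536)
print(two_stage_search(query, corpus))  # document 42 should surface at the top
```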
Lesson learned
We migrated a customer from 1536-dim OpenAI embeddings to 256-dim MRL-truncated embeddings in Qdrant. Their 40M-vector index shrank from 23 GB to 4 GB, query latency fell from 28ms to 11ms median, and retrieval nDCG@10 dropped by 0.8 points on their eval set — a trade-off they accepted immediately given the infrastructure savings. The key was validating the quality drop on their data before committing to the migration.
Multi-vector embeddings: ColBERT and PLAID
All the models discussed so far produce a single vector per text chunk — a dense embedding that compresses the entire semantic content into one fixed-size representation. This is the dominant paradigm, but it has a structural limitation: compressing a 512-token document into a 1024-dimensional vector necessarily loses token-level information. A query about a specific named entity, a precise date, or an exact product code may fail to retrieve the correct document if that signal is diluted in the dense representation.
ColBERT (Contextualized Late Interaction over BERT), introduced by Khattab and Zaharia, addresses this by producing one embedding vector per token in the document, not one per document. At query time, the similarity between a query and a document is computed as the sum of maximum similarity scores between each query token and any document token — the MaxSim operation. This late interaction pattern preserves token-level precision while remaining more efficient than full cross-attention reranking.
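The MaxSim operation itself is a few lines of numpy; a minimal sketch assuming unit-normalized token vectors on both sides:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # sim[i, j] = similarity of query token i with document token j
    sim = query_vecs @ doc_vecs.T  # (n_query_tokens, n_doc_tokens)
    # Each query token keeps only its best-matching document token
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((300, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))
```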
PLAID is the production serving engine for ColBERTv2, whose centroid-based residual compression shrinks each token vector to a few dozen bytes. The index remains O(tokens) in count, but total size comes within striking distance of a single-vector index, and PLAID's centroid pruning keeps query latency competitive.
The practical question is whether the added complexity is worth it. Our assessment:
- Use ColBERT/PLAID when: your queries are precise (named entities, codes, specific terminology) and your documents are long (500+ tokens per chunk). The token-level MaxSim genuinely outperforms dense retrieval in these settings.
- Use dense + reranker instead when: your queries are semantic and exploratory ("what is our policy on X?"), your documents are short to medium length, and you want simpler infrastructure. A dense model followed by a cross-encoder reranker recovers most of ColBERT's quality advantage with a more standard stack (a minimal reranker sketch follows this list).
- BGE-M3 hybrid mode gives you a pragmatic middle path: the model produces ColBERT-style token vectors alongside dense and sparse embeddings in a single forward pass. You get the option to use multi-vector scoring without running a separate ColBERT encoder, at the cost of higher per-inference compute.
- For images, PDFs with figures, and tables: late interaction over the visual tokens of a vision-language model (ColPali, ColQwen2) is now the right default — see our multimodal RAG guide for the full architecture.
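For reference, the dense + reranker alternative mentioned above is small in code terms. A minimal sketch using sentence-transformers' CrossEncoder; the checkpoint name is a common public reranker, shown as an example rather than a recommendation:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the coverage limit for third-party liability?"
candidates = [  # top-k passages from the first-stage dense retriever
    "Third-party liability coverage is capped at EUR 1.5M per incident.",
    "Liability claims must be filed within 30 days of the incident date.",
    "Our office is closed on public holidays.",
]
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
```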
For the vector database side of this architecture, see our forthcoming vector database comparison, which covers which databases support multi-vector indexing natively (Weaviate, Qdrant, and Vespa do; Pinecone and Chroma do not as of this writing).
When domain fine-tuning pays off
Fine-tuning an embedding model is not the answer to every retrieval problem. It is a meaningful investment — labeled data collection, training infrastructure, model evaluation, deployment — and it only pays off in specific conditions. I have watched teams spend three weeks fine-tuning when a better chunking strategy or a cross-encoder reranker would have solved the same problem in two days.
Fine-tuning is worth it when at least one of these is true:
- Domain vocabulary gap. Your corpus uses terminology that appears rarely or never in generic pretraining data: proprietary product names, internal project codes, legal citation formats (e.g., "Article 17 of Regulation (EU) 2016/679"), ICD-10 medical codes, or highly specialized scientific nomenclature. Generic models cannot learn good representations for tokens they barely saw during pretraining. Fine-tuning teaches the model to cluster these domain-specific tokens correctly.
- Multilingual code-mixing. Your users write queries mixing two languages — French technical questions with English product names, Arabic queries with English keywords in a bilingual SaaS context. Standard multilingual models are trained on clean single-language text and handle code-mixed queries poorly. A fine-tuned model with domain-specific bilingual pairs closes this gap meaningfully.
- Measurable retrieval gap that reranking doesn't fix. You have a domain-specific eval set, your retrieval nDCG@10 is below 0.70, you have already tried adding a cross-encoder reranker, and the gap persists. At this point the dense encoding is the bottleneck and fine-tuning is the next lever.
Fine-tuning is probably not worth it when:
- Your retrieval nDCG@10 on domain data is already above 0.80 — incremental gains from fine-tuning will be small.
- You do not have a domain-specific eval set. You cannot measure whether fine-tuning helped, which means you cannot know when to stop or whether you've overfit. See building custom LLM judges for how to construct one that correlates with downstream RAG quality.
- Your corpus is general business text (emails, meeting notes, standard documentation) well-covered by generic pretraining.
- You have fewer than 500 labeled query-document pairs. Below this threshold, overfitting risk is high and gains are unreliable.
Lesson learned
A client in the insurance sector pushed us to fine-tune before we had an eval set. We resisted, built the eval set first (150 query/document pairs from real broker conversations), and discovered that BGE-large-en-v1.5 out-of-the-box achieved 0.76 nDCG@10 on their data — well above the threshold where fine-tuning yields reliable gains. We shipped without fine-tuning and saved three weeks of work. The eval set was the investment that paid off, not the fine-tuning.
Practical fine-tuning recipe
When the conditions above are met and fine-tuning is the right call, here is the approach we use. The full stack: sentence-transformers 3.x, PyTorch 2.x, starting from BGE-large-en-v1.5 as the base model. This handles most English-domain fine-tuning cases.
Data format. You need query-positive-hard_negative triplets. At minimum 500, ideally 2000–10000 for production-grade quality. The query is a real user query from your domain. The positive is the correct document passage that answers it. The hard negative is a passage that is thematically similar but does not answer the query correctly — this is the most important and most often skipped step.
Hard negative mining. Hard negatives are passages that a naive model retrieves with high similarity but that are actually wrong answers. They force the model to learn fine-grained distinctions. Mine them by running your base model on the training queries, taking the top-20 retrieved passages, removing the known positives, and labeling the remaining candidates. Automatic hard negative mining with a model-in-the-loop is the standard approach: embed all queries with the base model, retrieve top-20 from your corpus, filter known positives, keep the rest as hard negatives.
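A sketch of that mining loop using sentence-transformers' semantic_search utility; the toy corpus and labels are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

corpus = [
    "Third-party liability coverage is capped at EUR 1.5M per incident.",
    "Liability claims must be filed within 30 days of the incident date.",
    "Premiums are payable annually in advance.",
]
queries = ["What is the coverage limit for third-party liability?"]
positives = {0: {0}}  # query index -> set of known-correct corpus indices

corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)

hard_negatives = {}
for qi, hits in enumerate(util.semantic_search(query_emb, corpus_emb, top_k=20)):
    # High-similarity retrievals that are not known positives are the candidates;
    # spot-check or label them before they go into training triplets
    hard_negatives[qi] = [
        h["corpus_id"] for h in hits if h["corpus_id"] not in positives[qi]
    ]
```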
Loss function. Use MultipleNegativesRankingLoss (MNRL) with in-batch negatives. Each positive in the training batch serves as a negative for all other queries in the batch. With a batch size of 64, each training step gives each query 63 in-batch negatives plus any explicit hard negatives you mined. MNRL is the standard for bi-encoder fine-tuning because it is efficient (no separate negative sampling step per batch) and effective.
For Matryoshka-aware fine-tuning, wrap MNRL in MatryoshkaLoss with your target dimensions list. This trains the model to produce high-quality embeddings at every truncation point simultaneously.
```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
from datasets import Dataset

# Load base model
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Define loss: MRL wrapping MNRL for Matryoshka-aware training
inner_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model,
    inner_loss,
    matryoshka_dims=[64, 128, 256, 512, 1024],
)

# Training data: list of {"anchor": ..., "positive": ..., "negative": ...}
# Hard negatives mined offline with the base model
dataset = Dataset.from_list([
    {
        "anchor": "What is the coverage limit for third-party liability?",
        "positive": "Third-party liability coverage is capped at EUR 1.5M per incident under policy section 4.2.",
        "negative": "Liability claims must be filed within 30 days of the incident date per standard policy terms.",
    },
    # ... 2000+ more triplets
])
# Hold out a slice for the epoch-level eval that eval_strategy and
# load_best_model_at_end below require
splits = dataset.train_test_split(test_size=0.1, seed=42)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-large-insurance-domain",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="epoch",  # named evaluation_strategy on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    loss=train_loss,
)
trainer.train()
model.save_pretrained("bge-large-insurance-domain-final")
```
A few points on this recipe worth highlighting:
- Learning rate. 2e-5 is conservative and safe for bi-encoder fine-tuning starting from a strong base model. Going higher risks catastrophic forgetting of the model's general semantic representations. Going lower slows convergence without quality benefits.
- Batch size. Larger is better for MNRL because it increases the number of in-batch negatives. 64 is the practical minimum; 128–256 gives better results if your GPU memory allows. Note that gradient accumulation does not help here: in-batch negatives come from the true per-device batch, so if memory is the constraint, use a larger GPU or sentence-transformers' CachedMultipleNegativesRankingLoss, which trades extra compute for an effectively larger batch.
- Evaluation. Always evaluate on a held-out domain eval set using Information Retrieval metrics (nDCG@10, recall@5), not just the training loss. Training loss convergence is not the same as retrieval quality improvement. A minimal evaluator sketch follows this list.
- Early stopping. With 2000 triplets and a 335M parameter model, overfitting can occur by epoch 4–5. Watch your eval nDCG — stop when it plateaus, not when training loss plateaus.
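Here is what that evaluation step can look like with sentence-transformers' built-in InformationRetrievalEvaluator; the ids and texts are illustrative, and during training you would pass the evaluator to SentenceTransformerTrainer rather than calling it manually:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

queries = {"q1": "What is the coverage limit for third-party liability?"}
corpus = {
    "d1": "Third-party liability coverage is capped at EUR 1.5M per incident.",
    "d2": "Liability claims must be filed within 30 days of the incident date.",
}
relevant_docs = {"q1": {"d1"}}  # gold labels from your domain eval set

evaluator = InformationRetrievalEvaluator(
    queries,
    corpus,
    relevant_docs,
    ndcg_at_k=[10],
    accuracy_at_k=[5],
    name="insurance-eval",
)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
print(evaluator(model))  # dict of IR metrics; track nDCG@10 across epochs
```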
Tip
Use LLM-generated synthetic queries to bootstrap your training set if you lack labeled data. Feed each document chunk to an LLM with the prompt "Generate 3 realistic questions a user might ask whose answer is in the following passage:" and use the outputs as training queries paired with the source chunk as the positive. This approach can generate 5000+ training pairs from 500 documents in a few hours and $10–30 in API costs. The quality is good enough to start fine-tuning; real query data should be added as it accumulates in production.
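A minimal version of that bootstrap loop, assuming the OpenAI chat API; the model name and prompt wording are illustrative and worth tuning on a sample before running the full corpus:

```python
from openai import OpenAI

client = OpenAI()

def synthetic_queries(chunk: str, n: int = 3) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable chat model works
        messages=[{
            "role": "user",
            "content": (
                f"Generate {n} realistic questions a user might ask whose answer "
                f"is in the following passage. One question per line, no numbering.\n\n"
                f"{chunk}"
            ),
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

chunk = (
    "Third-party liability coverage is capped at EUR 1.5M per incident "
    "under policy section 4.2."
)
pairs = [(query, chunk) for query in synthetic_queries(chunk)]  # (query, positive)
```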
Cost analysis: API vs self-hosted
Cost comparisons in this space are often done poorly — they compare raw token pricing without accounting for infrastructure overhead, throughput requirements, or operational maturity. Here is the analysis we run for clients.
Scenario 1: Early-stage RAG system, 10M tokens/month. This is a team of 2–3 engineers building an internal knowledge assistant. At $0.02/1M tokens, OpenAI text-embedding-3-small costs $0.20/month. Self-hosting BGE-large on a T4 (GCP n1-standard-4 + NVIDIA T4, preemptible) costs approximately $130–160/month including compute, with a one-time setup cost of 1–2 days of engineering time plus ongoing maintenance. The API wins at this scale, easily. Any team at this stage that chooses to self-host is optimizing for the wrong variable — unless they have a hard compliance constraint, in which case our self-hosted RAG architecture guide covers the full stack.
Scenario 2: Growth-stage system, 500M tokens/month. At $0.02/1M, OpenAI text-embedding-3-small costs about $10/month even at this volume (text-embedding-3-large about $65/month), so raw token price alone will not push you off the API. A two-T4 autoscaling deployment handles 500M tokens/month comfortably with appropriate batching (the T4 achieves roughly 2,000 tokens/second at batch size 32 for BGE-large) and runs approximately $600–900/month at sustained usage. Self-hosting becomes rational at this scale when you need a fine-tuned model, tight query latency, or freedom from API rate limits during bulk re-indexing, not because of the token bill, and it assumes you have the DevOps bandwidth to manage it.
Scenario 3: Enterprise, compliance-constrained, 2B tokens/month. The API bill would still be modest at list prices (about $40/month for text-embedding-3-small), but pricing is beside the point: data residency requirements rule out external endpoints entirely. A dedicated A100 cluster or on-premise GPU deployment is the only compliant option, and at this scale the GPU infrastructure, not the token pricing, is the line item that justifies full MLOps investment.
One cost that is almost always underestimated: re-indexing. Every time you change your embedding model — because you upgraded, deprecated, or fine-tuned — you need to re-embed your entire corpus. For 50M documents at 512 tokens average, that is roughly 25B tokens. At $0.02/1M tokens: about $500. On a single T4 at 2,000 tokens/second: roughly 3,500 hours of compute, which is why large re-index jobs get parallelized across a fleet or run on bigger GPUs. Plan this into your architecture decision. If your corpus will grow to the point where re-indexing is painful, either commit to a self-hosted model you control or design your pipeline with a re-indexing budget from the start.
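The re-indexing arithmetic is worth keeping as a living calculator rather than a one-off estimate; the throughput figure below is the single-T4 number used throughout this article:

```python
docs = 50_000_000
avg_tokens = 512
total_tokens = docs * avg_tokens  # 25.6B tokens

api_price_per_1m = 0.02  # text-embedding-3-small
print(f"API re-embed cost: ${total_tokens / 1e6 * api_price_per_1m:,.0f}")  # ~$500

gpu_tokens_per_sec = 2_000  # single T4, batch size 32, BGE-large
hours = total_tokens / gpu_tokens_per_sec / 3600
print(f"Single-T4 compute: {hours:,.0f} hours (~{hours / 24:,.0f} days)")
```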
For the full end-to-end production architecture these decisions feed into — vector store selection, indexing pipelines, serving infrastructure — see our RAG systems service page.
Further reading
- RAG: A Technical Guide — how retrieval pipelines are structured, where embeddings fit in the architecture, and when RAG beats fine-tuning.
- Production RAG: 5 Failure Modes — the operational and evaluation issues that cause RAG systems to underperform in production, including retrieval quality measurement.
- Agentic RAG — when you need to go beyond a single-shot retrieval pass, and how agents orchestrate multi-step retrieval strategies.
- Hybrid search and reranking (forthcoming) — the full architecture for combining dense retrieval, sparse BM25, and cross-encoder reranking, with latency budgets and quality trade-offs.
- Vector database comparison (forthcoming) — Qdrant vs Weaviate vs Pinecone vs Chroma, with a focus on multi-vector support, filtering performance, and production operability.
- RAG systems — Tensoria's end-to-end service for designing, building, and evaluating production RAG pipelines, including embedding selection and fine-tuning.
- MTEB Leaderboard — the live benchmark, filter to the Retrieval sub-task for RAG use cases.
- Matryoshka Representation Learning (Kusupati et al., 2022) — the original MRL paper, required reading for anyone making embedding dimension decisions.
- OpenAI Embeddings documentation — official docs covering MRL truncation, batch sizing, and the dimensions parameter.
- Sentence Transformers training documentation — the definitive reference for MNRL, MatryoshkaLoss, and the full fine-tuning API.
Talk to an engineer
Building a RAG system and not sure which embedding model fits your data? We run structured evals and recommend the right stack.
The decision you actually need to make
After reading all of this, the decision tree is simpler than the volume of content suggests. Start with text-embedding-3-small or BGE-large-en-v1.5. Build a domain-specific eval set of 50–200 query-document pairs. Measure nDCG@10 on your data. If it is above 0.80, ship and monitor. If it is below 0.70 and a cross-encoder reranker doesn't close the gap, investigate fine-tuning. If you have a multilingual requirement, go to BGE-M3 first.
The embedding model is important, but it is not the highest-leverage decision in your retrieval stack. Evaluation rigor — knowing what your retrieval quality actually is on real user queries — is worth more than any model upgrade. Teams that build eval infrastructure before they optimize models make better decisions faster. Teams that optimize models before they can measure the impact spend weeks running experiments whose outcomes they cannot interpret.
If you want help designing the eval framework alongside the model selection decision, book a call. We have built enough of these systems to know where the real leverage is — and it is almost never where teams expect it to be. See our RAG systems service for what a structured engagement looks like.