Tensoria
Infrastructure By Anas R.

Vector Database Comparison: Pinecone vs Qdrant vs Weaviate vs pgvector in Production

There is no universally correct answer to "which vector database should I use." There is only the right answer given your scale, your team's operational capacity, your existing stack, and whether you need hybrid search, filtered retrieval, or multi-tenancy. What follows is an engineering-first comparison based on production deployments — not vendor benchmarks, not marketing copy. If you are building a RAG system and choosing your vector store, this is the guide you want before making that decision.

We will cover Pinecone, Qdrant, Weaviate, and pgvector in depth, with a briefer look at Milvus and Chroma. For each we examine: indexing algorithms and tuning parameters, filter performance architecture, hybrid search support, ingest throughput, query latency at different scales, replication and durability, and total cost of ownership. We will also cover the engineering trade-offs that benchmarks do not capture — ops burden, migration cost, and what happens when something breaks at 3am.

The bottom line, stated upfront: for most SMB RAG use cases under 5M vectors, pgvector is the correct default — it lives in your existing Postgres, eliminates an entire service from your ops surface, and has reached production maturity with HNSW indexing. You should reach for a dedicated vector database only when you hit its limits in a measurable way.

ANN algorithms: HNSW, IVFFlat, and ScaNN

Every vector database at its core is an approximate nearest neighbor (ANN) search engine. Understanding the three dominant index types is prerequisite knowledge for everything that follows — index choice determines recall, latency, memory footprint, and ingest cost.

HNSW (Hierarchical Navigable Small World)

HNSW, introduced by Malkov and Yashunin in their 2016 paper, builds a layered proximity graph at index time. The top layer is sparse with long-range edges; the bottom layer is dense with short-range edges. A query enters at the top, greedily descends toward the nearest region, and terminates at the bottom layer with a local neighborhood search.

The key parameters you need to understand:

  • m (default 16): the number of bidirectional edges per node at construction time. Higher m = better recall, more memory, slower inserts. m=16 is a sane default for most use cases. Push to m=32 if you're at 10M+ vectors and recall is paramount.
  • ef_construction (default 64): the size of the dynamic candidate list during graph construction. Higher values = better-quality graph = better recall, but slower index builds. ef_construction=64 is the standard starting point.
  • ef (query-time, default 64): the size of the candidate list during search. This is the primary recall/latency dial at query time — increasing ef improves recall at the cost of latency. In production, tune this parameter against your latency SLA and your measured recall on a golden set.

HNSW requires no training step, supports incremental inserts without full reindex, and delivers O(log n) query complexity. Its main cost is memory: a full in-memory HNSW graph for 1M 1536-dimensional float32 vectors weighs approximately 6–7 GB.
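
To make that figure concrete, here is a rough back-of-the-envelope estimate in Python — raw float32 vectors plus graph edge lists, assuming roughly 8 bytes per edge. It is an approximation only; real engines add per-node metadata and allocator overhead on top.

# Rough HNSW memory estimate: raw vectors + graph edge lists.
# Approximation only -- real engines add per-node metadata and allocator overhead.

def hnsw_memory_gb(num_vectors: int, dim: int, m: int = 16, bytes_per_edge: int = 8) -> float:
    vector_bytes = num_vectors * dim * 4                    # float32 components
    # Layer 0 keeps up to 2*m edges per node; upper layers add roughly m more.
    edge_bytes = num_vectors * (2 * m + m) * bytes_per_edge
    return (vector_bytes + edge_bytes) / 1024**3

print(f"1M x 1536-dim:  ~{hnsw_memory_gb(1_000_000, 1536):.1f} GB")    # ~6 GB
print(f"10M x 1536-dim: ~{hnsw_memory_gb(10_000_000, 1536):.1f} GB")   # ~61 GB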

IVFFlat (Inverted File with Flat quantization)

IVFFlat partitions the vector space into n_lists Voronoi cells using k-means clustering. This training step requires a representative sample of your data before you can build the index. At query time, the system probes n_probes cells (typically 10–30% of n_lists) and returns the best candidates from those cells.

The trade-off: IVFFlat uses significantly less memory than HNSW — it does not store a graph. For very large datasets (100M+) where HNSW's memory footprint becomes prohibitive, IVFFlat or its product-quantized variant (IVFPQ) is often the practical choice. The cost is recall degradation when n_probes is too low and the inherent brittleness of the training step: if your data distribution shifts significantly, your cluster assignments become stale.
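
To make the mechanics concrete, here is a minimal, self-contained illustration of the IVF idea using NumPy and scikit-learn's KMeans — the sizes and the n_lists/n_probes values are arbitrary, and production systems add quantization and optimized distance kernels on top.

# Minimal IVFFlat illustration: partition with k-means, probe a few cells at query time.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 128)).astype(np.float32)

n_lists, n_probes = 100, 10

# Training step: k-means over a representative sample defines the Voronoi cells.
kmeans = KMeans(n_clusters=n_lists, n_init=1, random_state=0).fit(vectors[:20_000])
assignments = kmeans.predict(vectors)
cells = {c: np.where(assignments == c)[0] for c in range(n_lists)}   # the inverted file

def ivf_search(query: np.ndarray, k: int = 10) -> np.ndarray:
    # Probe only the n_probes nearest cells, then run exact (flat) search inside them.
    centroid_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probe = np.argsort(centroid_dists)[:n_probes]
    candidates = np.concatenate([cells[c] for c in probe])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

print(ivf_search(rng.normal(size=128).astype(np.float32)))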

ScaNN and DiskANN

Google's ScaNN (used internally by Vertex AI Vector Search) relies on aggressive quantization to shrink the in-memory footprint, while DiskANN (used by Azure AI Search and Milvus) is an SSD-resident graph index that lets the index exceed RAM. They are relevant when you are operating at 100M+ vectors and cannot afford to keep a full HNSW index in memory. For most production RAG workloads below 50M vectors, HNSW in RAM is the practical choice — you do not need these.

Pinecone: managed simplicity, managed constraints

Pinecone is a fully managed, proprietary vector database. You get an API key, create an index, and start querying. There are no servers to size, no graphs to tune, and no infrastructure decisions to make. This is both its greatest strength and its most significant constraint.

What Pinecone does well

Time to first query is minutes, not days. The API is clean and well-documented. Serverless mode (introduced in 2024) scales to zero when idle, which makes it genuinely cheap for low-traffic RAG prototypes. SLA, uptime, replication, and backups are handled by Pinecone. If your team has no Kubernetes experience and you are trying to validate a product before committing to infrastructure, Pinecone removes the largest ops friction.

Pinecone also introduced namespaces, which enable basic multi-tenancy by partitioning a single index into logical tenant spaces. This works well for SaaS products with hundreds to low thousands of tenants — beyond that, namespace management becomes painful and cross-namespace queries are not supported.
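
A minimal sketch of that tenant scoping, assuming the current Pinecone Python SDK (the pinecone package) and an existing index — the index name, namespace, and the embed() helper are placeholders, not part of Pinecone's API.

# Tenant-scoped writes and reads via namespaces -- a sketch, not a drop-in implementation.
# embed() is a placeholder for your embedding model call.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("documents")                # assumes this index already exists

index.upsert(
    vectors=[{
        "id": "doc-1#chunk-0",
        "values": embed("..."),
        "metadata": {"doc_type": "contract"},
    }],
    namespace="tenant-acme-corp",
)

results = index.query(
    vector=embed("termination clause notice period"),
    top_k=10,
    namespace="tenant-acme-corp",            # queries never cross namespace boundaries
    include_metadata=True,
)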

What hurts

You cannot tune HNSW parameters. Pinecone's internal index is a black box — you cannot set m, ef_construction, or ef. This means you cannot trade memory for latency, cannot optimize for your specific vector dimensionality, and cannot investigate why recall is degrading. In our experience, recall in the 0.92–0.95 range is typical on Pinecone Serverless, which is acceptable for most RAG use cases but not for applications requiring very high recall on filtered queries.

Metadata filtering is post-filter by default. Pinecone applies your metadata filter after the ANN pass, which means highly selective filters (e.g., filtering to a specific tenant's documents in a large shared index) can return fewer results than the top_k you requested. This is the single biggest operational pain point teams hit at scale.

Cost at scale is real. Pinecone's p2 performance pod handles approximately 1M vectors at around $70/month. At 10M vectors you are looking at $700+/month before egress. Pinecone Serverless pricing is more nuanced — you pay per read unit and write unit — but at sustained high throughput it is rarely cheaper than pods.

When to pick Pinecone

Prototypes and early-stage products where ops bandwidth is zero. Teams that need production in days. Workloads under 5M vectors with moderate query load. Any situation where the engineering cost of managing a self-hosted cluster exceeds the pricing delta. If compliance, IP protection, or cost-at-scale pushes you off managed services entirely, see our self-hosted RAG architecture guide for the full open-source stack.

Lesson learned

We migrated a SaaS document Q&A product from Pinecone to Qdrant at 8M vectors. Monthly infrastructure cost dropped from $680 to $190 — a 72% reduction. What we absorbed: a two-week migration sprint, a managed Kubernetes cluster (GKE Autopilot), and ongoing responsibility for Qdrant upgrades and snapshot backups. The math worked because we already had Kubernetes expertise in-house. If we had not, the $490/month delta would have been cheaper than the engineering time.

Qdrant: the filter-first vector engine

Qdrant is an open-source vector database written in Rust. It was designed from the ground up with filtered search as a first-class concern, not an afterthought. The architecture decision that separates it from most competitors is its payload-indexed HNSW: the graph is built with awareness of your metadata (payload) fields, so filtered queries traverse the right subgraph rather than scanning and discarding post-hoc.

What Qdrant does well

Filtered search performance is genuinely best-in-class. Qdrant adds 1–2ms overhead for filtered queries on millions of vectors, where post-filtering approaches can degrade to 20–40ms or return fewer results than requested. If your RAG system retrieves per-tenant, per-document-type, or per-date-range, Qdrant's pre-filter architecture is a material advantage.

Sparse vector support (since Qdrant 1.7) enables native hybrid search by storing both dense and sparse (BM25-style) representations of the same document. A single query can retrieve and fuse results from both indexes using RRF (Reciprocal Rank Fusion) or a weighted linear combination. This is the cleanest implementation of hybrid search among the options covered here.
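
A sketch of that flow, assuming qdrant-client 1.10+ (the Query API) and a collection created with named dense and sparse vectors — the collection name, vector names, and the query variables are illustrative.

# Hybrid retrieval sketch: prefetch dense and sparse candidates, fuse with RRF.
# dense_embedding, sparse_indices, sparse_values are placeholders for your embedding
# and BM25-style encoder outputs.
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

response = client.query_points(
    collection_name="documents",
    prefetch=[
        models.Prefetch(query=dense_embedding, using="dense", limit=50),
        models.Prefetch(
            query=models.SparseVector(indices=sparse_indices, values=sparse_values),
            using="sparse",
            limit=50,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),   # fuse the two candidate lists
    limit=10,
    with_payload=True,
)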

The Rust implementation yields high throughput per CPU cycle. A self-hosted Qdrant on a 4vCPU/16GB instance handles 5M+ vectors at 4ms p50 query latency for 768-dimensional embeddings with standard HNSW parameters. That same instance manages 10M vectors at acceptable latency if you use scalar quantization (int8) to halve the memory footprint.

Qdrant supports on-disk HNSW with a memmap payload store — indexes that exceed RAM are paged from NVMe, which allows you to manage 50M+ vectors on a machine with 32GB RAM at the cost of slightly higher latency on cold reads.
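
A collection configured along those lines might look like the following sketch — qdrant-client syntax with illustrative values: original float32 vectors and the HNSW graph on disk, int8-quantized copies pinned in RAM for fast traversal.

# Collection with on-disk vectors/graph and int8 scalar quantization -- values are illustrative.
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=models.VectorParams(
        size=768,
        distance=models.Distance.COSINE,
        on_disk=True,                                      # original vectors paged from NVMe
    ),
    hnsw_config=models.HnswConfigDiff(m=16, ef_construct=64, on_disk=True),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,                               # quantized copies stay in RAM
        )
    ),
)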

What hurts

Self-hosted ops are real. You are responsible for rolling upgrades, snapshot management, hardware sizing, and monitoring. Qdrant Cloud (the managed offering) removes most of this but adds back cost — at 10M vectors, Qdrant Cloud pricing is in the $200–350/month range depending on region, which is still meaningfully cheaper than Pinecone at equivalent scale.

The distributed mode (multi-node Qdrant cluster with sharding and replication) works, but requires careful capacity planning. Rebalancing shards after a node addition takes time and increases write amplification. For most use cases under 30M vectors, a single well-sized Qdrant instance with snapshot-based backup is operationally simpler than a cluster.

The REST and gRPC APIs are both solid. The Python client is production-quality. The Go and Rust clients are excellent. If your stack is Java or .NET, the clients exist but are less mature.

Qdrant filter query example

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)

results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value="acme-corp")
            ),
            FieldCondition(
                key="doc_type",
                match=MatchValue(value="contract")
            )
        ]
    ),
    limit=10,
    with_payload=True
)

This query hits Qdrant's payload-indexed HNSW directly — the filter is applied during graph traversal, not after. At 10M vectors with 1% matching the filter, this returns 10 results in 4–8ms. A post-filtering approach at the same selectivity would need to retrieve ~1000 results and discard 990, degrading both latency and the returned result count.

When to pick Qdrant

Any RAG system with mandatory metadata filtering (per-tenant, per-document-type, per-date). Workloads where hybrid search (BM25 + dense) is a core requirement. Teams with Kubernetes or Docker operational capacity who want the best price-performance ratio at 5M–100M vectors.

Weaviate: hybrid-native, schema-heavy

Weaviate is an open-source vector database with a schema-first design. Where Qdrant gives you a flexible payload JSON, Weaviate gives you a typed schema with classes and properties. This constraint is also its strength for certain use cases: schema enforcement means better data consistency at ingest, and Weaviate's built-in modules can generate embeddings automatically, abstracting the embedding step from your application code.

What Weaviate does well

Native hybrid search is Weaviate's defining feature. Its BM25 + dense vector fusion is first-class, not bolted on. alpha=0.5 in a Weaviate query blends keyword and semantic signals equally; tuning alpha per query type is straightforward. For knowledge-base search where both exact keyword matching and semantic similarity matter — technical documentation, legal corpora, product catalogs — Weaviate's hybrid search is the most production-tested implementation available.
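
As an illustration, a hybrid query in the Weaviate Python client v4 looks roughly like this — it assumes a local instance, a Document collection, and a vectorizer module configured on it (otherwise you pass the query vector explicitly); names and the query text are placeholders.

# Hybrid (BM25 + dense) query sketch with the Weaviate v4 Python client.
import weaviate

client = weaviate.connect_to_local()
documents = client.collections.get("Document")

response = documents.query.hybrid(
    query="termination clause notice period",   # used for both keyword and vector retrieval
    alpha=0.5,                                   # 0 = pure keyword, 1 = pure vector
    limit=10,
)
for obj in response.objects:
    print(obj.properties)

client.close()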

Multi-tenancy via Weaviate's multiTenancyConfig is the cleanest implementation in the comparison. Each tenant gets isolated HNSW indexes with independent memory, which means one noisy tenant cannot degrade performance for others. At 1,000+ tenants with non-uniform data sizes, Weaviate's architecture handles this significantly better than Pinecone namespaces or a single Qdrant collection with tenant filters.

Weaviate's module system allows you to plug in OpenAI, Cohere, or HuggingFace embedding models and have Weaviate call them at ingest and query time. For teams that want a single service handling both embedding and search, this reduces application complexity. For teams that want control over the embedding pipeline — batching, caching, retry logic — this abstraction is more friction than help.

What hurts

Weaviate is the heaviest option here. A minimal Weaviate deployment requires 2–4 GB of RAM as a baseline before any vectors are loaded. For a workload that fits comfortably in a 16GB Qdrant instance, Weaviate typically needs 32GB+. This is not a concern at large scale, but it is a material cost difference for smaller workloads.

The GraphQL query API (Weaviate's primary query interface) is expressive but verbose. Simple filtered searches that are 3 lines in Qdrant's Python client require 15-line GraphQL queries in Weaviate. The REST API is cleaner, but documentation historically favored GraphQL examples. Weaviate 1.24+ introduced a gRPC-based client that is significantly faster for high-throughput queries, but the ecosystem is still maturing.

Schema migrations are painful. Adding a new property to a class in Weaviate requires careful handling — in some versions this triggers a reindex. If your document schema evolves frequently (new metadata fields, changing document types), Weaviate's schema enforcement becomes operational friction rather than a safety net.

When to pick Weaviate

Multi-tenant SaaS applications where tenant isolation is a hard requirement. Workloads where hybrid search quality is paramount and you want a battle-tested implementation. Teams that want to offload embedding generation to Weaviate's module system. Teams comfortable operating a heavier, schema-driven service in exchange for its feature set. If you index visual document embeddings (ColPali, ColQwen2) for multimodal RAG, expect roughly 10–20x storage overhead vs text-only — size your cluster accordingly.

Lesson learned

A legal tech client was using Weaviate for hybrid search across 2M contract documents. The hybrid search quality was excellent — alpha tuning per query type gave measurable precision improvements over pure dense search. The pain point was Weaviate's memory footprint: 48GB RAM for 2M vectors where an equivalent Qdrant deployment used 18GB. After profiling, most of the overhead came from Weaviate's internal object store and index structures. For document Q&A specifically, the recall improvement from hybrid search justified the memory cost. For pure semantic search workloads, it would not.

pgvector: Postgres-native, surprisingly capable

pgvector is a Postgres extension, not a standalone database. Vectors are stored in a Postgres table column of type vector(n), indexed with either HNSW or IVFFlat, and queried with a standard SQL distance operator. This is simultaneously its greatest constraint (it is bound to Postgres's execution model) and its most significant advantage (it eliminates an entire service from your architecture).

What pgvector does well

The operational simplicity argument is real. If you already run Postgres — and most applications do — adding pgvector means adding an extension, a column, and an index. No new service, no new auth system, no new monitoring setup, no data synchronization pipeline between your transactional database and your vector store. Vectors live in the same transaction boundary as your other data: you can join on them, filter them with arbitrary SQL, and update them atomically in the same transaction as your document table.

pgvector's HNSW implementation (added in pgvector 0.5.0, significantly improved in 0.7+) is production-grade. The default parameters (m=16, ef_construction=64) produce a high-quality index for most use cases. At 1M 1536-dimensional vectors, pgvector HNSW achieves p50 latency of 8–15ms on a modern Postgres instance — slower than Qdrant's 4ms on equivalent hardware, but within the acceptable range for most RAG pipelines where LLM inference is the dominant latency source by an order of magnitude.

pgvector 0.8+ added iterative index scans for filtered queries, which improves filtered recall significantly compared to earlier versions that degraded to sequential scans under selective filters. A sparse vector type (sparsevec) arrived in 0.7, but there is no built-in fusion API, so hybrid search remains an application-level exercise.

pgvector HNSW index creation

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create your documents table
CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    tenant_id   TEXT NOT NULL,
    doc_type    TEXT NOT NULL,
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Create an HNSW index on the embedding column
-- m=16, ef_construction=64 is the standard starting point
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Create indexes on your filter columns
-- Critical: without these, filtered queries degrade to seqscans
CREATE INDEX ON documents (tenant_id);
CREATE INDEX ON documents (doc_type);

-- Tune ef at session level for recall/latency trade-off
SET hnsw.ef_search = 64;

-- Query: top-10 nearest neighbors filtered by tenant
SELECT id, content, 1 - (embedding <=> $1::vector) AS score
FROM documents
WHERE tenant_id = 'acme-corp'
  AND doc_type = 'contract'
ORDER BY embedding <=> $1::vector
LIMIT 10;

One critical operational note: HNSW index builds are expensive, and before pgvector 0.6 they were single-threaded. On a table with 5M rows, an initial index build can take 15–45 minutes depending on hardware. Raise max_parallel_maintenance_workers and SET maintenance_work_mem = '8GB' before the build to parallelize it and keep the graph in memory, which reduces that time significantly.

What hurts

At 10M+ vectors, pgvector's query latency under load diverges from dedicated databases. Postgres's buffer pool management is not optimized for vector workloads — if your HNSW index does not fit entirely in shared_buffers, you will see cache-miss latency spikes under concurrent load. Dedicated vector databases (Qdrant, Weaviate) use memory-mapped files and purpose-built I/O paths that handle this more gracefully.

Ingest throughput is limited by Postgres write amplification. HNSW index updates are expensive — each insert requires traversing the graph to find the insertion point. At sustained ingest rates above 500 vectors/second, the practical pattern is to drop the index, bulk-insert, and rebuild it afterward. This is manageable but requires care.

pgvector has no native hybrid search support that is operationally clean — the sparsevec type stores sparse vectors, but there is no fusion API. You can combine tsvector full-text search with vector search in application logic, as shown below, but the fusion is yours to write. If hybrid search is a core requirement, Qdrant or Weaviate are stronger options.
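
A sketch of that application-side approach, assuming psycopg 3, the documents table from the earlier example, and a hypothetical content_tsv tsvector column — run a full-text query and a vector query separately, then fuse the two ranked lists with RRF in Python.

# Application-side hybrid search over pgvector: tsvector + vector queries, fused with RRF.
# Assumes psycopg 3 and a content_tsv tsvector column (hypothetical) on the documents table.
import psycopg

def hybrid_search(conn: psycopg.Connection, query_text: str, query_embedding: list[float],
                  k: int = 10, rrf_k: int = 60) -> list[int]:
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """SELECT id FROM documents
               WHERE content_tsv @@ plainto_tsquery('english', %s)
               ORDER BY ts_rank(content_tsv, plainto_tsquery('english', %s)) DESC
               LIMIT 50""",
            (query_text, query_text),
        )
        keyword_ids = [row[0] for row in cur.fetchall()]

        cur.execute(
            "SELECT id FROM documents ORDER BY embedding <=> %s::vector LIMIT 50",
            (vec_literal,),
        )
        dense_ids = [row[0] for row in cur.fetchall()]

    # Reciprocal rank fusion: score = sum over retrievers of 1 / (rrf_k + rank).
    scores: dict[int, float] = {}
    for ids in (keyword_ids, dense_ids):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]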

When to pick pgvector

Any RAG system where your corpus is under 5M vectors and you are already running Postgres. Teams who prioritize operational simplicity and transactional consistency over raw query throughput. Situations where your vector data needs to join against relational data frequently. As a starting point before you have hit scale limits that justify a migration.

Lesson learned

For an internal document assistant at a 200-person professional services firm, we chose pgvector on RDS over Pinecone. The corpus was 800K document chunks. The RDS instance (db.r6g.large, $180/month) handled the entire RAG workload including the vector index. We had joins between the vector search results and a metadata table, transactional deletes when documents were removed, and no additional service to manage. At that scale, pgvector was the boring, correct choice. Two years later, the system still runs without incident on the same stack.

Milvus and Chroma: when they fit

Milvus

Milvus is the right choice when you are operating at 100M+ vectors with high write throughput requirements and need a distributed architecture with fine-grained resource isolation. It supports multiple index types (HNSW, IVF, ScaNN, DiskANN), GPU-accelerated search (meaningful for 100M+ vector workloads), and a tiered storage model that offloads cold vectors to object storage. The operational complexity is significant — Milvus requires Etcd, MinIO or S3, and a message queue (Pulsar or Kafka) in addition to the query and data nodes. For most workloads under 50M vectors, Qdrant is simpler and faster. Milvus is worth evaluating when you are past that threshold and need horizontal scaling with strong consistency guarantees.

Chroma

Chroma is excellent for local development and rapid prototyping. Its in-process Python mode (no server required) makes it the fastest way to test an embedding pipeline. It is not a production database. Chroma's distributed mode is less mature than its local mode, durability guarantees are weaker than the alternatives, and query performance at 1M+ vectors is not competitive with HNSW-based systems. Use Chroma to build and iterate locally; use something else in production. The migration from Chroma to pgvector or Qdrant is straightforward — treat it as a deliberate architectural step at the end of your prototype phase, not a painful migration.

Filter performance: pre-filter vs post-filter

Filtered vector search is where architectures diverge most significantly. Understanding the two approaches is critical for production deployments where most queries include metadata constraints.

Post-filtering runs the ANN search first on the full index, retrieves top-k candidates, then applies the metadata filter. This is computationally efficient for large indexes, but has a fundamental correctness problem: if your filter is highly selective (e.g., only 1% of vectors match a given tenant_id), the ANN pass must retrieve 100x more candidates than requested to have any chance of returning k matching results. Most implementations do not over-retrieve by default — they return fewer than k results, which breaks the assumption callers make about the result set size. This is the primary source of "why does my RAG only return 3 results when I asked for 10" bugs in production.

Pre-filtering applies the metadata filter before or during ANN traversal. The ANN algorithm only considers vectors matching the filter. This guarantees recall — you will always get up to k results from the matching set — but requires the index to be metadata-aware at build time, which increases complexity and memory footprint.

Among the databases covered here:

  • Qdrant: pre-filter via payload-indexed HNSW. Best-in-class filtered performance. Adds 1–3ms overhead regardless of filter selectivity.
  • Weaviate: pre-filter via a separate inverted index on properties. Strong filtered performance, slightly higher overhead than Qdrant on very selective filters.
  • pgvector: post-filter by default. Iterative index scans in 0.8+ improve this significantly, but performance still degrades on selective filters. For high-selectivity filters, adding a BTree index on filter columns, raising hnsw.ef_search, and enabling iterative scans (0.8+) helps.
  • Pinecone: post-filter with no tuning ability. The most opaque behavior here — you cannot control how many candidates are retrieved before filtering.

Hybrid search: BM25 + dense fusion

Pure dense vector search has a well-documented failure mode: it handles semantic similarity well but fails on exact keyword matches, product codes, proper nouns, and any token that does not have a strong learned representation. A query for "invoice INV-2024-8832" in a dense-only system will retrieve semantically similar invoices rather than that specific invoice. Hybrid search combines BM25 (exact keyword matching via inverted index) with dense vector search, fusing the two result sets before returning.

The fusion methods (both are sketched in code after this list):

  • RRF (Reciprocal Rank Fusion): each candidate's score is 1 / (k + rank) from each retriever, summed. Simple, parameter-free, robust. Default in most implementations.
  • Linear combination: alpha * dense_score + (1 - alpha) * sparse_score. Requires tuning alpha on your specific corpus; can outperform RRF when tuned, worse when not.
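
A minimal, database-agnostic sketch of both methods — note that linear combination needs each retriever's scores normalized to a common scale before blending, while RRF only needs ranks. Function names, list shapes, and the alpha value are illustrative.

# Fusing a dense result list and a sparse/BM25 result list -- illustrative sketch.

def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60, top_n: int = 10) -> list[str]:
    # Reciprocal rank fusion: only ranks matter, no score normalization needed.
    scores: dict[str, float] = {}
    for ids in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def linear_fuse(dense: dict[str, float], sparse: dict[str, float],
                alpha: float = 0.5, top_n: int = 10) -> list[str]:
    # Linear combination: min-max normalize each retriever's scores, then blend with alpha.
    def norm(s: dict[str, float]) -> dict[str, float]:
        lo, hi = min(s.values()), max(s.values())
        return {d: (v - lo) / (hi - lo) if hi > lo else 0.0 for d, v in s.items()}
    d, s = norm(dense), norm(sparse)
    blended = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0) for doc in set(d) | set(s)}
    return sorted(blended, key=blended.get, reverse=True)[:top_n]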

Support by database:

  • Qdrant: native sparse vectors + dense vectors in one collection, RRF and linear fusion via query API. The cleanest production implementation.
  • Weaviate: native BM25 module + dense vectors, alpha parameter in query. Well-tested, battle-hardened for knowledge-base workloads.
  • pgvector: no built-in fusion API. You combine tsvector full-text search with vector distance in application code, then fuse manually. Works, but requires more application logic.
  • Pinecone: sparse-dense hybrid available on the non-serverless tier. The sparse vectors must be generated externally (e.g., with BM25 or SPLADE). Less seamless than Qdrant or Weaviate.

For production RAG, implementing hybrid search typically improves recall by 8–15% on knowledge-heavy corpora. For more on retrieval quality and reranking strategies, see our forthcoming article on hybrid search and reranking.

Ops reality at 1M, 10M, and 100M vectors

At 1M vectors

All four databases handle 1M vectors comfortably on modest hardware. The differentiation here is ops burden, not performance. pgvector on an existing Postgres instance is zero additional ops cost. Pinecone Serverless is similarly frictionless. Qdrant on a 2vCPU/8GB instance handles 1M vectors with headroom. Weaviate is overengineered for this scale — its memory baseline will consume most of an 8GB instance before vectors are loaded. Recommended order of preference at 1M vectors: pgvector > Pinecone Serverless > Qdrant > Weaviate.

At 10M vectors

This is where pgvector's limitations become measurable. An HNSW index for 10M 1536-dimensional float32 vectors takes 60–70 GB of RAM — more than most single Postgres instances have available, and significantly more than pgvector can realistically keep hot in shared_buffers. You can use scalar quantization to halve this, but even 30GB dedicated to a vector index is a meaningful infrastructure decision. Qdrant with int8 quantization handles 10M vectors on a 16GB instance at 6–10ms p50 latency. Pinecone at 10M vectors costs $700–900/month on pod-based indexes. The self-hosted vs managed trade-off is at its most economically interesting at this scale.

At 100M vectors

The realistic options are Qdrant with on-disk HNSW and NVMe storage, Weaviate on a sized cluster, or Milvus distributed. Pinecone at this scale is expensive enough ($5,000+/month) that the ops cost of self-hosting is economically justified for almost any team. pgvector is not competitive at this scale without DiskANN-style paging, which is not yet supported. Milvus with tiered storage becomes attractive here — it can store hot vectors in RAM, warm vectors on NVMe, and cold vectors in S3, with automatic promotion and demotion.

Cost comparison at different scales

Numbers below are approximate as of 2026. Self-hosted costs assume standard cloud compute pricing (AWS/GCP/Azure), excluding reserved instance discounts. Managed pricing is from published vendor pricing pages.

  • 1M vectors, 768-dim: pgvector (existing Postgres, ~$0 incremental) / Qdrant Cloud (~$50/month) / Pinecone Serverless (~$20–40/month at moderate query volume) / Weaviate Cloud (~$60/month)
  • 10M vectors, 768-dim: pgvector self-hosted on a 64GB RDS instance (~$250–400/month) / Qdrant self-hosted on 32GB instance with int8 quantization (~$120–180/month compute) / Qdrant Cloud (~$200–350/month) / Pinecone pods (~$700–900/month) / Weaviate Cloud (~$450/month)
  • 100M vectors, 768-dim: Qdrant self-hosted with on-disk HNSW (~$600–900/month for a cluster with NVMe) / Milvus on Kubernetes (~$800–1,200/month) / Pinecone pods ($4,000–7,000+/month) / Weaviate self-hosted cluster (~$700–1,000/month)

The pattern is consistent: managed services trade cost for ops simplicity. The crossover where self-hosted becomes clearly economical is typically around $300–500/month in managed costs, which corresponds to roughly 5–10M vectors for most use cases.

Comparison table

Dimension | Pinecone | Qdrant | Weaviate | pgvector
Hosting model | Managed only | Self-hosted or Cloud | Self-hosted or Cloud | Self-hosted (Postgres)
Index type | Proprietary (HNSW-like, not tunable) | HNSW (payload-indexed) | HNSW + inverted index | HNSW or IVFFlat
Hybrid search | Sparse-dense (non-serverless) | Native (RRF / linear) | Native BM25 + dense | Manual (tsvector)
Filter performance | Post-filter, opaque | Pre-filter, best-in-class | Pre-filter, strong | Post-filter (iterative in 0.8+)
Multi-tenancy | Namespaces (basic) | Payload filter or collections | Native (isolated indexes) | Schema-level (SQL WHERE)
Scale ceiling | ~100M (managed cost) | 100M+ (on-disk HNSW) | 100M+ (cluster) | ~5M practical; 10M+ with care
Cost at 10M vectors | $700–900/month | $120–350/month | $400–600/month | $250–400/month
Ops complexity | None | Medium (if self-hosted) | High (memory baseline, schema) | Low (existing Postgres ops)
HNSW tuning | None | Full (m, ef_construction, ef) | Full | Full (m, ef_construction, ef_search)

Recommendation matrix

This is the decision framework we use when starting a new RAG project:

  • Under 5M vectors, existing Postgres stack: use pgvector. No additional service, no migration risk, transactional consistency, zero incremental ops cost. This covers the majority of SMB and internal-tool RAG use cases.
  • Under 5M vectors, no Postgres, need simplicity fast: use Pinecone Serverless. Pay the cost premium for the ops-free experience during early product validation. Migrate when the monthly bill exceeds what a part-time engineer would cost in Kubernetes hours.
  • 5M–50M vectors, filtered search is critical, team has container ops experience: use Qdrant self-hosted. Best price-performance ratio, best filter performance, clean Python API. Budget a week for initial cluster setup and snapshot automation.
  • Multi-tenant SaaS, 1,000+ tenants, hybrid search required: use Weaviate. Its multi-tenancy model and hybrid search quality justify the memory overhead and operational weight at this specific use case profile.
  • 100M+ vectors, high write throughput, need distributed architecture: evaluate Milvus. This is the only scenario where Milvus's operational complexity is justified over a well-tuned Qdrant cluster.
  • Prototyping and local development: use Chroma or Qdrant Docker. Do not use Pinecone for prototyping — it builds coupling to a managed API early in the development cycle. Qdrant's Docker image boots in seconds and behaves identically to production.

Lesson learned

The most common mistake we see is over-engineering the vector store choice relative to the actual workload. An internal knowledge assistant for a 150-person company has 300K–1M document chunks at most. For that workload, the vector store is the least interesting engineering problem. The retrieval quality, the chunking strategy, the metadata schema, and the evaluation pipeline matter far more than whether you chose Qdrant over Weaviate. Start with pgvector. Migrate when you have a measured reason to.

Further reading

  • RAG: A Technical Guide — How RAG works end-to-end, from chunking strategy to vector store to generation. The foundation before this article.
  • Production RAG: 5 Failure Modes We Keep Seeing — The evaluation, retrieval, and observability failures that matter more than vector store choice.
  • Hybrid Search and Reranking — Forthcoming deep-dive on BM25 + dense fusion, RRF vs linear combination, and cross-encoder reranking in production.
  • Embedding Models in 2026 — Forthcoming comparison of text-embedding-3-large, Cohere Embed v3, and open-source models. Choosing the right embedding model is as important as choosing the right vector store.
  • RAG Systems — Tensoria's end-to-end service for production RAG, including vector store selection, eval infrastructure, and observability.
  • HNSW Paper (Malkov & Yashunin) — The original research. Understanding it is worth an hour of your time if you are tuning HNSW at scale.
  • ann-benchmarks.com — Independent ANN algorithm benchmarks. Useful for hardware-to-hardware QPS comparisons; less useful for production trade-off decisions.
  • Qdrant documentation — Especially the sections on payload indexing, quantization, and on-disk HNSW configuration.

Talk to an engineer

Building a RAG system and not sure which vector store fits your scale? We help teams make that call — and build the system around it.

Our RAG service

FAQ

What does a vector database actually cost at 1M, 10M, and 100M vectors?

At 1M vectors, pgvector on your existing Postgres instance is effectively free. At 10M vectors, a self-hosted Qdrant instance with int8 quantization runs roughly $120–180/month in compute. Pinecone's p2 pod at equivalent scale runs $700+/month. At 100M vectors, Milvus or Qdrant on managed Kubernetes is typically 3–5x cheaper than Pinecone, but you absorb the ops cost of running the cluster.

Can pgvector handle production RAG workloads?

Yes, with caveats. pgvector 0.8+ with HNSW indexing handles 1M–5M vectors well on typical Postgres hardware. For 10M+ vectors or sub-5ms p99 latency requirements, dedicated vector databases have a clear edge. pgvector's main strength is that it lives inside your existing Postgres transaction boundary — no additional service, no data sync, no separate auth layer. For most SMB and internal-tool RAG workloads, it is the correct default.

What is the difference between HNSW and IVFFlat?

HNSW builds a multi-layer proximity graph at index time — queries traverse from coarse to fine layers in O(log n) hops with no training required. IVFFlat partitions vectors into Voronoi cells via k-means clustering (requires a training step) and probes a subset of cells at query time. HNSW generally wins on the recall/latency trade-off and supports incremental inserts cleanly. IVFFlat uses less memory and builds faster for large initial loads. For most RAG workloads, HNSW with m=16, ef_construction=64 is the right default.

Does Qdrant support hybrid search?

Yes. Qdrant supports sparse vectors natively since version 1.7, enabling BM25-style sparse retrieval combined with dense vector retrieval in a single query using RRF or weighted linear fusion. Weaviate also has strong native hybrid search. pgvector does not support hybrid search natively — you would combine it with tsvector full-text search in application logic. Pinecone supports sparse-dense hybrid on non-serverless indexes, with external sparse vector generation required.

What is the difference between pre-filtering and post-filtering?

Post-filtering runs ANN search first on the full index, then applies metadata filters to the result set. This is fast but has a recall problem when filters are selective — you may get fewer results than your top_k because most candidates were filtered out. Pre-filtering applies metadata conditions before or during ANN traversal, guaranteeing up to k results from the matching subset. Qdrant's payload-indexed HNSW adds only 1–3ms overhead for pre-filtering regardless of selectivity. pgvector defaults to post-filtering, improved but not eliminated in pgvector 0.8+.

When does Pinecone make sense?

Pinecone makes sense when you have no Kubernetes expertise, need to be in production within days, and the cost delta (typically 2–4x over self-hosted at equivalent scale) is acceptable relative to engineering time saved. It also fits prototypes that need to validate product-market fit before committing to infrastructure decisions. Once you cross 10M vectors or $500/month on Pinecone, the self-hosted math usually tips in favor of Qdrant or Weaviate — provided your team can absorb the ops responsibility.

Anas Rabhi Data Scientist & Founder, Tensoria

I am a data scientist specializing in generative AI and production RAG systems. I help engineering teams select, deploy, and scale vector infrastructure that matches their actual workload — not the one described in vendor marketing. Process automation, internal knowledge assistants, intelligent document processing — I build systems that integrate into existing stacks and deliver measurable results.