Vector Store Sizing: The Cost Truth Nobody Tells You
- Embedding dimensionality is the largest single cost driver. Cutting from 1536 to 768 dimensions halves storage and roughly halves query cost.
- Replication factor has a 2x to 3x cost multiplier and is often set higher than the workload needs.
- Hybrid (BM25 + dense) costs more than dense-only but lifts recall enough to be worth it on heterogeneous corpora.
- Self-hosted on a small VM beats managed services below roughly 5 million vectors. Above that, managed wins on operational cost.
The first time we sized a vector store on an engagement, we asked the team what their projected query volume would be. They had not measured it. We asked what their target latency was. They had not set one. We asked what dimensionality their embeddings used. They knew this one (1536, the OpenAI default). We then quoted a managed Pinecone bill that made the room go quiet.
After two weeks of profiling, the right answer was a 4-vCPU Qdrant instance on a 100-dollar-per-month VM. The cost difference was about 12x. None of it was about the database brand; it was about the workload assumptions the team had not measured.
This piece is the configuration arithmetic that moves vector store bills by an order of magnitude: the honest cost models we have measured across vendors, and the levers that matter regardless of which one you pick.
What a million vectors actually costs
For a baseline workload (1 million vectors, 1536 dimensions, 10 queries per second peak, 100ms target latency, single region), here is what we typically see in 2026 pricing:
| Setup | Storage cost | Query cost (at 10 QPS sustained) | Notes |
|---|---|---|---|
| Pinecone serverless (1536d) | ~$70/mo | ~$120/mo | Pay-per-read, scales with traffic |
| Pinecone p1.x1 pod | ~$70/mo | included | Fixed pod cost, scales with replica count |
| Weaviate Cloud (sandbox tier) | $25/mo | included | Next pricing tier starts around 5M vectors |
| Qdrant Cloud (1GB cluster) | ~$60/mo | included | Sufficient for 1-2M vectors at 1536d |
| Self-hosted Qdrant on $80 VM | $80/mo | included | 4 vCPU, 16GB RAM handles this easily |
| Self-hosted pgvector on existing Postgres | $0 marginal | ~$0 | If Postgres already running |
Three things change these numbers significantly:
Dimensionality. Each vector at 1536 dimensions in float32 takes 6KB. At 768 dimensions, 3KB. At 384 dimensions, 1.5KB. Storage scales linearly. Query cost scales sublinearly but still meaningfully. Cutting dimensionality in half is the single highest-leverage cost lever.
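The storage arithmetic is worth scripting once so you can compare candidate dimensionalities side by side. A minimal sketch (raw float32 storage only; index overhead comes on top):

```python
def raw_vector_storage(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage in GB, excluding index overhead
    (HNSW adds roughly 1.5x to 2x on top of this)."""
    return n_vectors * dims * bytes_per_float / 1e9

for dims in (1536, 768, 384):
    gb = raw_vector_storage(1_000_000, dims)
    print(f"{dims}d: {gb:.1f} GB per million vectors")
# 1536d: 6.1 GB per million vectors
# 768d: 3.1 GB per million vectors
# 384d: 1.5 GB per million vectors
```

The per-vector figures above (6KB, 3KB, 1.5KB) are this same arithmetic divided through by a million.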
Replication factor. A single replica handles read traffic up to its limits. Two replicas double the bill. High availability often calls for three replicas across zones. We routinely see teams running three replicas on workloads where one would meet the SLO; the team set up HA by default without measuring whether the workload needed it.
Index type. HNSW (hierarchical navigable small world) is fast at query time but builds a memory-resident index roughly 1.5x to 2x the size of the raw vectors. IVF (inverted file index) trades query speed for lower memory. For latency-sensitive workloads HNSW is the right default; for cost-sensitive workloads with tolerable latency, IVF can cut memory usage substantially.
The three configuration levers that matter most
Lever 1: dimensionality. Most teams use the default dimensionality of whatever embedding model they picked first (typically 1536 from OpenAI). Almost nobody re-evaluates after the first month. Measuring recall@5 at multiple dimensionalities on a labelled question set takes an afternoon and almost always finds that lower dimensions are usable.
For a corpus we audited last quarter, recall@5 was:
| Dimensionality | Recall@5 | Storage per million vectors |
|---|---|---|
| 1536 | 0.91 | 6.0 GB |
| 1024 | 0.90 | 4.0 GB |
| 768 | 0.89 | 3.0 GB |
| 512 | 0.85 | 2.0 GB |
| 256 | 0.78 | 1.0 GB |
The team had been running 1536d “to be safe.” Moving to 768d cost them two points of recall (0.91 to 0.89) and saved 50% of storage. Combined with the lower query cost (vector comparisons scale with dimension count), the monthly bill dropped 45%.
The model used must support truncation cleanly. OpenAI’s text-embedding-3-large is Matryoshka-trained, which means it loses minimal quality when truncated. ada-002 is not, and truncation hurts more there.
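Running this measurement yourself is a short script once you have a labelled set of (query, relevant document) pairs. A sketch with synthetic vectors standing in for real embeddings; the one subtlety is that truncation must re-normalize, since cosine similarity assumes unit vectors. On this synthetic set recall stays near 1.0 at every truncation; the point is the harness, which you run against your real embeddings and labels:

```python
import numpy as np

def truncate(emb: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` components, re-normalize."""
    t = emb[:, :dims]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

def recall_at_k(queries: np.ndarray, docs: np.ndarray,
                relevant: list[int], k: int = 5) -> float:
    """Fraction of queries whose labelled relevant doc lands in the top-k by cosine."""
    top_k = np.argsort(-(queries @ docs.T), axis=1)[:, :k]
    return sum(rel in row for rel, row in zip(relevant, top_k)) / len(relevant)

# Synthetic stand-in: each query is a noisy copy of its relevant doc.
rng = np.random.default_rng(0)
docs = truncate(rng.normal(size=(200, 1536)), 1536)      # truncate() also normalizes
queries = truncate(docs + 0.01 * rng.normal(size=docs.shape), 1536)
relevant = list(range(200))

for dims in (1536, 768, 512):
    r = recall_at_k(truncate(queries, dims), truncate(docs, dims), relevant)
    print(f"{dims}d: recall@5 = {r:.2f}")
```

Swap the synthetic block for your own query embeddings, document embeddings, and labelled relevant indices, and the loop produces the table above for your corpus.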
Lever 2: replication factor. A single replica is enough for many production workloads. Two replicas double cost. Three replicas (the typical HA default) triple it.
The honest test for replication is: what is your tolerated unavailability for the vector store specifically, and how often do you actually deploy or restart it? A vector store that is part of a customer-facing path probably needs two replicas. A vector store that backs an internal search tool probably needs one. A vector store that backs a batch job that runs nightly genuinely does not need any replication; failover can be a manual restart.
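The replica count can also be backed into from the availability target rather than defaulted. A toy model, assuming independent replica failures (optimistic, since correlated zone failures exist; the point is the shape of the curve, not the digits):

```python
def fleet_availability(replica_uptime: float, replicas: int) -> float:
    """Probability that at least one replica is serving,
    assuming failures are independent (they are not, entirely)."""
    return 1 - (1 - replica_uptime) ** replicas

for n in (1, 2, 3):
    print(f"{n} replica(s): {fleet_availability(0.999, n):.6f}")
# 1 replica(s): 0.999000
# 2 replica(s): 0.999999
# 3 replica(s): 1.000000
```

A single three-nines replica already covers many internal SLOs; the honest question is whether your availability requirement actually demands the second or third.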
Lever 3: hybrid versus dense-only retrieval. Hybrid search (BM25 + dense) typically lifts recall by 5 to 15 percent on heterogeneous corpora (mixed query types, exact-match needs alongside semantic). It also doubles the per-query cost: two index lookups instead of one, plus rank fusion.
The trade is worth it when:
- Queries include identifiers, product codes, or specific names where exact match matters
- The corpus has high lexical diversity
- A few percent recall gain is worth the cost increase
The trade is not worth it when:
- All queries are conversational (“how do I do X”)
- The corpus is internally consistent in language
- Dense-only already exceeds your recall target
We default to hybrid for general-purpose RAG and dense-only for narrow conversational corpora.
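The fusion step itself is cheap; the cost is the second index lookup. Engines vary in how they merge the two result lists, but reciprocal rank fusion is a common choice. A sketch:

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per document.
    k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked by both lists beats one ranked well by only one.
print(rrf_fuse(["a", "b", "c"], ["a", "c", "d"]))  # -> ['a', 'c', 'b', 'd']
```

Note how "c", which appears in both rankings, outscores "b", which appears only in the dense ranking at a better position; that agreement bonus is what hybrid buys.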
Self-hosted versus managed: where the line is
The most common over-spend pattern is paying for a managed vector store when self-hosted would meet the workload at a fraction of the cost.
The honest line we see in 2026:
| Workload | Self-hosted recommended | Managed recommended |
|---|---|---|
| < 1M vectors, low traffic | pgvector or single Qdrant container | Overkill |
| 1-10M vectors, low to moderate traffic | Qdrant on a $100-300/mo VM | Optional |
| 10-50M vectors, multi-region | Possible but operational cost rises | Often worth it |
| > 50M vectors, high QPS, multi-region | Possible if dedicated team | Strongly recommended |
Self-hosted costs are dominated by the VM bill plus engineering time to operate. Managed costs are dominated by service fees but include the operational layer. The crossover depends on what you value more: the cash bill or the engineer-week per quarter spent on backups, upgrades and incident response.
For a team without dedicated platform engineering, the operational cost of self-hosted is often higher than it looks. The team’s senior engineer spends two days a quarter on routine vector-store maintenance; if that engineer’s time is worth more than the managed service’s premium, managed wins on real cost even though self-hosted wins on cash cost.
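The crossover is worth writing down as arithmetic rather than arguing about. A sketch with the engineer-time term made explicit; the rates below are illustrative, not measured:

```python
def self_hosted_real_cost(vm_monthly: float, eng_days_per_quarter: float,
                          eng_day_rate: float) -> float:
    """Monthly real cost: cash VM bill plus amortized engineering time on ops."""
    return vm_monthly + eng_days_per_quarter * eng_day_rate / 3

# Illustrative numbers: $100/mo VM, 2 engineer-days/quarter at $1,200/day.
real = self_hosted_real_cost(100, 2, 1200)
print(f"self-hosted real cost: ${real:.0f}/mo")  # -> $900/mo
# If a managed tier for the same workload quotes under this figure, managed
# wins on real cost even though self-hosted wins on cash cost.
```
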
Operational gotchas we have seen
Index rebuilds during traffic. HNSW rebuilds are CPU-heavy and can spike query latency for hours. If your vector store is single-replica and you rebuild during business hours, you have an outage. Rebuild during off-peak or use a multi-replica setup with rolling rebuild.
Embedding drift on model updates. When a provider updates or replaces an embedding model (as OpenAI did when the text-embedding-3 family superseded ada-002), embeddings from the new model are not comparable to embeddings from the old. Migrating means re-embedding the entire corpus. For a 10M-vector corpus, that is a meaningful cost in both money and time. Plan it before the provider deprecates the old model on you.
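The re-embedding bill is worth estimating before the deprecation notice arrives. Back-of-envelope arithmetic; the chunk size and per-token rate below are placeholders, so substitute your own corpus stats and your provider's current pricing:

```python
def reembed_cost_usd(n_vectors: int, avg_tokens_per_chunk: float,
                     usd_per_million_tokens: float) -> float:
    """Embedding-API cost to re-embed a corpus; ignores pipeline time
    and the storage churn of writing the new vectors."""
    return n_vectors * avg_tokens_per_chunk / 1e6 * usd_per_million_tokens

# Placeholder inputs: 10M chunks averaging 500 tokens, at $0.13/M tokens.
print(f"${reembed_cost_usd(10_000_000, 500, 0.13):,.0f}")  # -> $650
```

The API bill is often the smaller half; rate limits make the wall-clock time to push 5 billion tokens through an embedding endpoint the part that actually hurts.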
Metadata filter explosion. Vector queries with rich metadata filtering (where category = 'X' AND tags CONTAINS 'Y' AND date > 'Z') interact badly with HNSW. The HNSW graph traversal does not natively respect filters; the engine either applies filters post-graph (which can return fewer than K results) or pre-filters (which can be slow on large filter sets). Test with realistic filter loads, not with empty filters.
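The post-filter failure mode is easy to reproduce. A sketch of post-filtering over a brute-force top-k (real engines traverse the HNSW graph instead of sorting all similarities, but the under-filling behaviour is the same):

```python
import numpy as np

def post_filtered_search(query, vectors, metadata, predicate, k=5, overfetch=3):
    """Fetch the top-(k * overfetch) by similarity, then apply the filter.
    When the filter is selective, fewer than k results come back."""
    sims = vectors @ query
    candidates = np.argsort(-sims)[: k * overfetch]
    return [int(i) for i in candidates if predicate(metadata[i])][:k]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
metadata = ["common"] * 1000
metadata[7] = "rare"                      # only one doc passes the filter

hits = post_filtered_search(vectors[7], vectors, metadata,
                            lambda m: m == "rare", k=5)
print(hits)  # -> [7]  (asked for 5, got 1)
```

Overfetching papers over mildly selective filters; highly selective filters need the engine's pre-filtering path, which is exactly the path you must load-test.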
Cost surprises on scaling traffic. Pay-per-query managed pricing scales linearly with traffic. A successful product feature that drives 10x query volume drives 10x bill. Set alerts on monthly spend, not just on volume.
The honest workflow
Before picking any vector store:
- Measure expected vector count at year one
- Measure expected query rate, peak and sustained
- Set a target tail latency
- Set a recall target on a labelled question set
- Test embedding dimensionality at 1536, 768, and 512 against the recall target
- Decide replication based on availability requirement, not on default
- Pick the cheapest setup that meets the four constraints
Most teams skip steps 1 through 4 and go straight to step 7 with default values. The result is a bill that is correct for someone else’s workload.
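The last step is mechanical once the first six produce numbers. A sketch of the selection; the setup records here are hypothetical, and in practice you fill them in from your own benchmarks and vendor quotes:

```python
def pick_cheapest(setups, *, max_p99_ms, min_recall, peak_qps, n_vectors):
    """Cheapest setup meeting all four measured constraints, or None."""
    viable = [s for s in setups
              if s["p99_ms"] <= max_p99_ms
              and s["recall_at_5"] >= min_recall
              and s["max_qps"] >= peak_qps
              and s["max_vectors"] >= n_vectors]
    return min(viable, key=lambda s: s["usd_monthly"], default=None)

setups = [  # hypothetical benchmark results
    {"name": "pgvector", "p99_ms": 90, "recall_at_5": 0.89,
     "max_qps": 50, "max_vectors": 5_000_000, "usd_monthly": 0},
    {"name": "managed-serverless", "p99_ms": 60, "recall_at_5": 0.91,
     "max_qps": 500, "max_vectors": 50_000_000, "usd_monthly": 190},
]
best = pick_cheapest(setups, max_p99_ms=100, min_recall=0.85,
                     peak_qps=10, n_vectors=1_000_000)
print(best["name"])  # -> pgvector
```

The interesting output is not which row wins but how often the winner is the cheap row once the constraints are real numbers instead of defaults.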
The configurations that win are not exotic. They are the workload measured honestly, and the configuration matched to the measurement. The teams that do this once at sizing time pay the right bill for the entire lifetime of the system. The teams that do not, pay 5x to 10x of the right bill until somebody notices and the audit happens, usually quarters later.
Questions teams ask
Should I self-host or use a managed service?
Below 5 million vectors and a single replica, self-hosted Qdrant or pgvector on a small managed VM is cheaper. Above 10 million vectors with multi-region replication and high QPS, managed services (Pinecone, Weaviate Cloud, Qdrant Cloud) save more in operational cost than they cost in service fees.
Is pgvector production-ready?
Yes for moderate scale. pgvector with HNSW indexing handles tens of millions of vectors with acceptable query latency. Above that, dedicated vector engines (Qdrant, Weaviate) outperform on tail latency. The operational simplicity of staying inside your existing Postgres is real and worth a tier of scale.
How do dimensionality reduction techniques affect quality?
Matryoshka embeddings (where the model is trained to be useful at multiple truncation lengths) lose roughly 1 to 3 percent recall when truncated from 1536 to 768 dimensions. PCA-based reduction loses 5 to 10 percent. Use Matryoshka-trained embeddings if cost matters and you can afford a small recall trade.