Vector Store Sizing: The Cost Truth Nobody Tells You
- Embedding dimensionality is the largest single cost driver. Cutting from 1536 to 768 dimensions halves storage and roughly halves query cost.
- Replication factor has a 2x to 3x cost multiplier and is often set higher than the workload needs.
- Hybrid (BM25 + dense) costs more than dense-only but lifts recall enough to be worth it on heterogeneous corpora.
- Self-hosted on a small VM beats managed services below roughly 5 million vectors. Above that, managed wins on operational cost.
The first time we sized a vector store on an engagement, we asked the team what their projected query volume would be. They had not measured it. We asked what their target latency was. They had not set one. We asked what dimensionality their embeddings used. They knew this one (1536, the OpenAI default). We then quoted a managed Pinecone bill that made the room go quiet.
After two weeks of profiling, the right answer was a 4-vCPU Qdrant instance on a 100-dollar-per-month VM. The cost difference was about 12x. None of it was about the database brand; it was about the workload assumptions the team had not measured.
This piece is the configuration arithmetic that moves vector store bills by an order of magnitude: the honest cost models we have measured across vendors, and the levers that matter regardless of which one you pick.
What a million vectors actually costs
For a baseline workload (1 million vectors, 1536 dimensions, 10 queries per second peak, 100ms target latency, single region), here is what we typically see in 2026 pricing:
| Setup | Storage cost | Query cost (at 10 QPS sustained) | Notes |
|---|---|---|---|
| Pinecone serverless (1536d) | ~$70/mo | ~$120/mo | Pay-per-read, scales with traffic |
| Pinecone p1.x1 pod | ~$70/mo | included | Fixed pod cost, scales with replica count |
| Weaviate Cloud (sandbox tier) | $25/mo | included | Next pricing tier starts around 5M vectors |
| Qdrant Cloud (1GB cluster) | ~$60/mo | included | Sufficient for 1-2M vectors at 1536d |
| Self-hosted Qdrant on $80 VM | $80/mo | included | 4 vCPU, 16GB RAM handles this easily |
| Self-hosted pgvector on existing Postgres | $0 marginal | ~$0 | If Postgres already running |
Three things change these numbers significantly:
Dimensionality. Each vector at 1536 dimensions in float32 takes 6KB. At 768 dimensions, 3KB. At 384 dimensions, 1.5KB. Storage scales linearly. Query cost scales sublinearly but still meaningfully. Cutting dimensionality in half is the single highest-leverage cost lever.
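The storage arithmetic is worth scripting once so you can compare candidate dimensionalities side by side. A minimal sketch (raw float32 storage only; index overhead comes on top):

```python
def raw_vector_storage(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw float32 vector storage in GB, excluding index overhead
    (HNSW adds roughly 1.5x to 2x on top of this)."""
    return n_vectors * dims * bytes_per_float / 1e9

for dims in (1536, 768, 384):
    gb = raw_vector_storage(1_000_000, dims)
    print(f"{dims}d: {gb:.1f} GB per million vectors")
# 1536d: 6.1 GB per million vectors
# 768d: 3.1 GB per million vectors
# 384d: 1.5 GB per million vectors
```

The per-vector figures above (6KB, 3KB, 1.5KB) are this same arithmetic divided through by a million.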
Replication factor. A single replica handles read traffic up to its limits. Two replicas double the bill. High availability often calls for three replicas across zones. We routinely see teams running three replicas on workloads where one would meet the SLO; the team set up HA by default without measuring whether the workload needed it.
Index type. HNSW (hierarchical navigable small world) is fast at query time but builds a memory-resident index roughly 1.5x to 2x the size of the raw vectors. IVF (inverted file index) trades query speed for lower memory. For latency-sensitive workloads HNSW is the right default; for cost-sensitive workloads with tolerable latency, IVF can cut memory usage substantially.
The three configuration levers that matter most
Lever 1: dimensionality. Most teams use the default dimensionality of whatever embedding model they picked first (typically 1536 from OpenAI). Almost nobody re-evaluates after the first month. Measuring recall@5 at multiple dimensionalities on a labelled question set takes an afternoon and almost always finds that lower dimensions are usable.
For a corpus we audited last quarter, recall@5 was:
| Dimensionality | Recall@5 | Storage per million vectors |
|---|---|---|
| 1536 | 0.91 | 6.0 GB |
| 1024 | 0.90 | 4.0 GB |
| 768 | 0.89 | 3.0 GB |
| 512 | 0.85 | 2.0 GB |
| 256 | 0.78 | 1.0 GB |
The team had been running 1536d “to be safe.” Moving to 768d cost them two points of recall (0.91 to 0.89) and saved 50% of storage. Combined with the lower query cost (vector comparisons scale with dimension count), the monthly bill dropped 45%.
The model used must support truncation cleanly. OpenAI’s text-embedding-3-large is Matryoshka-trained, which means it loses minimal quality when truncated. ada-002 is not, and truncation hurts more there.
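Running this measurement yourself is a short script once you have a labelled set of (query, relevant document) pairs. A sketch with synthetic vectors standing in for real embeddings; the one subtlety is that truncation must re-normalize, since cosine similarity assumes unit vectors. On this synthetic set recall stays near 1.0 at every truncation; the point is the harness, which you run against your real embeddings and labels:

```python
import numpy as np

def truncate(emb: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` components, re-normalize."""
    t = emb[:, :dims]
    return t / np.linalg.norm(t, axis=1, keepdims=True)

def recall_at_k(queries: np.ndarray, docs: np.ndarray,
                relevant: list[int], k: int = 5) -> float:
    """Fraction of queries whose labelled relevant doc lands in the top-k by cosine."""
    top_k = np.argsort(-(queries @ docs.T), axis=1)[:, :k]
    return sum(rel in row for rel, row in zip(relevant, top_k)) / len(relevant)

# Synthetic stand-in: each query is a noisy copy of its relevant doc.
rng = np.random.default_rng(0)
docs = truncate(rng.normal(size=(200, 1536)), 1536)      # truncate() also normalizes
queries = truncate(docs + 0.01 * rng.normal(size=docs.shape), 1536)
relevant = list(range(200))

for dims in (1536, 768, 512):
    r = recall_at_k(truncate(queries, dims), truncate(docs, dims), relevant)
    print(f"{dims}d: recall@5 = {r:.2f}")
```

Swap the synthetic block for your own query embeddings, document embeddings, and labelled relevant indices, and the loop produces the table above for your corpus.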
Lever 2: replication factor. A single replica is enough for many production workloads. Two replicas double cost. Three replicas (the typical HA default) triple it.
The honest test for replication is: what is your tolerated unavailability for the vector store specifically, and how often do you actually deploy or restart it? A vector store that is part of a customer-facing path probably needs two replicas. A vector store that backs an internal search tool probably needs one. A vector store that backs a batch job that runs nightly genuinely does not need any replication; failover can be a manual restart.
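The replica count can also be backed into from the availability target rather than defaulted. A toy model, assuming independent replica failures (optimistic, since correlated zone failures exist; the point is the shape of the curve, not the digits):

```python
def fleet_availability(replica_uptime: float, replicas: int) -> float:
    """Probability that at least one replica is serving,
    assuming failures are independent (they are not, entirely)."""
    return 1 - (1 - replica_uptime) ** replicas

for n in (1, 2, 3):
    print(f"{n} replica(s): {fleet_availability(0.999, n):.6f}")
# 1 replica(s): 0.999000
# 2 replica(s): 0.999999
# 3 replica(s): 1.000000
```

A single three-nines replica already covers many internal SLOs; the honest question is whether your availability requirement actually demands the second or third.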
Lever 3: hybrid versus dense-only retrieval. Hybrid search (BM25 + dense) typically lifts recall by 5 to 15 percent on heterogeneous corpora (mixed query types, exact-match needs alongside semantic). It also doubles the per-query cost: two index lookups instead of one, plus rank fusion.
The trade is worth it when:
- Queries include identifiers, product codes, or specific names where exact match matters
- The corpus has high lexical diversity
- A few percent recall gain is worth the cost increase
The trade is not worth it when:
- All queries are conversational (“how do I do X”)
- The corpus is internally consistent in language
- Dense-only already exceeds your recall target
We default to hybrid for general-purpose RAG and dense-only for narrow conversational corpora.
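The fusion step itself is cheap; the cost is the second index lookup. Engines vary in how they merge the two result lists, but reciprocal rank fusion is a common choice. A sketch:

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per document.
    k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked by both lists beats one ranked well by only one.
print(rrf_fuse(["a", "b", "c"], ["a", "c", "d"]))  # -> ['a', 'c', 'b', 'd']
```

Note how "c", which appears in both rankings, outscores "b", which appears only in the dense ranking at a better position; that agreement bonus is what hybrid buys.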
Self-hosted versus managed: where the line is
The most common over-spend pattern is paying for a managed vector store when self-hosted would meet the workload at a fraction of the cost.
The honest line we see in 2026:
| Workload | Self-hosted recommended | Managed recommended |
|---|---|---|
| < 1M vectors, low traffic | pgvector or single Qdrant container | Overkill |
| 1-10M vectors, low to moderate traffic | Qdrant on a $100-300/mo VM | Optional |
| 10-50M vectors, multi-region | Possible but operational cost rises | Often worth it |
| > 50M vectors, high QPS, multi-region | Possible if dedicated team | Strongly recommended |
Self-hosted costs are dominated by the VM bill plus engineering time to operate. Managed costs are dominated by service fees but include the operational layer. The crossover depends on what you value more: the cash bill or the engineer-week per quarter spent on backups, upgrades and incident response.
For a team without dedicated platform engineering, the operational cost of self-hosted is often higher than it looks. The team’s senior engineer spends two days a quarter on routine vector-store maintenance; if that engineer’s time is worth more than the managed service’s premium, managed wins on real cost even though self-hosted wins on cash cost.
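The crossover is worth writing down as arithmetic rather than arguing about. A sketch with the engineer-time term made explicit; the rates below are illustrative, not measured:

```python
def self_hosted_real_cost(vm_monthly: float, eng_days_per_quarter: float,
                          eng_day_rate: float) -> float:
    """Monthly real cost: cash VM bill plus amortized engineering time on ops."""
    return vm_monthly + eng_days_per_quarter * eng_day_rate / 3

# Illustrative numbers: $100/mo VM, 2 engineer-days/quarter at $1,200/day.
real = self_hosted_real_cost(100, 2, 1200)
print(f"self-hosted real cost: ${real:.0f}/mo")  # -> $900/mo
# If a managed tier for the same workload quotes under this figure, managed
# wins on real cost even though self-hosted wins on cash cost.
```
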
Operational gotchas we have seen
Index rebuilds during traffic. HNSW rebuilds are CPU-heavy and can spike query latency for hours. If your vector store is single-replica and you rebuild during business hours, you have an outage. Rebuild during off-peak or use a multi-replica setup with rolling rebuild.
Embedding drift on model updates. When a provider updates or replaces an embedding model (as OpenAI did when the text-embedding-3 family superseded ada-002), embeddings from the new model are not comparable to embeddings from the old. Migrating means re-embedding the entire corpus. For a 10M-vector corpus, that is a meaningful cost in both money and time. Plan it before the provider deprecates the old model on you.
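The re-embedding bill is worth estimating before the deprecation notice arrives. Back-of-envelope arithmetic; the chunk size and per-token rate below are placeholders, so substitute your own corpus stats and your provider's current pricing:

```python
def reembed_cost_usd(n_vectors: int, avg_tokens_per_chunk: float,
                     usd_per_million_tokens: float) -> float:
    """Embedding-API cost to re-embed a corpus; ignores pipeline time
    and the storage churn of writing the new vectors."""
    return n_vectors * avg_tokens_per_chunk / 1e6 * usd_per_million_tokens

# Placeholder inputs: 10M chunks averaging 500 tokens, at $0.13/M tokens.
print(f"${reembed_cost_usd(10_000_000, 500, 0.13):,.0f}")  # -> $650
```

The API bill is often the smaller half; rate limits make the wall-clock time to push 5 billion tokens through an embedding endpoint the part that actually hurts.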
Metadata filter explosion. Vector queries with rich metadata filtering (where category = 'X' AND tags CONTAINS 'Y' AND date > 'Z') interact badly with HNSW. The HNSW graph traversal does not natively respect filters; the engine either applies filters post-graph (which can return fewer than K results) or pre-filters (which can be slow on large filter sets). Test with realistic filter loads, not with empty filters.
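The post-filter failure mode is easy to reproduce. A sketch of post-filtering over a brute-force top-k (real engines traverse the HNSW graph instead of sorting all similarities, but the under-filling behaviour is the same):

```python
import numpy as np

def post_filtered_search(query, vectors, metadata, predicate, k=5, overfetch=3):
    """Fetch the top-(k * overfetch) by similarity, then apply the filter.
    When the filter is selective, fewer than k results come back."""
    sims = vectors @ query
    candidates = np.argsort(-sims)[: k * overfetch]
    return [int(i) for i in candidates if predicate(metadata[i])][:k]

rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
metadata = ["common"] * 1000
metadata[7] = "rare"                      # only one doc passes the filter

hits = post_filtered_search(vectors[7], vectors, metadata,
                            lambda m: m == "rare", k=5)
print(hits)  # -> [7]  (asked for 5, got 1)
```

Overfetching papers over mildly selective filters; highly selective filters need the engine's pre-filtering path, which is exactly the path you must load-test.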
Cost surprises on scaling traffic. Pay-per-query managed pricing scales linearly with traffic. A successful product feature that drives 10x query volume drives 10x bill. Set alerts on monthly spend, not just on volume.
The honest workflow
Before picking any vector store:
- Measure expected vector count at year one
- Measure expected query rate, peak and sustained
- Set a target tail latency
- Set a recall target on a labelled question set
- Test embedding dimensionality at 1536, 768, and 512 against the recall target
- Decide replication based on availability requirement, not on default
- Pick the cheapest setup that meets the four constraints
Most teams skip steps 1 through 4 and go straight to step 7 with default values. The result is a bill that is correct for someone else’s workload.
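The last step is mechanical once the first six produce numbers. A sketch of the selection; the setup records here are hypothetical, and in practice you fill them in from your own benchmarks and vendor quotes:

```python
def pick_cheapest(setups, *, max_p99_ms, min_recall, peak_qps, n_vectors):
    """Cheapest setup meeting all four measured constraints, or None."""
    viable = [s for s in setups
              if s["p99_ms"] <= max_p99_ms
              and s["recall_at_5"] >= min_recall
              and s["max_qps"] >= peak_qps
              and s["max_vectors"] >= n_vectors]
    return min(viable, key=lambda s: s["usd_monthly"], default=None)

setups = [  # hypothetical benchmark results
    {"name": "pgvector", "p99_ms": 90, "recall_at_5": 0.89,
     "max_qps": 50, "max_vectors": 5_000_000, "usd_monthly": 0},
    {"name": "managed-serverless", "p99_ms": 60, "recall_at_5": 0.91,
     "max_qps": 500, "max_vectors": 50_000_000, "usd_monthly": 190},
]
best = pick_cheapest(setups, max_p99_ms=100, min_recall=0.85,
                     peak_qps=10, n_vectors=1_000_000)
print(best["name"])  # -> pgvector
```

The interesting output is not which row wins but how often the winner is the cheap row once the constraints are real numbers instead of defaults.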
The configurations that win are not exotic. They are the workload measured honestly, and the configuration matched to the measurement. The teams that do this once at sizing time pay the right bill for the entire lifetime of the system. The teams that do not, pay 5x to 10x of the right bill until somebody notices and the audit happens, usually quarters later.
Questions teams ask
Should I self-host or use a managed service?
Below 5 million vectors and a single replica, self-hosted Qdrant or pgvector on a small managed VM is cheaper. Above 10 million vectors with multi-region replication and high QPS, managed services (Pinecone, Weaviate Cloud, Qdrant Cloud) save more in operational cost than they cost in service fees.
Is pgvector production-ready?
Yes for moderate scale. pgvector with HNSW indexing handles tens of millions of vectors with acceptable query latency. Above that, dedicated vector engines (Qdrant, Weaviate) outperform on tail latency. The operational simplicity of staying inside your existing Postgres is real and worth a tier of scale.
How do dimensionality reduction techniques affect quality?
Matryoshka embeddings (where the model is trained to be useful at multiple truncation lengths) lose roughly 1 to 3 percent recall when truncated from 1536 to 768 dimensions. PCA-based reduction loses 5 to 10 percent. Use Matryoshka-trained embeddings if cost matters and you can afford a small recall trade.