Context Window Economics: The Hidden Bill on RAG
- A typical RAG call spends 60 to 80 percent of its tokens on retrieved context that the answer ignored.
- Cross-encoder reranking cuts the context you send from a dozen chunks to roughly 3, with no quality loss.
- Long-context models do not solve this problem. They make it more expensive.
- Per-prompt token budgets are an operational discipline, not a model feature.
A team running a RAG system in production was sending 12 retrieved chunks per call to a long-context model. Each chunk averaged 800 tokens. The system prompt was another 2,500 tokens. Per-call input cost was 12,100 tokens, output was around 400 tokens. With pricing at the time, each call cost roughly 4.2 cents. They were doing 1.4 million queries per month; the LLM bill was 58,800 dollars per month.
We measured how much of the retrieved context the model actually used in producing the answer. A representative sample of 200 calls showed: on average, 2.4 chunks contained the information that ended up in the answer. The other 9.6 chunks were noise, paid for and ignored.
We installed cross-encoder reranking that selected the top 3 chunks from a top-50 retrieval. Per-call input dropped to 4,900 tokens. Quality measured by their eval suite improved by 4 percent (less context dilution). Monthly cost dropped to 23,200 dollars. Same product, same retrieval index, same LLM, 60 percent off the bill.
This piece is about that discipline: context-window economics in RAG and the patterns that keep the bill predictable as usage scales.
The math, briefly
Every input token costs money. A long-context call multiplies the per-call cost by however many tokens you stuff in. Context window size is not free capacity; it is a metered resource priced per use.
A typical RAG architecture has three sources of input tokens:
- System prompt and tool list: stable across calls, cacheable.
- Retrieved context: variable per call, scaled by chunk count and chunk size.
- User message and chat history: variable, usually small.
Of these, the retrieved context is by far the largest variable cost driver. A team that sends “more context to be safe” pays linearly for the safety margin. A team that sends “the right context” pays for what was useful.
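The linearity is easy to see in a small cost model. A sketch, assuming a flat $3 per million input tokens for illustration (the actual rate depends on your model and provider):

```python
# Rough per-call input cost model for a RAG prompt.
# PRICE_PER_INPUT_TOKEN is an assumption for illustration;
# substitute your provider's actual rate.
PRICE_PER_INPUT_TOKEN = 3.0 / 1_000_000  # dollars

def input_cost(system_tokens, chunk_count, tokens_per_chunk, user_tokens):
    """Dollars of input spend for one call."""
    total = system_tokens + chunk_count * tokens_per_chunk + user_tokens
    return total * PRICE_PER_INPUT_TOKEN

# Naive: 2,500-token system prompt plus 12 chunks of 800 tokens.
naive = input_cost(2500, 12, 800, 100)

# Reranked: same system prompt, 3 chunks of 600 tokens.
reranked = input_cost(2500, 3, 600, 100)
```

Every extra chunk "to be safe" adds its full token count to every call, which is exactly the linear safety margin described above.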
| Setting | Input tokens / call | Cost / 1M calls (Sonnet 4.6) |
|---|---|---|
| Naive: system + 12 chunks @ 800 tok | 12,100 | $36,300 |
| Reranked: system + 3 chunks @ 600 tok (compressed) | 4,300 | $12,900 |
| Aggressive: system + 2 chunks @ 400 tok (top-2 + summarisation) | 3,300 | $9,900 |
Same product, same answers (when measured against the eval suite), 3.7x cost ratio between naive and aggressive.
The first lever: rerank before you send
Vector search returns chunks ranked by approximate similarity. The top result is often the right answer; the 12th result is usually noise. But vector similarity is approximate. The “right” chunk might be at position 8, and the team’s instinct is to send all 12 to be safe.
Cross-encoder reranking solves this. The reranker is a small model (a BERT-class encoder, typically 100M to 400M parameters) that takes the query and a chunk together and produces a relevance score. The reranker is much more accurate than the vector similarity but cannot be used for the initial retrieval (too slow to score 1M chunks). The two-step pattern: vector retrieval to top-K (50 to 100), reranker to final top-N (3 to 5).
```python
# Two-step retrieval pattern: broad vector recall, precise rerank.
candidates = vector_store.search(query_embedding, top_k=50)
scored = reranker.score(query, [c.text for c in candidates])
top_chunks = sorted(zip(candidates, scored), key=lambda x: -x[1])[:3]
context = "\n---\n".join(c.text for c, _ in top_chunks)
```
The reranker adds 50 to 200 ms per call (depending on hardware and number of candidates). The savings on input tokens to the LLM dwarf the reranker cost in almost all cases.
The second lever: truncate the chunks
Even after reranking, individual chunks may contain irrelevant prose. A 1500-word document chunk might have one paragraph that answers the question; the other six paragraphs are filler that costs tokens.
Three approaches:
Smaller chunk size at indexing time. Index the corpus in 200-token chunks instead of 800-token chunks. Retrieval becomes more precise; less padding per chunk. Trade-off: you may lose context that spans chunk boundaries. Hybrid: 400-token chunks with 100-token overlap is a common middle ground.
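One way to implement that overlap scheme at indexing time. A minimal sketch operating on an already-tokenised document (a list of token IDs or words; the tokeniser would be whatever your embedding model uses):

```python
def chunk_with_overlap(tokens, chunk_size=400, overlap=100):
    """Split a token list into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk, so content near a
    boundary appears in two chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end
    return chunks
```

A 1,000-token document yields three chunks under these defaults, with the last 100 tokens of each chunk repeated at the start of the next.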
Sentence-level filtering after retrieval. A small model (or even keyword matching) selects only the sentences in each chunk that are relevant to the query. Cuts chunk size by 50 to 80 percent on average. Adds 20 to 100 ms.
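The cheapest version of that filter is plain keyword overlap. A crude sketch (the stopword list and threshold are illustrative; a small encoder model does this job considerably better):

```python
import re

def filter_sentences(query, chunk_text, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` content
    words with the query. Keyword-matching baseline, not a model."""
    stop = {"the", "a", "an", "is", "of", "to", "and", "in", "on",
            "for", "what", "how", "our"}
    query_words = set(re.findall(r"[a-z]+", query.lower())) - stop
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", chunk_text):
        words = set(re.findall(r"[a-z]+", sentence.lower())) - stop
        if len(words & query_words) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)
```

On a chunk where one sentence of three mentions the query terms, this drops the other two before they reach the LLM.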
LLM-based summarisation. A cheap LLM call summarises the chunk into a 1-2 sentence relevant excerpt. The most aggressive option; adds another LLM call and its associated latency. Worth it for very long chunks or when the downstream LLM is expensive.
The order to try them: index smaller chunks first (architectural fix). If that is not enough, add sentence filtering. Reserve summarisation for the corner cases where chunks are unavoidably long.
The third lever: per-prompt token budgets
A discipline more than a technical mechanism. For each production prompt, set an explicit input-token budget. Enforce it with code: if the assembled context exceeds the budget, truncate or refuse. Surface the budget in observability so violations are visible.
Without a budget, context grows quietly. A team adds three more chunks “just in case”, a developer extends the system prompt by 800 tokens for a new feature, a refactor inflates the per-call payload. None of these are visible per-commit; the bill creeps up over months.
With a budget, every change to the prompt or retrieval is forced to ask "does this fit". The conversation shifts from "should we add this" to "what do we remove to add this", a much healthier question.
A working budget pattern:
```python
PROMPT_BUDGETS = {
    "customer-summary": 4000,   # tokens
    "research-assistant": 8000,
    "code-completion": 2500,
}

def assemble_context(prompt_id, system, chunks, user_message):
    budget = PROMPT_BUDGETS[prompt_id]
    base = count_tokens(system) + count_tokens(user_message)
    chunk_budget = budget - base
    selected = []
    used = 0
    for chunk in chunks:  # already reranked, best first
        chunk_size = count_tokens(chunk)
        if used + chunk_size > chunk_budget:
            break
        selected.append(chunk)
        used += chunk_size
    return "\n\n".join([system, "\n---\n".join(selected), user_message])
The budget per prompt is set at design time and revisited quarterly.
Where long-context models change the calculus
Long-context models (200k, 1M, 2M token windows) do not eliminate context economics. They change the trade-offs:
Pro: Less aggressive truncation. You can include more chunks if quality requires. Some workflows (whole-document analysis, multi-document comparison) become possible that were not before.
Con: The cost per token is the same or higher than the standard-context version of the same model. Filling a 200k window costs 200k worth of input tokens. Long-context is a capability, not a discount.
Subtle con: Quality degrades past a certain context length even on long-context models. Most studies in 2026 show meaningful retrieval degradation past 30k to 50k tokens of in-context content. “Lost in the middle” is real; the model attends less to content in the middle of a long context.
The right use of long-context models: when the use case genuinely needs them (whole-document reasoning) and the quality benefit justifies the cost. The wrong use: as a substitute for retrieval discipline.
What this looks like in production
A well-tuned RAG system in 2026 typically sends:
- 1 system prompt (cached, 3k to 6k tokens)
- 1 tool list if applicable (cached, 1k to 3k tokens)
- 2 to 4 reranked, possibly truncated chunks (1k to 3k tokens uncached)
- The user message (small)
Total per call: under 10k tokens of input on average. Output: 200 to 800 tokens depending on use case.
Compare to a naive RAG system: 30k to 50k tokens per call. The production-tuned version is 3 to 5 times cheaper for the same answers.
What we install on engagements
Standard discipline:
- Profile current input-token spend per prompt (one day)
- Install cross-encoder reranking (one to two engineer-days)
- Add per-prompt token budgets enforced in code (one day)
- Add observability on tokens-per-call distribution (one day)
- Quarterly re-audit (process)
Total: roughly one engineer-week. Pays back in the first month for any team with significant RAG traffic.
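The observability step can start as a simple in-process histogram before it graduates to a metrics backend. A minimal sketch (the prompt IDs and percentile choices are illustrative):

```python
import statistics
from collections import defaultdict

token_log = defaultdict(list)  # prompt_id -> input token counts per call

def record_call(prompt_id, input_tokens):
    token_log[prompt_id].append(input_tokens)

def tokens_per_call_report(prompt_id):
    """p50/p95/max of input tokens: the numbers budget creep shows up in."""
    counts = sorted(token_log[prompt_id])
    return {
        "p50": counts[len(counts) // 2],
        "p95": counts[min(int(len(counts) * 0.95), len(counts) - 1)],
        "max": counts[-1],
        "mean": statistics.mean(counts),
    }
```

A p95 drifting above the prompt's budget is the early warning that "just in case" chunks are creeping back in.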
Context economics is invisible until the bill surfaces it. The teams that engineer for it early pay sensible amounts as they scale. The teams that ignore it produce invoice surprises that turn into emergency cost-cutting projects six months later. The work is small. The lesson is consistent.
Questions teams ask
Doesn't a long-context model just fix this?
No. Long-context models accept more tokens, but you pay for every one of them. A 200k-token call to a long-context model is more expensive than a focused 8k-token call to the same model. Context length is a capability, not a cost saver.
What ranking method is best?
Cross-encoder reranking (a small BERT-class model that takes query and chunk together) is the highest-precision option. It costs 5 to 50 ms per chunk depending on hardware. For most production RAG, rerank top-50 from retrieval down to top-3 to top-5 for the LLM context.
Should I summarise retrieved chunks before sending?
Sometimes. Summarisation costs an extra LLM call, which is not free. Worth it when the chunks are long (over 500 tokens each) and the question is narrow. Not worth it when chunks are short or when the LLM call is already cheap.