Context Window Economics: The Hidden Bill on RAG

LLM Cost Engineering · context-window, rag

Stuffing context costs money on every call, even for tokens the model ignores. This is the discipline that keeps RAG context relevant, ranked, and compressed.

  • By Orzed Team
  • 6 min read
Key takeaways
  • A typical RAG call spends 60 to 80 percent of its tokens on retrieved context that the answer ignored.
  • Cross-encoder reranking cuts the context sent to the model to roughly 3 chunks, with no quality loss.
  • Long-context models do not solve this problem. They make it more expensive.
  • Per-prompt token budgets are an operational discipline, not a model feature.

A team running a RAG system in production was sending 12 retrieved chunks per call to a long-context model. Each chunk averaged 800 tokens. The system prompt was another 2,500 tokens. Per-call input cost was 12,100 tokens, output was around 400 tokens. With pricing at the time, each call cost roughly 4.2 cents. They were doing 1.4 million queries per month; the LLM bill was 58,800 dollars per month.

We measured how much of the retrieved context the model actually used in producing the answer. A representative sample of 200 calls showed: on average, 2.4 chunks contained the information that ended up in the answer. The other 9.6 chunks were noise, paid for and ignored.

We installed cross-encoder reranking that selected the top 3 chunks from a top-50 retrieval. Per-call input dropped to 4,800 tokens. Quality measured by their eval suite improved by 4 percent (less context dilution). Monthly cost dropped to 23,200 dollars. Same product, same retrieval index, same LLM, half the bill.

This piece is about that discipline: context-window economics in RAG and the patterns that keep the bill predictable as usage scales.

The math, briefly

Every input token costs money. A long-context call multiplies the per-call cost by however many tokens you stuff in. Context window size is not free capacity; it is a metered resource priced per use.

A typical RAG architecture has three sources of input tokens:

  1. System prompt and tool list: stable across calls, cacheable.
  2. Retrieved context: variable per call, scaled by chunk count and chunk size.
  3. User message and chat history: variable, usually small.

Of these, the retrieved context is by far the largest variable cost driver. A team that sends “more context to be safe” pays linearly for the safety margin. A team that sends “the right context” pays for what was useful.

| Setting | Input tokens / call | Cost / 1M calls (Sonnet 4.6) |
| --- | --- | --- |
| Naive: system + 12 chunks @ 800 tok | 12,100 | $36,300 |
| Reranked: system + 3 chunks @ 600 tok (compressed) | 4,300 | $12,900 |
| Aggressive: system + 2 chunks @ 400 tok (top-2 + summarisation) | 3,300 | $9,900 |

Same product, same answers (when measured against the eval suite), 3.7x cost ratio between naive and aggressive.
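The cost column in the table is just input tokens times price. A quick sketch of the arithmetic, assuming a flat $3 per million input tokens (an illustrative Sonnet-class rate, not a quoted price list):

```python
# Back-of-envelope for the cost table. PRICE_PER_MTOK is an assumed
# input rate in $/1M tokens; swap in your provider's actual pricing.
PRICE_PER_MTOK = 3.00

def cost_per_million_calls(input_tokens_per_call: int) -> float:
    # dollars per 1M calls = (tokens/call * 1M calls) * ($/1M tokens)
    # the two "per million" factors cancel, leaving tokens * rate.
    return input_tokens_per_call * PRICE_PER_MTOK

for label, tokens in [("naive", 12_100), ("reranked", 4_300), ("aggressive", 3_300)]:
    print(f"{label}: ${cost_per_million_calls(tokens):,.0f} per 1M calls")
```

Run against the three settings above, this reproduces the $36,300 / $12,900 / $9,900 column.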

The first lever: rerank before you send

Vector search returns chunks ranked by approximate similarity. The top result is often the right answer; the 12th result is usually noise. But vector similarity is approximate. The “right” chunk might be at position 8, and the team’s instinct is to send all 12 to be safe.

Cross-encoder reranking solves this. The reranker is a small model (a BERT-class encoder, typically 100M to 400M parameters) that takes the query and a chunk together and produces a relevance score. The reranker is much more accurate than the vector similarity but cannot be used for the initial retrieval (too slow to score 1M chunks). The two-step pattern: vector retrieval to top-K (50 to 100), reranker to final top-N (3 to 5).

# Two-step retrieval: cheap vector search for recall, cross-encoder for precision
candidates = vector_store.search(query_embedding, top_k=50)    # approximate recall
scores = reranker.score(query, [c.text for c in candidates])   # precise relevance
top_chunks = sorted(zip(candidates, scores), key=lambda x: -x[1])[:3]
context = "\n---\n".join(c.text for c, _ in top_chunks)

The reranker adds 50 to 200 ms per call (depending on hardware and number of candidates). The savings on input tokens to the LLM dwarf the reranker cost in almost all cases.
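The "dwarfs the reranker cost" claim is easy to check with a break-even sketch. Both constants here are illustrative assumptions (a $3/Mtok input rate and a rough amortised compute cost for the reranker), not measured figures:

```python
# Net dollars saved per call by reranking before the LLM call.
# Both constants are illustrative assumptions, not vendor quotes.
INPUT_PRICE_PER_MTOK = 3.00       # assumed $/1M input tokens
RERANKER_COST_PER_CALL = 0.0001   # assumed amortised compute cost per rerank

def savings_per_call(naive_tokens: int, reranked_tokens: int) -> float:
    """Token savings from sending fewer chunks, minus the reranker's cost."""
    token_savings = (naive_tokens - reranked_tokens) * INPUT_PRICE_PER_MTOK / 1e6
    return token_savings - RERANKER_COST_PER_CALL

# Figures from the case study above: 12,100 -> 4,300 input tokens per call.
print(f"${savings_per_call(12_100, 4_300):.4f} saved per call")
```

At these assumed rates the reranker pays for itself many times over on every single call; the latency it adds is the real trade-off to weigh, not the compute cost.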

The second lever: truncate the chunks

Even after reranking, individual chunks may contain irrelevant prose. A 1500-word document chunk might have one paragraph that answers the question; the other six paragraphs are filler that costs tokens.

Three approaches:

Smaller chunk size at indexing time. Index the corpus in 200-token chunks instead of 800-token chunks. Retrieval becomes more precise; less padding per chunk. Trade-off: you may lose context that spans chunk boundaries. Hybrid: 400-token chunks with 100-token overlap is a common middle ground.
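The hybrid above can be sketched as a sliding window over the token stream. A minimal sketch: `tokens` is any pre-tokenised list, and the window advances by `size - overlap` so content spanning a boundary appears whole in at least one chunk:

```python
def chunk_tokens(tokens, size=400, overlap=100):
    """Sliding-window chunking: fixed-size windows with overlap so that
    content crossing a chunk boundary is intact in at least one chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

With the defaults, a 1,000-token document yields windows starting at 0, 300, and 600, and the last window runs to the final token.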

Sentence-level filtering after retrieval. A small model (or even keyword matching) selects only the sentences in each chunk that are relevant to the query. Cuts chunk size by 50 to 80 percent on average. Adds 20 to 100 ms.
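The keyword-matching variant of this filter fits in a few lines. A crude sketch standing in for a small scoring model: keep only sentences that share at least one content word with the query (all names here are illustrative):

```python
import re

def filter_sentences(chunk: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least `min_overlap` content words
    (length > 3) with the query. A crude keyword stand-in for a small
    relevance model; real deployments would use a learned scorer."""
    query_terms = {w.lower() for w in re.findall(r"\w+", query) if len(w) > 3}
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    kept = [
        s for s in sentences
        if len(query_terms & {w.lower() for w in re.findall(r"\w+", s)}) >= min_overlap
    ]
    return " ".join(kept)
```

Even this naive version drops filler sentences that share no vocabulary with the query; a learned sentence scorer does the same job with better recall.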

LLM-based summarisation. A cheap LLM call summarises the chunk into a 1-2 sentence relevant excerpt. The most aggressive option; adds another LLM call and its associated latency. Worth it for very long chunks or when the downstream LLM is expensive.

The order to try them: index smaller chunks first (architectural fix). If that is not enough, add sentence filtering. Reserve summarisation for the corner cases where chunks are unavoidably long.

The third lever: per-prompt token budgets

A discipline more than a technical mechanism. For each production prompt, set an explicit input-token budget. Enforce it with code: if the assembled context exceeds the budget, truncate or refuse. Surface the budget in observability so violations are visible.

Without a budget, context grows quietly. A team adds three more chunks “just in case”, a developer extends the system prompt by 800 tokens for a new feature, a refactor inflates the per-call payload. None of these are visible per-commit; the bill creeps up over months.

With a budget, every change to the prompt or retrieval is forced to ask “does this fit”. The conversation shifts from “should we add this” to “what do we remove to add this”, which is a much healthier default.

A working budget pattern:

PROMPT_BUDGETS = {
    "customer-summary": 4000,         # tokens
    "research-assistant": 8000,
    "code-completion": 2500,
}

def assemble_context(prompt_id, system, chunks, user_message):
    budget = PROMPT_BUDGETS[prompt_id]
    base = count_tokens(system) + count_tokens(user_message)
    chunk_budget = budget - base
    selected = []
    used = 0
    for chunk in chunks:  # already reranked, best first
        chunk_size = count_tokens(chunk)
        if used + chunk_size > chunk_budget:
            break  # greedy fill: stop at the first chunk that would overflow
        selected.append(chunk)
        used += chunk_size
    return "\n\n".join([system, *selected, user_message])

The budget per prompt is set at design time and revisited quarterly.
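The "surface the budget in observability" half of the discipline is a small companion to the assembler above. A sketch, with hypothetical names (`within_budget`, the `violations` counter) standing in for whatever your metrics pipeline uses:

```python
import logging
from collections import Counter

logger = logging.getLogger("prompt-budget")
violations = Counter()  # per-prompt violation counts, exported to dashboards

def within_budget(prompt_id: str, used_tokens: int, budgets: dict) -> bool:
    """Return True if the call fits its budget; log and count violations
    so quiet context growth shows up on a dashboard, not on the invoice."""
    budget = budgets[prompt_id]
    if used_tokens > budget:
        violations[prompt_id] += 1
        logger.warning("budget exceeded: %s used %d of %d tokens",
                       prompt_id, used_tokens, budget)
        return False
    return True
```

The counter is the point: a per-prompt violation rate that creeps up over weeks is exactly the "three more chunks just in case" drift described above, caught before the bill arrives.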

Where long-context models change the calculus

Long-context models (200k, 1M, 2M token windows) do not eliminate context economics. They change the trade-offs:

Pro: Less aggressive truncation. You can include more chunks if quality requires. Some workflows (whole-document analysis, multi-document comparison) become possible that were not before.

Con: The cost per token is the same or higher than the standard-context version of the same model. Filling a 200k window costs 200k worth of input tokens. Long-context is a capability, not a discount.

Subtle con: Quality degrades past a certain context length even on long-context models. Most studies in 2026 show meaningful retrieval degradation past 30k to 50k tokens of in-context content. “Lost in the middle” is real; the model attends less to content in the middle of a long context.

The right use of long-context models: when the use case genuinely needs them (whole-document reasoning) and the quality benefit justifies the cost. The wrong use: as a substitute for retrieval discipline.

What this looks like in production

A well-tuned RAG system in 2026 typically sends:

  • 1 system prompt (cached, 3k to 6k tokens)
  • 1 tool list if applicable (cached, 1k to 3k tokens)
  • 2 to 4 reranked, possibly truncated chunks (1k to 3k tokens uncached)
  • The user message (small)

Total per call: under 10k tokens of input on average. Output: 200 to 800 tokens depending on use case.

Compare to a naive RAG system: 30k to 50k tokens per call. The production-tuned version is 3 to 5 times cheaper for the same answers.

What we install on engagements

Standard discipline:

  1. Profile current input-token spend per prompt (one day)
  2. Install cross-encoder reranking (one to two engineer-days)
  3. Add per-prompt token budgets enforced in code (one day)
  4. Add observability on tokens-per-call distribution (one day)
  5. Quarterly re-audit (process)

Total: roughly one engineer-week. Pays back in the first month for any team with significant RAG traffic.

Context economics is invisible until the bill surfaces it. The teams that engineer for it early pay sensible amounts as they scale. The teams that ignore it produce invoice surprises that turn into emergency cost-cutting projects six months later. The work is small. The lesson is consistent.

Frequently asked questions

Doesn't a long-context model just fix this?

No. Long-context models accept more tokens, but you pay for every one of them. A 200k-token call to a long-context model is more expensive than a focused 8k-token call to the same model. Context length is a capability, not a cost saver.

What ranking method is best?

Cross-encoder reranking (a small BERT-class model that takes query and chunk together) is the highest-precision option. It costs 5 to 50 ms per chunk depending on hardware. For most production RAG, rerank top-50 from retrieval down to top-3 to top-5 for the LLM context.

Should I summarise retrieved chunks before sending?

Sometimes. Summarisation costs an extra LLM call, which is not free. Worth it when the chunks are long (over 500 tokens each) and the question is narrow. Not worth it when chunks are short or when the LLM call is already cheap.