Prompt Caching: Where It Pays Back and Where It Does Not
- Cached input tokens cost as little as 10 percent of the uncached rate on major providers in 2026; the exact discount varies by provider.
- Cache hits require the prompt prefix to be byte-identical. Even a timestamp in the prefix breaks the cache.
- RAG context is a natural cache target if the system prompt and tool list are stable across calls.
- Cache TTL is short (5 minutes typical). Burst patterns benefit; cold-start patterns do not.
A team we audited had been running a customer-support assistant for fourteen months. They were spending roughly 8,000 dollars per month on input tokens. The system prompt was 4,200 tokens, the tool list added 1,800 tokens, and the user-specific context added another 600 to 2,000 tokens per call. Every single one of those calls computed attention over the full prompt from scratch, on every request. Caching had never been enabled.
We added cache-control markers in two hours. The next day, projected input-token cost dropped from 8,000 dollars per month to roughly 1,400 dollars per month. The output-token cost did not change (output is never cached), but it was already a small fraction of the bill. Net monthly saving: roughly 6,600 dollars on a feature that took two engineer-hours to fix.
This piece is about that pattern: when prompt caching pays back, when it does not, and the configuration mistakes that quietly leave the discount on the table.
How prompt caching works
Modern LLM providers offer a tiered pricing model: input tokens are billed at full rate the first time the provider sees them, then at a discounted rate (typically 10 percent of full) on subsequent calls within a TTL window when the same prefix appears.
The caching is server-side. The provider hashes the prefix, stores the computed attention state, and reuses it on cache hits. From the application’s perspective, the input is the same; only the bill changes.
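To make the arithmetic concrete, here is a sketch of the input cost for one burst of traffic, with and without caching. The per-token rate and the 1.25x write / 0.10x read multipliers are illustrative Anthropic-style numbers, not a quote of current pricing:

```python
# Illustrative cost arithmetic for one TTL window. The rate and the
# multipliers below are assumptions; check your provider's pricing page.
BASE_INPUT = 3.00 / 1_000_000    # dollars per input token (assumed rate)
CACHE_WRITE = 1.25 * BASE_INPUT  # first call: prefix is written to cache
CACHE_READ = 0.10 * BASE_INPUT   # later calls within the TTL: cache hits

prefix_tokens = 6_000            # stable system prompt + tool list
calls = 100                      # calls landing inside the TTL window

uncached = calls * prefix_tokens * BASE_INPUT
cached = prefix_tokens * CACHE_WRITE + (calls - 1) * prefix_tokens * CACHE_READ

print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
# → uncached: $1.80  cached: $0.20
```

Under these assumptions the cached burst costs roughly a ninth of the uncached one, which is the shape of the saving in the audit above.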
Approximate pricing at four major providers in 2026:
| Provider | Cache write cost | Cache read cost | TTL |
|---|---|---|---|
| Anthropic Claude | 1.25x base input cost | 0.10x base input cost | 5 minutes (sliding) |
| OpenAI | base input cost | 0.50x base input cost | indeterminate (best-effort) |
| AWS Bedrock | varies by model | varies by model | varies |
| Google Vertex | base input cost | 0.25x base input cost | 60 minutes |
The exact numbers shift; the principle is consistent. Cache hits cost a fraction of cache misses.
What earns the discount
The cache key is the prefix of the prompt up to the cache-control marker. For a hit, the prefix must be byte-identical to a recent call. The pattern that maximises hits:
```
[Stable system prompt]            ← cache-controlled, hits often
[Stable tool list]                ← cache-controlled, hits often
[Stable retrieval context, if any]← optional, depends on use case
[CACHE BREAKPOINT]
[Per-call user message]           ← changes every call, never cached
[Per-call retrieved documents]    ← changes per call
```
The stable parts go first. The cache-control marker comes after them. The volatile parts go after the marker. Anthropic’s API uses an explicit cache_control field on the message; OpenAI uses automatic prefix matching on consecutive identical prefixes.
```python
# Anthropic-style explicit cache marker
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 4,200 tokens, stable across calls
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_message},
    ],
)
```
For the cache to hit, the next call within the TTL must use the same model, the same exact system prompt text, and the same tool list. If anything in the cached prefix changes, the next call is a cache write (1.25x base cost) instead of a hit.
What breaks the cache silently
Several patterns we have caught in audits that defeated caching without the team realising:
Timestamp in the system prompt. A team had `Today is {current_date}.` in the system prompt, so every call had a different prefix. Cache hit rate: zero. Fix: move the timestamp into the user message, after the cache marker.
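A minimal sketch of the fix; the prompt text and helper name are illustrative, not from the audited system:

```python
from datetime import date

# Stable, cacheable prefix: no volatile values anywhere in it.
SYSTEM_PROMPT = "You are a support assistant for Acme."  # illustrative text

def build_user_message(question: str) -> str:
    # The date is volatile, so it belongs after the cache breakpoint,
    # in the per-call user message.
    return f"Today is {date.today().isoformat()}.\n\n{question}"
```

The system prompt stays byte-identical across calls; only the user message changes.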
User-specific personalisation in the prefix. “You are an assistant for {user_name}.” Each user has a different prefix. Cache hits only repeat for the same user. Fix: pass user info as part of the per-call message, not in the system prompt.
Tool list permutation. One team's framework had been sorting tools by name on each call, but a refactor changed the sort to "most recently used", randomising the order across calls. Each permutation was a different prefix. Fix: a stable sort, ideally alphabetical.
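A sketch of the fix, combined with an Anthropic-style `cache_control` marker on the last tool so the tool list joins the cached prefix; the tool definitions are made up for illustration:

```python
# Hypothetical tool definitions; real ones carry full JSON schemas.
TOOL_DEFS = [
    {"name": "search_orders", "description": "Find orders by id.",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "issue_refund", "description": "Refund an order.",
     "input_schema": {"type": "object", "properties": {}}},
]

def cached_tools(defs):
    """Return tools in a deterministic order, cache marker on the last one."""
    tools = sorted((dict(t) for t in defs), key=lambda t: t["name"])
    tools[-1]["cache_control"] = {"type": "ephemeral"}  # caches the whole list
    return tools
```

Because the sort is deterministic, the serialized tool list is byte-identical on every call regardless of usage patterns.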
Model upgrade. Caches are per-model. Upgrading from claude-sonnet-4-5 to claude-sonnet-4-6 invalidates all caches. Plan for the cold-start cost on upgrade days.
TTL expiry under low traffic. The cache TTL is short (5 minutes on Anthropic, typically). Low-traffic periods miss the cache because too much time passed between calls. Fix: keep-alive calls during expected gaps if traffic is bursty.
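The keep-alive schedule reduces to one number: when the next ping must fire. A sketch, with the 5-minute TTL and the safety margin as tunable assumptions:

```python
def next_ping_due(last_call_ts: float, ttl_s: float = 300.0,
                  margin_s: float = 60.0) -> float:
    """Latest timestamp at which a keep-alive request should fire
    to re-touch the cached prefix before the TTL lapses.

    ttl_s and margin_s are assumptions: a 5-minute TTL with a
    one-minute safety margin. Tune both per provider.
    """
    return last_call_ts + ttl_s - margin_s
```

On sliding-TTL providers, a minimal request that reuses the cached prefix counts as a hit and resets the window, at the cache-read rate rather than the write rate.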
Inconsistent whitespace. A formatter that sometimes added a trailing newline and sometimes did not. Each version was a different prefix. Cache hits halved. Fix: deterministic prefix construction, ideally as a constant string.
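That last fix can be enforced by building the prefix once, as a constant, and logging a hash of it so any drift is visible in monitoring. A sketch with illustrative prompt text:

```python
import hashlib

# Built once at import time: no per-call string assembly, no
# formatter-dependent trailing newline.
SYSTEM_PROMPT = (
    "You are a support assistant for Acme.\n"
    "Answer only from the product documentation."
)

# Log this at startup; if it changes between deploys, the cache went cold.
PREFIX_HASH = hashlib.sha256(SYSTEM_PROMPT.encode()).hexdigest()[:12]
```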
Where caching pays back the most
The savings scale with three factors: the cached prefix's share of each prompt, the cache hit rate (driven by traffic frequency and prefix stability), and the read discount. The product of the three approximates the fraction of input cost you can recover.
| Use case | Prefix tokens | Traffic frequency | Stability | Typical saving |
|---|---|---|---|---|
| Customer support assistant (single product) | 4,000 to 8,000 | constant high | very stable | 70 to 85% on input cost |
| RAG over fixed corpus | 6,000 to 20,000 (corpus pages) | varies | stable | 60 to 80% on input cost |
| Coding assistant (system prompt + tools) | 3,000 to 6,000 | constant during work hours | stable | 50 to 70% on input cost |
| One-off batch processing | 2,000 to 5,000 | infrequent, in bursts | varies | 30 to 60% on input cost |
| Per-user personalised assistant | varies per user | varies | unstable per user | 0 to 30% on input cost |
The first three are slam-dunks. The last is where caching does not pay back; if the prefix changes per user, the cache cannot share across users.
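The factors can be folded into a rough estimator; the formula is an approximation for planning, not a provider guarantee:

```python
def input_cost_saving(prefix_share: float, hit_rate: float,
                      discount: float = 0.90) -> float:
    """Approximate fraction of input-token cost recovered.

    prefix_share: cached prefix tokens / total input tokens per call
    hit_rate:     fraction of calls landing inside the TTL window
    discount:     1 - read multiplier (0.90 for a 0.10x read rate)
    """
    return prefix_share * hit_rate * discount

# Support assistant: 6,000-token prefix of a 7,000-token prompt, 95% hits
print(f"{input_cost_saving(6000 / 7000, 0.95):.0%}")
# → 73%
```

That lands inside the 70-to-85-percent band in the table for a single-product support assistant.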
Where caching does NOT pay back
Per-call dynamic system prompts. If the system prompt is generated fresh per request (rare but happens in some agent designs), nothing is cacheable.
Very low traffic. If a prompt is called 200 times a month, the cache rarely hits because the TTL expires between calls. The savings are real but the absolute dollars are small; not worth the engineering effort.
Output-heavy workloads. Caching only discounts INPUT tokens. If your prompt is short (200 input tokens) and produces long output (2000 tokens), the input is already a tiny fraction of the bill. Caching saves cents per call, not dollars.
Multi-tenant prompt isolation. Some compliance contexts require that one tenant’s data never appears in another tenant’s request, even via shared cache attention state. In those contexts, caching may need to be scoped per-tenant, which limits hit rates.
What we install on engagements
Roughly 90% of engagements with significant LLM spend benefit from caching. The standard install:
- Audit the prompt mix and prefix sizes (one day)
- Identify the top 3 to 5 prompts by total token spend
- Reorder each prompt so stable content is first, volatile content last
- Add cache-control markers per provider’s API
- Add observability on cache hit rate per prompt
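For the observability step, a hit-rate metric can be derived from per-response usage counters. The sketch below assumes Anthropic-style field names; other providers report the equivalent under different names:

```python
def cache_hit_ratio(usage_records: list[dict]) -> float:
    """Fraction of cacheable input tokens served from cache.

    Expects per-response usage dicts with Anthropic-style counters:
    cache_read_input_tokens (hits), cache_creation_input_tokens (writes).
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    write = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    total = read + write
    return read / total if total else 0.0
```

A steady-state service should sit well above 0.9 on this metric; a sudden drop usually means the prefix changed.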
Total: typically half an engineer-week. Pays back in the first month for any engagement with non-trivial LLM spend.
The savings are quiet because they show up in the bill, not in the product. The engineering team rarely gets credit for them. The CFO does. We measure them as a percent reduction in input-token cost month-over-month and report them up; otherwise the work is invisible until budget review surfaces a question.
Caching is the easiest cost lever in the LLM stack and the most commonly missed. The work is small. The discipline of “stable prefix first, volatile content last, cache-control marker between them” pays back forever.
Questions teams ask
Do all providers support prompt caching?
Most major providers (Anthropic, OpenAI, Google, AWS Bedrock) offer cache-aware pricing. Mechanics differ slightly, costs differ slightly, the principle is the same. Check your provider's documentation for current TTL and discount rates.
Does caching change the model output?
No. The cache is a server-side optimisation that lets the provider skip recomputing attention over identical prefixes. Output is identical to a non-cached call with the same input.
What invalidates the cache?
Any change to the cached prefix, even a single character, breaks it: tool list changes, system prompt edits, model upgrades, and TTL expiry (typically 5 minutes since the last hit on Anthropic in 2026; longer or best-effort on other providers).