Prompt Caching: Where It Pays Back and Where It Does Not

LLM Cost Engineering · prompt-caching, llm-cost

Provider-side prompt caching cuts cached input cost by up to 90 percent. The pattern that earns the discount and the configurations that waste it.

  • By Orzed Team
  • 5 min read
Key takeaways
  • Cached input tokens cost 10 percent of uncached on most providers in 2026.
  • Cache hits require the prompt prefix to be byte-identical. Even a timestamp in the prefix breaks the cache.
  • RAG context is a natural cache target if the system prompt and tool list are stable across calls.
  • Cache TTL is short (5 minutes typical). Burst patterns benefit; cold-start patterns do not.

A team we audited had been running a customer-support assistant for fourteen months. They were spending roughly 8,000 dollars per month on input tokens. The system prompt was 4,200 tokens, the tool list added 1,800 tokens, and the user-specific context added another 600 to 2,000 tokens per call. Every single one of those calls computed attention over the full prompt from scratch, on every request. Caching had never been enabled.

We added cache-control markers in two hours. The next day, projected input-token cost dropped from 8,000 dollars per month to roughly 1,400. The output-token cost did not change (output is never cached), but it was already a small fraction of the bill. Net monthly saving: roughly 6,600 dollars on a feature that took two engineer-hours to fix.
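A back-of-envelope sketch of that drop, using illustrative assumptions (the per-token price, call volume, and hit rate are ours, not the audited team's; the multipliers are Anthropic-style 1.25x write and 0.10x read):

```python
# Back-of-envelope for the support assistant above. All rates and volumes
# are illustrative assumptions, not the audited team's actual numbers.
STABLE_TOKENS = 4200 + 1800   # system prompt + tool list
VOLATILE_TOKENS = 600         # per-call user context, low end of the range
BASE_PRICE = 3.00             # dollars per million input tokens (assumed)
CALLS_PER_MONTH = 400_000     # assumed volume

def monthly_input_cost(hit_rate=None):
    """Input-token cost per month. hit_rate=None means caching disabled."""
    if hit_rate is None:
        tokens = (STABLE_TOKENS + VOLATILE_TOKENS) * CALLS_PER_MONTH
        return tokens / 1e6 * BASE_PRICE
    reads = STABLE_TOKENS * CALLS_PER_MONTH * hit_rate * 0.10         # hits
    writes = STABLE_TOKENS * CALLS_PER_MONTH * (1 - hit_rate) * 1.25  # misses
    volatile = VOLATILE_TOKENS * CALLS_PER_MONTH                      # full price
    return (reads + writes + volatile) / 1e6 * BASE_PRICE

before = monthly_input_cost()              # roughly 8,000 dollars
after = monthly_input_cost(hit_rate=0.98)  # roughly 1,600 dollars
```

Even with conservative assumptions, a high hit rate on a large stable prefix cuts the input bill to a fifth or less.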

This piece is about that pattern: when prompt caching pays back, when it does not, and the configuration mistakes that quietly leave the discount on the table.

How prompt caching works

Modern LLM providers offer a tiered pricing model: input tokens are billed at full rate the first time the provider sees them, then at a discounted rate (typically 10 percent of full) on subsequent calls within a TTL window when the same prefix appears.

The caching is server-side. The provider hashes the prefix, stores the computed attention state, and reuses it on cache hits. From the application’s perspective, the input is the same; only the bill changes.

Representative pricing across four providers in 2026 (figures approximate):

| Provider | Cache write cost | Cache read cost | TTL |
| --- | --- | --- | --- |
| Anthropic Claude | 1.25x base input cost | 0.10x base input cost | 5 minutes (sliding) |
| OpenAI | base input cost | 0.50x base input cost | indeterminate (best-effort) |
| AWS Bedrock | varies by model | varies by model | varies |
| Google Vertex | base input cost | 0.25x base input cost | 60 minutes |

The exact numbers shift; the principle is consistent. Cache hits cost a fraction of cache misses.

What earns the discount

The cache key is the prefix of the prompt up to the cache-control marker. For a hit, the prefix must be byte-identical to a recent call. The pattern that maximises hits:

[Stable system prompt]                  ← cache-controlled, hits often
[Stable tool list]                      ← cache-controlled, hits often
[Stable retrieval context if any]       ← optional, depends on use case
[CACHE BREAKPOINT]
[Per-call user message]                 ← changes every call, never cached
[Per-call retrieved documents]          ← changes per call

The stable parts go first. The cache-control marker comes after them. The volatile parts go after the marker. Anthropic’s API uses an explicit cache_control field on the message; OpenAI uses automatic prefix matching on consecutive identical prefixes.

# Anthropic-style explicit cache marker
import anthropic

client = anthropic.Anthropic()
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,        # 4200 tokens, stable across calls
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": user_message},
    ],
)

For the cache to hit, the next call within the TTL must use the same model, the same exact system prompt text, and the same tool list. If anything in the cached prefix changes, the next call is a cache write (1.25x base cost) instead of a hit.
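The break-even arithmetic is worth making explicit. With Anthropic-style multipliers, the write premium is recovered on the very second call within the TTL:

```python
# Relative cost of the cached prefix over n calls within the TTL window,
# using Anthropic-style multipliers (1.25x write, 0.10x read).
WRITE_MULT = 1.25
READ_MULT = 0.10

def prefix_cost_cached(n_calls):
    """One cache write, then reads for the remaining calls."""
    return WRITE_MULT + READ_MULT * (n_calls - 1)

def prefix_cost_uncached(n_calls):
    return 1.0 * n_calls

# A single call costs 25% more with caching on (1.25 vs 1.00);
# by the second call the cache is already ahead (1.35 vs 2.00).
```

This is why caching only loses money on prefixes that are seen exactly once inside the TTL window.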

What breaks the cache silently

Several patterns we have caught in audits that defeated caching without the team realising:

Timestamp in the system prompt. A team had Today is {current_date}. in the system prompt. Every call had a different prefix. Cache hit rate: zero. Fix: move the timestamp to the user message section, after the cache marker.

User-specific personalisation in the prefix. “You are an assistant for {user_name}.” Each user has a different prefix. Cache hits only repeat for the same user. Fix: pass user info as part of the per-call message, not in the system prompt.

Tool list permutation. The team’s framework was sorting tools by name on each call, but a recent refactor changed the sort to “by recently used”, randomising the order. Each permutation was a different prefix. Fix: stable sort, ideally alphabetical.

Model upgrade. Caches are per-model. Upgrading from claude-sonnet-4-5 to claude-sonnet-4-6 invalidates all caches. Plan for the cold-start cost on upgrade days.

TTL expiry under low traffic. The cache TTL is short (5 minutes on Anthropic, typically). Low-traffic periods miss the cache because too much time passed between calls. Fix: keep-alive calls during expected gaps if traffic is bursty.

Inconsistent whitespace. A formatter that sometimes added a trailing newline and sometimes did not. Each version was a different prefix. Cache hits halved. Fix: deterministic prefix construction, ideally as a constant string.
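The fixes above share one idea: build the prefix deterministically, once, and keep everything volatile after the breakpoint. A minimal sketch, with hypothetical prompt text and tool names:

```python
import datetime

# Hypothetical stable content; built once at import time, never per call.
SYSTEM_PROMPT = "You are a support assistant for the Acme product line."
TOOLS = sorted(["search_orders", "issue_refund", "escalate_ticket"])  # stable order

# Byte-identical on every call: no dates, no user names, no random ordering.
STABLE_PREFIX = SYSTEM_PROMPT + "\nTools: " + ", ".join(TOOLS)

def build_user_message(user_name: str, text: str) -> dict:
    """Everything volatile (date, user identity, query) lives here,
    after the cache breakpoint, so it cannot break the prefix."""
    today = datetime.date.today().isoformat()
    return {"role": "user",
            "content": f"Today is {today}. User: {user_name}.\n\n{text}"}
```

Constructing the prefix as a module-level constant makes the whitespace and ordering bugs above structurally impossible.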

Where caching pays back the most

The savings scale with three factors: prefix size, traffic frequency, and prefix stability. The product of all three is roughly the percent of input cost you can recover.
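That product can be written down directly. A rough estimator, assuming a 0.10x cache-read rate (the 1.25x write premium is ignored for simplicity):

```python
def recoverable_fraction(prefix_share, hit_rate, read_mult=0.10):
    """Rough fraction of input spend recoverable by caching.
    prefix_share: stable prefix tokens / total input tokens per call
    hit_rate: fraction of calls that land a cache hit
    """
    return prefix_share * hit_rate * (1 - read_mult)

# Support assistant: 85% of tokens in the prefix, 95% hit rate -> ~73%
# Per-user assistant: same prefix share, ~10% hit rate across users -> ~8%
```

Plugging in plausible numbers for each row reproduces the ranges in the table below.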

| Use case | Prefix tokens | Traffic frequency | Stability | Typical saving |
| --- | --- | --- | --- | --- |
| Customer support assistant (single product) | 4,000 to 8,000 | constant high | very stable | 70 to 85% on input cost |
| RAG over fixed corpus | 6,000 to 20,000 (corpus pages) | varies | stable | 60 to 80% on input cost |
| Coding assistant (system prompt + tools) | 3,000 to 6,000 | constant during work hours | stable | 50 to 70% on input cost |
| One-off batch processing | 2,000 to 5,000 | infrequent, in bursts | varies | 30 to 60% on input cost |
| Per-user personalised assistant | unknown | varies | unstable per user | 0 to 30% on input cost |

The first three are slam-dunks. The last is where caching does not pay back; if the prefix changes per user, the cache cannot share across users.

Where caching does NOT pay back

Per-call dynamic system prompts. If the system prompt is generated fresh per request (rare but happens in some agent designs), nothing is cacheable.

Very low traffic. If a prompt is called 200 times a month, the cache rarely hits because the TTL expires between calls. The savings are real but the absolute dollars are small; not worth the engineering effort.

Output-heavy workloads. Caching only discounts INPUT tokens. If your prompt is short (200 input tokens) and produces long output (2000 tokens), the input is already a tiny fraction of the bill. Caching saves cents per call, not dollars.
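The arithmetic, with hypothetical prices ($3 per million input tokens, $15 per million output tokens):

```python
IN_PRICE, OUT_PRICE = 3.00, 15.00   # dollars per million tokens (assumed)
CACHED_MULT = 0.10                  # cached input at 10% of base

def call_cost(in_tok, out_tok, cached_in_tok=0):
    uncached = (in_tok - cached_in_tok) / 1e6 * IN_PRICE
    cached = cached_in_tok / 1e6 * IN_PRICE * CACHED_MULT
    output = out_tok / 1e6 * OUT_PRICE
    return uncached + cached + output

without = call_cost(200, 2000)                        # ~3.06 cents
with_cache = call_cost(200, 2000, cached_in_tok=200)  # ~3.01 cents
# Saving per call: about 0.05 cents. Output dominates either way.
```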

Multi-tenant prompt isolation. Some compliance contexts require that one tenant’s data never appears in another tenant’s request, even via shared cache attention state. In those contexts, caching may need to be scoped per-tenant, which limits hit rates.

What we install on engagements

Roughly 90% of engagements with significant LLM spend benefit from caching. The standard install:

  1. Audit the prompt mix and prefix sizes (one day)
  2. Identify the top 3 to 5 prompts by total token spend
  3. Reorder each prompt so stable content is first, volatile content last
  4. Add cache-control markers per provider’s API
  5. Add observability on cache hit rate per prompt
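For step 5, a minimal counter is enough. This sketch assumes Anthropic-style usage fields on the response (`cache_read_input_tokens`, `cache_creation_input_tokens`); other providers report equivalents under different names:

```python
from collections import defaultdict

class CacheStats:
    """Per-prompt cache hit-rate counter, fed from response usage fields."""
    def __init__(self):
        self.hits = defaultdict(int)
        self.misses = defaultdict(int)

    def record(self, prompt_name, cache_read_tokens, cache_write_tokens):
        if cache_read_tokens > 0:
            self.hits[prompt_name] += 1      # prefix was served from cache
        elif cache_write_tokens > 0:
            self.misses[prompt_name] += 1    # prefix was written fresh

    def hit_rate(self, prompt_name):
        total = self.hits[prompt_name] + self.misses[prompt_name]
        return self.hits[prompt_name] / total if total else 0.0
```

Alert when the hit rate on a high-spend prompt drops; a sudden fall to zero usually means one of the silent cache breakers above just shipped.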

Total: typically half an engineer-week. Pays back in the first month for any engagement with non-trivial LLM spend.

The savings are quiet because they show up in the bill, not in the product. The engineering team rarely gets credit for them. The CFO does. We measure them as a percent reduction in input-token cost month-over-month and report them up; otherwise the work is invisible until budget review surfaces a question.

Caching is the easiest cost lever in the LLM stack and the most commonly missed. The work is small. The discipline of “stable prefix first, volatile content last, cache-control marker between them” pays back forever.

Frequently asked


Do all providers support prompt caching?

Most major providers (Anthropic, OpenAI, Google, AWS Bedrock) offer cache-aware pricing. Mechanics differ slightly, costs differ slightly, the principle is the same. Check your provider's documentation for current TTL and discount rates.

Does caching change the model output?

No. The cache is a server-side optimisation that lets the provider skip recomputing attention over identical prefixes. Output is identical to a non-cached call with the same input.

What invalidates the cache?

Any change to the cached prefix, even a single character, breaks the cache. Common causes: tool list changes, system prompt edits, model upgrades, and TTL expiry (typically 5 minutes since the last hit on most providers in 2026).