LLM Cost Routing: The Cheapest Passing Model Pattern
- Single-model architectures pay flagship prices for tasks that smaller models would handle.
- Per-prompt eval bars decide which model is the cheapest passing one. Without evals, routing is guessing.
- A simple rules-based router catches most of the savings; a learned router catches a few percent more at significant complexity cost.
- Re-evaluate routing decisions quarterly. New models change the equation regularly.
A team we worked with had built an AI feature that called GPT-4 for every operation. Their monthly inference bill had grown from 800 dollars at launch to 14,000 dollars after nine months as usage scaled. They were considering price negotiation with the provider as the next move.
The audit took two days. Of their 23 distinct production prompts, 7 of them were doing classification work that a smaller model handled at the same accuracy. Another 9 were extraction tasks that smaller models also handled cleanly. Only 7 prompts genuinely needed the flagship model, and even then only on a subset of inputs. We installed a router that picked the smallest model passing each prompt’s eval. The next month’s bill dropped to 6,200 dollars. Quality metrics did not move.
This piece is about that pattern. The architecture, the routing logic, the eval discipline that makes it work, and the failure modes to design against.
The pattern in one sentence
For every prompt, run the prompt’s eval suite against each candidate model. The cheapest model that passes is the one production routes to. Re-evaluate when the eval suite changes or when a new model becomes available.
That is the entire pattern. Everything else is implementation detail.
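As code, the selection step is one loop. A minimal sketch, assuming a hypothetical `run_eval_suite(prompt_id, model)` that returns a score in [0, 1] and an illustrative cost-ordered candidate list; neither is a real API.

```python
# Candidates ordered cheapest-first. Prices are illustrative, not real rates.
CANDIDATES_BY_COST = [
    ("small-8b", 0.10),
    ("claude-haiku-4-5", 1.00),
    ("claude-sonnet-4-6", 3.00),
]

def cheapest_passing(prompt_id, run_eval_suite, pass_threshold=0.9):
    """Return the cheapest candidate whose eval score clears the bar."""
    for model, _cost in CANDIDATES_BY_COST:
        if run_eval_suite(prompt_id, model) >= pass_threshold:
            return model
    raise RuntimeError(f"No candidate passes the eval for {prompt_id}")
```

The loop order does the work: because candidates are sorted by cost, the first pass is the cheapest pass.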
The architecture
Three components:
The eval suite per prompt. Each production prompt has a labelled set of test cases (typically 30 to 100). The pass criteria are explicit: a numeric threshold on a quality score, a structural check, or both. We covered eval discipline at length elsewhere; the short version is “automated, blocking, owned”.
The model catalogue. A list of candidate models with their per-token costs and approximate speed. A starting catalogue in 2026:
| Tier | Model | Input cost | Output cost | Use for |
|---|---|---|---|---|
| Flagship | Claude Sonnet 4.6 / GPT-4-class | high | high | Open-ended reasoning, long context, complex tool use |
| Mid | Claude Haiku 4.5 / GPT-4-mini-class | medium | medium | Most production prompts |
| Small | Provider-specific small / open-weights 8B | low | low | Classification, extraction, short-form generation |
| Specialised | Embedding, vision, audio | per-call | per-call | Specific modalities |
The router. The component that maps each request to a model. The cheapest implementation is a static dictionary keyed by prompt name. The most elaborate is a learned classifier. Most teams should start with the static dictionary.
```python
# The router as a static map (the working version on 80% of engagements)
ROUTING = {
    "customer-summary": "claude-haiku-4-5",
    "intent-classification": "small-8b",
    "policy-question-answering": "claude-sonnet-4-6",
    "translation": "small-8b",
    "agent-planning": "claude-sonnet-4-6",
    "json-extraction": "small-8b",
    # ... per prompt
}

def call_llm(prompt_id, prompt_text, **inputs):
    model = ROUTING[prompt_id]
    return llm_call(model, prompt_text.format(**inputs))
```
That is enough to capture most of the savings. A learned router can squeeze out a few more percent by handling per-input variation (some inputs to the same prompt may need a stronger model than others), but the engineering cost is typically not justified until the simple version is exhausted.
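If per-input variation does matter, a cascade is a cheaper middle ground than a learned router: try the small model first and escalate only when a cheap structural check fails. A sketch, assuming a placeholder `llm_call(model, prompt)` and a JSON-extraction prompt whose output validity is checkable in code:

```python
import json

def validate_json_extraction(output: str) -> bool:
    """Cheap structural check: did the model produce valid JSON?"""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def call_with_escalation(prompt_text, llm_call, validator,
                         ladder=("small-8b", "claude-haiku-4-5", "claude-sonnet-4-6")):
    """Try each model cheapest-first; escalate when the check fails."""
    for model in ladder:
        output = llm_call(model, prompt_text)
        if validator(output):
            return model, output
    return ladder[-1], output  # last resort: keep the flagship's answer
```

The escalation check must be cheap and deterministic; if it needs another LLM call, the cascade eats its own savings.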
The eval discipline that makes routing work
Routing without evals is guessing. The eval is the only honest answer to “is this model good enough for this prompt?”.
For each prompt, the eval suite must:
- Have at least 30 production-shaped test cases.
- Have explicit pass criteria (threshold, structural assertion).
- Run on every candidate model when routing decisions are made.
- Re-run on a quarterly cadence and on every model upgrade announcement.
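To make “explicit pass criteria” concrete, one workable shape is a test case that carries a structural assertion alongside a score threshold on the suite. A sketch with illustrative field names, not a prescribed schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    inputs: dict
    expected: str
    structural_check: Callable[[str], bool]  # e.g. valid JSON, allowed label set

def score_exact(output: str, expected: str) -> float:
    """Simplest possible scorer; real suites often use fuzzier metrics."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def suite_passes(cases, outputs, threshold=0.9):
    """Pass = every structural check holds AND the mean score clears the bar."""
    if not all(c.structural_check(o) for c, o in zip(cases, outputs)):
        return False
    mean = sum(score_exact(o, c.expected) for c, o in zip(cases, outputs)) / len(cases)
    return mean >= threshold
```

The structural check is a hard gate on every case; the threshold tolerates some quality variance across the suite.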
Without this discipline, the routing decisions go stale. A model the team picked because it was the cheapest passing six months ago may no longer be the right choice; a smaller model may have improved enough to take over, or a corner case may have surfaced that needs a stronger model.
We have seen teams skip the per-prompt eval and route by intuition. The result is consistently worse than no routing at all: they save money on prompts where the cheaper model is fine and lose quality on prompts where it is not. The eval is not optional.
What the savings actually look like
Across the engagements where we have shipped routing in 2025 and 2026:
| Engagement profile | Pre-routing cost | Post-routing cost | Reduction |
|---|---|---|---|
| AI customer support, 1.2M monthly queries | $14,000/mo | $5,800/mo | 59% |
| Internal coding assistant, 80k devs | $42,000/mo | $24,500/mo | 42% |
| Document processing pipeline, 2.4M docs/mo | $9,200/mo | $4,100/mo | 55% |
| Marketing content generator, 350k pieces/mo | $6,800/mo | $4,300/mo | 37% |
| Engagement-wide average | | | 48% |
The wide range reflects the prompt mix. Engagements heavy on extraction and classification (where small models suffice) save the most. Engagements heavy on open-ended reasoning save less because more prompts genuinely need the flagship.
The savings appear in week one (the day routing ships) and compound as traffic grows. Teams that delay routing pay the difference forever; the savings are not retroactive.
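The arithmetic behind numbers like these is simple enough to sanity-check yourself. A back-of-the-envelope sketch with made-up per-million-token prices (not any provider's actual rates) and an assumed 70/30 routing split:

```python
# Illustrative (input, output) prices in dollars per 1M tokens.
PRICE = {"flagship": (3.00, 15.00), "small": (0.10, 0.40)}

def monthly_cost(calls, in_tokens, out_tokens, tier):
    p_in, p_out = PRICE[tier]
    return calls * (in_tokens * p_in + out_tokens * p_out) / 1_000_000

# Before: 1.2M calls/month, 800 tokens in / 300 out, all on the flagship.
before = monthly_cost(1_200_000, 800, 300, "flagship")
# After: 70% of calls routed down to the small tier, 30% stay on the flagship.
after = (monthly_cost(840_000, 800, 300, "small")
         + monthly_cost(360_000, 800, 300, "flagship"))
```

Under these assumed prices the reduction lands around two thirds; the real number depends entirely on the prompt mix and the split, which is the point of the table above.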
Failure modes
Re-evaluating too rarely. A model that was the cheapest passing in 2025 may not be in 2026. Provider releases shift the catalogue every quarter. A team that routes once and never re-evaluates leaves money on the table.
Conflating cost with latency. Smaller models are not always faster. Specialised inference infrastructure on a flagship model can outperform a small model running on commodity hardware. Routing decisions should consider latency budgets too.
Routing inside the prompt instead of around it. Some teams ask the same model to “decide whether to use yourself or escalate to a stronger model”. This wastes a flagship-model call on the routing decision. Route in code, not in a prompt.
Per-user routing. It is tempting to route VIP customers to the flagship and free-tier customers to the small model. Operationally fine, but be honest about what it is: if quality differs visibly across tiers, you have a tier system, not a routing strategy. Be deliberate about which one you are running.
Context-window wastage. A 1500-token request to a small model still costs less than to a flagship, but if the prompt could be cut to 500 tokens by truncating irrelevant context, the savings stack. Routing alone does not fix bloated prompts; combine it with context discipline.
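Context discipline can sit in front of the router as a separate pass. A crude sketch that keeps only the most recent context chunks within a token budget, using a rough 4-characters-per-token heuristic rather than a real tokenizer:

```python
def trim_context(chunks, budget_tokens=500):
    """Keep the most recent chunks that fit in the token budget."""
    kept, used = [], 0
    for chunk in reversed(chunks):           # walk newest-first
        tokens = max(1, len(chunk) // 4)     # rough heuristic, not a tokenizer
        if used + tokens > budget_tokens:
            break
        kept.append(chunk)
        used += tokens
    return list(reversed(kept))              # restore original order
```

Recency is only one trimming policy; relevance scoring works too, but even this crude version stacks with routing because smaller prompts are cheaper on every tier.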
When NOT to bother with routing
If your monthly LLM spend is under 1,000 dollars, the engineering time to install routing exceeds the savings. The break-even is roughly 3,000 to 5,000 dollars per month, after which the work pays back inside two months.
If all your prompts genuinely need the flagship, routing has nothing to optimise. This is rare: the audit usually finds that only two or three prompts truly do, and the rest can route down. Expect it only in genuinely complex agentic use cases.
If your provider relationship is locked into a single-model contract (some enterprise deals), routing across providers is contractually forbidden. The savings are then limited to within-provider tier choices, which is usually still meaningful.
What we install on engagements
Standard install:
- Audit the production prompt mix (one to three days)
- Per-prompt eval suite (one engineer-week)
- Static routing table with per-prompt model assignments (half a day)
- Observability on routing decisions and per-model cost (half a day)
- Quarterly re-evaluation calendar (process, not code)
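The observability piece can start as one structured log line per call. A minimal sketch; the field names and the `sink` callback are illustrative, not a prescribed schema:

```python
import json
import time

def log_routing_decision(prompt_id, model, in_tokens, out_tokens,
                         cost_usd, sink=print):
    """Emit one structured record per call so per-model spend is queryable."""
    sink(json.dumps({
        "ts": time.time(),
        "prompt_id": prompt_id,
        "model": model,
        "in_tokens": in_tokens,
        "out_tokens": out_tokens,
        "cost_usd": round(cost_usd, 6),
    }))
```

With this in place, the quarterly re-evaluation starts from data (cost and volume per prompt and per model) instead of from memory.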
Total: roughly two engineer-weeks for the first install. Pays back in the first or second month. Continues paying back forever.
The teams that install this once recover the engineering cost in weeks and keep the savings for the lifetime of the product. The teams that delay it pay flagship prices for tasks that smaller models would handle, and the bill scales linearly with usage. The work is small. The return is large. The only cost is the eval discipline, which the team needs anyway.
Questions teams ask
How do I know which model is cheapest for each prompt?
Run the prompt's eval suite against each candidate model. The cheapest model that passes the suite is the one to route to. The eval is the answer; opinion is not.
Doesn't the routing layer add latency?
Negligible. A rules-based router adds under a millisecond. A small classifier router (BERT-class) adds 5 to 20 ms. Either way, the added latency is dwarfed by the generation time saved on smaller, faster models.
What about consistency across users?
Route deterministically by prompt category, not by user. The same prompt class always goes to the same model. Users get the same model for the same task; you do not have a hidden A/B test in production.