When RAG Actually Helps and When It Hides Bad Retrieval
- RAG is for retrieval problems, not for reasoning problems.
- If your retrieval recall is below 70%, the LLM is hiding a bad search index, not adding value.
- A measured RAG system needs three eval surfaces: retrieval quality, generation quality and end-to-end answer quality.
- Fine-tuning beats RAG when the answers are stable and the corpus changes slowly.
A client asked us to audit a customer-support RAG system that had been in production for nine months. The team had built it carefully: a Pinecone vector store, OpenAI embeddings, a hybrid search layer, GPT-4 for generation. The product worked well in demos. Customer satisfaction scores were dropping in production.
We measured retrieval recall on a labelled question set. It was 41%. Nearly six times out of ten, the right answer was not in the retrieved context, and the LLM was generating from partial information that looked complete. The team had built a beautiful generation layer on top of a search index that was failing more often than it succeeded.
That experience generalises. RAG is the default architecture for knowledge-grounded LLM systems, and a meaningful fraction of those systems are RAG by reflex rather than by analysis. This piece is the four-question test we run before we recommend a RAG architecture, and the alternatives that win when the test points away from RAG.
What RAG is good at
RAG fits a specific shape of problem: the question can be answered from a body of text the LLM did not see during training, and the right answer is in some retrievable subset of that text. The retrieval step finds the relevant passages, the LLM synthesises an answer grounded in them.
This shape is real and common: customer support knowledge bases, internal documentation search, legal precedent lookup, regulated content where the answer must come from a specific source. RAG is the right architecture for all of these, when the retrieval works.
What RAG is bad at
RAG is bad at problems that look like retrieval but are actually something else.
Reasoning problems disguised as retrieval. A query like “given these three customer plans, which is most cost-effective for a team of fifteen with these usage patterns” is not a retrieval problem. The right answer requires reasoning over context that may need to be assembled from many sources. RAG can supply the source documents but the LLM still has to reason; if the reasoning is the failure mode, fixing retrieval will not help.
Workflow problems disguised as retrieval. A query like “create a refund for customer X” is a tool-use problem. The LLM should call a refund API, not retrieve documentation about how refunds work. RAG architectures sometimes get layered on top of these problems to “give the LLM context,” but the right answer is to give the LLM tools.
Stable-answer problems on a slow-changing corpus. If the corpus changes monthly and the same questions get asked daily, fine-tuning will outperform RAG on both quality and per-query cost. RAG’s only structural advantage is freshness; if freshness is not a requirement, the advantage disappears.
Conversational generation that does not need grounding. Open-ended creative tasks, summarisation of provided text, format transformation. None of these benefit from retrieval. Adding RAG to them adds latency and noise.
The four-question test
Before we recommend RAG on an engagement, we ask:
1. Is the answer to the user’s question genuinely in a retrievable corpus? If the answer requires synthesis across many sources, or reasoning that the corpus does not contain, RAG will not produce it.
2. Does the corpus change often enough to need freshness? If the corpus is stable for months at a time, fine-tuning is on the table. If it changes weekly or daily, RAG is the right architecture.
3. Can we measure retrieval quality independently of generation quality? If yes, we can debug the system. If no, the system is unmaintainable; you cannot tell whether a bad output is a retrieval miss or a generation miss.
4. What are the per-query economics? Below roughly 50,000 monthly queries, RAG is cheaper. Above 200,000 monthly queries with stable answers, fine-tuning is often cheaper. The crossover depends on context window size and answer length.
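A back-of-the-envelope cost model makes the crossover in question 4 concrete. All prices, token counts, and the retraining figure below are illustrative assumptions, not quotes from any provider:

```python
def monthly_cost_rag(queries, ctx_tokens=4000, out_tokens=300,
                     in_price=5e-6, out_price=15e-6, retrieval_price=1e-4):
    """Illustrative RAG cost: retrieval lookup + large context + generation."""
    per_query = retrieval_price + ctx_tokens * in_price + out_tokens * out_price
    return queries * per_query

def monthly_cost_finetuned(queries, in_tokens=100, out_tokens=300,
                           in_price=1e-6, out_price=4e-6,
                           monthly_retrain=2000.0):
    """Illustrative fine-tune cost: amortised retrain + cheap short-context calls."""
    per_query = in_tokens * in_price + out_tokens * out_price
    return monthly_retrain + queries * per_query

# Smallest monthly volume at which fine-tuning becomes cheaper.
# With these made-up numbers it lands inside the 50k-200k band.
crossover = next(q for q in range(0, 2_000_000, 1000)
                 if monthly_cost_finetuned(q) < monthly_cost_rag(q))
```

Plug in your own prices and context sizes; the shape of the answer (a fixed cost amortised against a lower marginal cost) is what matters, not these numbers.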
The decision matrix that usually emerges:
| Question shape | Corpus freshness | Volume | Right architecture |
|---|---|---|---|
| Retrieval-shaped | Daily/weekly | Any | RAG |
| Retrieval-shaped | Stable for months | High | Fine-tune (with periodic re-train) |
| Reasoning-shaped | Any | Any | Better reasoning prompt + tools, not RAG |
| Workflow-shaped | Any | Any | Tool use + structured output, not RAG |
| Open-ended generation | N/A | Any | Plain LLM, not RAG |
A non-trivial number of production systems land in the wrong cell. The audit usually finds them in two places: reasoning problems “solved” with RAG (where adding more retrieved context does not help because the bottleneck is reasoning), and workflow problems “solved” with RAG (where the LLM is summarising documentation about an action instead of taking the action).
The three eval surfaces a measured RAG system needs
If RAG is the right architecture, the system needs measurement at three layers, separately:
Retrieval quality: given a labelled question set, what is the recall@K? What is the mean reciprocal rank of the right answer? These are search-engine metrics, measured against a held-out evaluation set the team curates and updates.
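Both metrics fall out of the labelled set directly; a minimal sketch, assuming each question is labelled with a single gold document id:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of questions whose gold passage appears in the top-k."""
    hits = sum(1 for docs, gold in zip(retrieved, relevant) if gold in docs[:k])
    return hits / len(relevant)

def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the gold passage (0 when it is missing)."""
    total = 0.0
    for docs, gold in zip(retrieved, relevant):
        for rank, doc in enumerate(docs, start=1):
            if doc == gold:
                total += 1.0 / rank
                break
    return total / len(relevant)

# Per question: the ranked doc ids the pipeline returned, and the gold id.
retrieved = [["d3", "d1", "d9"], ["d2", "d7", "d4"], ["d8", "d5", "d6"]]
relevant  = ["d1", "d4", "d0"]
recall_at_k(retrieved, relevant, k=3)      # 2/3
mean_reciprocal_rank(retrieved, relevant)  # (1/2 + 1/3 + 0) / 3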
Generation quality: given the right context (manually selected, not retrieved), how often does the LLM produce the correct answer? This isolates the LLM’s ability to synthesise from good context.
End-to-end answer quality: given the user’s question and the production retrieval pipeline, how often does the user get the right answer? This is the customer’s experience.
Without these three, an end-to-end failure has no clear cause. The team will tune one layer in response to a failure that lived in another, and the system will not improve. With these three, every failure can be attributed to a layer, and the right work goes to the right place.
| Surface | Metric | Acceptable target |
|---|---|---|
| Retrieval | Recall@5 on labelled questions | > 0.85 |
| Generation | Answer correctness with golden context | > 0.92 |
| End-to-end | Answer correctness on production queries | > 0.80 |
The end-to-end target is lower because it multiplies the other two. A retrieval recall of 0.85 and a generation correctness of 0.92 yield a theoretical ceiling of 0.78 end-to-end, ignoring the cases where the LLM still answers correctly from partial context (and those where it answers incorrectly with full context).
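The attribution step can be written down as a triage over the labelled set. The per-query flags `retrieved_ok` (gold passage in the retrieved top-k) and `answer_ok` (end-to-end answer correct) are hypothetical names for measurements from the first and third surfaces:

```python
from collections import Counter

def triage(results):
    """results: list of (retrieved_ok, answer_ok) per labelled query.
    Buckets each failure into the layer whose fix will actually move it."""
    buckets = Counter()
    for retrieved_ok, answer_ok in results:
        if answer_ok:
            buckets["pass"] += 1
        elif not retrieved_ok:
            buckets["retrieval_miss"] += 1   # fix the index, chunking, query rewriting
        else:
            buckets["generation_miss"] += 1  # fix the prompt or the model
    return buckets

# The (False, True) case is the LLM answering correctly from partial context.
triage([(True, True), (False, False), (True, False), (False, True)])
# pass: 2, retrieval_miss: 1, generation_miss: 1
```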
When fine-tuning is the better answer
Fine-tuning becomes attractive when several conditions stack: the answers are stable for months, the question patterns are repetitive, the volume is high, and the corpus is small enough to fit in a fine-tuning dataset (typically under 100,000 question-answer pairs).
A canonical example: a product support assistant for a stable SaaS product. The product changes quarterly, not daily. Customers ask the same hundred questions in many phrasings. Volume is millions of queries per month. The corpus of correct answers is well-curated.
For this shape, fine-tuning a smaller model produces:
- Lower per-query cost (no embedding lookup, no large context window)
- Lower latency (no retrieval round-trip)
- Higher consistency (the same question always gets the same answer)
- A clear migration path on corpus updates (re-fine-tune monthly)
The trade-off is real: fine-tuning costs an upfront training run, and updates require re-training. For high-volume stable-answer products, the trade-off is overwhelmingly worth it.
Hybrid systems
The honest answer for many production systems is hybrid: a fine-tuned base for the common questions, with RAG for the long tail and for fresh content. The router decides which path each query takes based on cheap heuristics (intent classification, query length, novelty score against the training set).
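A minimal sketch of such a router, using exact match and token-overlap novelty as the cheap heuristics; the threshold and length cutoff are assumptions to tune against your own traffic, not derived values:

```python
def route(query: str, known_questions: set, novelty_threshold: float = 0.5) -> str:
    """Cheap heuristic router: common, stable questions go to the
    fine-tuned model; long or novel queries take the RAG path."""
    normalised = query.strip().lower()
    if normalised in known_questions:
        return "fine-tuned"
    if len(normalised.split()) > 30:  # long queries tend to need fresh context
        return "rag"
    # Crude novelty score: Jaccard overlap with the closest known question.
    tokens = set(normalised.split())
    best = max(
        (len(tokens & set(q.split())) / len(tokens | set(q.split()))
         for q in known_questions),
        default=0.0,
    )
    return "fine-tuned" if best >= novelty_threshold else "rag"
```

In production the novelty score would more likely be an embedding distance against the fine-tuning set, but the shape of the decision is the same.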
Hybrid is more complex to operate. It is also the right architecture for the largest customer-facing systems we have audited, because no single approach handles the full distribution of queries cost-effectively.
Default to RAG when starting out, measure honestly, and evolve toward fine-tuning or hybrid as the volume and stability of the queries warrant. The mistake is to lock in RAG forever because it was the right answer at twelve months and the system grew past that point without anyone re-asking the four questions.
What teams over-invest in within RAG
Three places where engineering effort within a RAG system is consistently misallocated:
Embedding model upgrades. Switching from OpenAI ada-002 to text-embedding-3-large is a meaningful improvement on some benchmarks, but it is rarely the bottleneck in production retrieval. Measured by recall@5 before and after, the gain is usually smaller than what reranking or query rewriting would have produced.
Vector database brand wars. Pinecone, Weaviate, Qdrant, Chroma, pgvector. The choice matters operationally (cost, latency, ergonomics). It rarely matters for retrieval quality. Teams spend weeks A/B testing vector databases when their recall@5 problem is the chunk size of their documents.
Chunk size dogma. “Use 512 tokens” is repeated as if it were a derived truth. Chunk size depends on the document type, the question type, and the embedding model. Measure on a labelled set rather than copying a recipe.
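Measuring chunk size rather than copying a recipe is a small loop; a sketch, where `build_index` and `recall_at_5` are stand-ins for your own indexing pipeline and the retrieval eval above:

```python
def sweep_chunk_sizes(corpus, eval_set, build_index, recall_at_5,
                      candidates=(128, 256, 512, 1024)):
    """Re-index the corpus at each candidate chunk size and score
    recall@5 on the labelled set; return the best size and all scores."""
    scores = {}
    for size in candidates:
        index = build_index(corpus, chunk_tokens=size)
        scores[size] = recall_at_5(index, eval_set)
    best = max(scores, key=scores.get)
    return best, scores
```

The loop is cheap relative to the weeks teams spend on vector-database comparisons, and it answers the question for your documents rather than someone else's.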
The under-invested places, by contrast: query rewriting (almost always lifts recall), reranking (LLM-judge rerank against top-50 retrieved), and the labelled evaluation set (the artefact that lets you measure any of this).
The honest summary
RAG is a powerful pattern when applied to retrieval-shaped problems with measurable quality. It becomes a cargo-cult when applied to anything that involves an LLM and any kind of context. The four-question test takes thirty minutes of analysis and saves teams from architectures that look modern and produce mediocre results. The eval surfaces take a week to set up and turn an unmaintainable system into one the team can actually improve.
The teams that get RAG right treat it as a search engine with an LLM glued to the front. The teams that get it wrong treat it as magic.
Questions teams ask
Is RAG always cheaper than fine-tuning?
Per-query, no. Each RAG call pays for retrieval (vector lookup), context-window tokens for the retrieved passages, and generation. Fine-tuning has a high upfront cost and a low marginal cost per call. The crossover is roughly 50,000 to 200,000 monthly queries, depending on context window size. Above that volume, fine-tuning often wins on cost.
What recall threshold is acceptable?
Above 80% recall@5 is usable. Above 90% is good. Below 70% the LLM is generating from incomplete context and the answers will look plausible but be wrong in subtle ways. Measure recall on a labelled set of questions before judging the LLM's output.
When is hybrid search better than dense vectors alone?
Almost always for non-conversational corpora. Dense vectors miss exact-match queries (product codes, names, identifiers). Hybrid search (BM25 + dense) trades a small latency cost for a meaningful recall gain on heterogeneous queries. Default to hybrid; switch to dense-only when measurement shows it equivalent.
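One common way to combine the two result lists is reciprocal rank fusion; a minimal sketch with made-up doc ids, where an exact-match document that dense search buries gets pulled up because BM25 ranks it first:

```python
def reciprocal_rank_fusion(dense_ranked, bm25_ranked, k=60):
    """Fuse dense and lexical result lists with reciprocal rank fusion:
    score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in (dense_ranked, bm25_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "SKU-4421" is last in the dense list but first in BM25, so the
# fused ranking lifts it above everything the other list had below d1.
reciprocal_rank_fusion(["d1", "d2", "SKU-4421"], ["SKU-4421", "d1", "d9"])
```

The constant `k=60` is the conventional default from the RRF literature; it damps the advantage of a single first-place rank so that agreement across lists wins.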