Pedro Bertoluchi

When NOT to use RAG: 4 patterns that look like RAG but aren't

RAG has become the default answer for every text problem, but four common shapes are cheaper, faster and more accurate when solved with the right tool.


RAG has become a hammer in search of nails. Half the architectures I review reach for an embedding model, a vector store and a reranker for problems that a SQL query or a small classifier solves in a tenth of the time and a hundredth of the bill. The discipline is recognizing the shape of the problem before reaching for the pattern, and four shapes that are not RAG show up over and over.

Structured FAQ is the first. A finite set of questions with curated answers is a lookup, not a retrieval problem. Normalize the question, hash it, cache the answer, fall back to fuzzy match with a string distance and only then escalate to a small model for paraphrase. The cost per request collapses, latency drops under fifty milliseconds, and the answer is whatever the business approved last week instead of whatever the model decided to paraphrase today.
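The lookup chain above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a production design: the `FAQ` entries, the function names and the `0.85` cutoff are all assumptions made for illustration, and `difflib`'s similarity ratio stands in for whatever string distance you prefer.

```python
import difflib
import hashlib
import re
import unicodedata

# Hypothetical curated FAQ: question text -> answer approved by the business.
FAQ = {
    "How do I reset my password?": "Use the 'Forgot password' link on the login page.",
    "What is your refund policy?": "Refunds are available within 30 days of purchase.",
}

def normalize(question: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", question)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())

def key_for(question: str) -> str:
    """Hash of the normalized question, used as the cache key."""
    return hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()

# Built once at startup: hash -> answer, plus normalized text for the fuzzy fallback.
ANSWERS_BY_KEY = {key_for(q): a for q, a in FAQ.items()}
ANSWERS_BY_TEXT = {normalize(q): a for q, a in FAQ.items()}

def answer(question: str, cutoff: float = 0.85):
    """Return (answer, route). The answer is None when the request should escalate."""
    hit = ANSWERS_BY_KEY.get(key_for(question))
    if hit is not None:
        return hit, "exact"
    close = difflib.get_close_matches(
        normalize(question), list(ANSWERS_BY_TEXT), n=1, cutoff=cutoff
    )
    if close:
        return ANSWERS_BY_TEXT[close[0]], "fuzzy"
    # Only here would a small model be asked to match a paraphrase.
    return None, "escalate"
```

Only the `"escalate"` route ever touches a model, and every answer served from the other two routes is the curated text verbatim.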

Document classification is the second. Routing a ticket to a queue, tagging an invoice and deciding whether an email needs a human are all classifier problems. A fine-tuned small model, or even a logistic regression on top of embeddings computed once, beats a RAG pipeline on accuracy, latency and cost. Retrieval adds noise because the model sees similar but wrong examples and drifts toward them. The right move is to train on the labels you actually want to predict.
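As an illustration of the logistic-regression option, here is a small sketch in dependency-free Python. The three-dimensional vectors are toy stand-ins for embeddings computed once by some embedding model, the labels and hyperparameters are invented for the example, and in practice a library implementation would replace the hand-written gradient descent.

```python
import math

# Toy "embeddings" (assumed to be precomputed once and stored with their labels).
# Label 1 = email needs a human, 0 = it does not. All values are made up.
TRAIN = [
    ([0.9, 0.1, 0.2], 1),
    ([0.8, 0.2, 0.1], 1),
    ([0.7, 0.3, 0.3], 1),
    ([0.1, 0.9, 0.7], 0),
    ([0.2, 0.8, 0.9], 0),
    ([0.3, 0.7, 0.8], 0),
]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x) -> float:
    """Probability that the embedding x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(data, epochs: int = 500, lr: float = 0.5):
    """Fit logistic regression with plain batch gradient descent."""
    dim = len(data[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = predict_proba(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

weights, bias = train(TRAIN)
needs_human = predict_proba(weights, bias, [0.85, 0.15, 0.2]) > 0.5
```

At inference time the cost is one embedding call and one dot product, with no vector store and no generation step in the path.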

Exact match on identifiers is the third. A user types a SKU, an order number, a tax ID, a tracking code. A plain B-tree index in Postgres, or a trigram GIN index for fuzzy variants, returns the row in single-digit milliseconds, and the exact lookup does it with perfect precision. An embedding model on the same query will sometimes return a semantically similar but operationally wrong record, which is the worst possible failure mode for a support agent who trusts the screen.
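A minimal sketch of the exact-lookup path follows. It uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs without a server; the `orders` table, its columns and the sample rows are hypothetical, and the normalization (trim and uppercase) is an assumption about how the identifiers are stored.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here. The primary key gives the
# table a unique index, so the lookup below is an index probe, not a scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_number TEXT PRIMARY KEY, status TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ORD-10023", "shipped"), ("ORD-10032", "cancelled")],
)

def find_order(raw: str):
    """Exact lookup on the identifier. Returns the status, or None if absent."""
    order_number = raw.strip().upper()
    row = conn.execute(
        "SELECT status FROM orders WHERE order_number = ?", (order_number,)
    ).fetchone()
    return row[0] if row else None
```

Note what happens with a near miss: `find_order("ORD-10024")` returns `None` rather than the closest neighbour, so the agent sees "not found" instead of someone else's order.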

Single-document summarization is the fourth. When the input fits in the context window of a current frontier model, retrieval is a detour. Send the document, ask the question, get the answer with the citation as a character offset. GPT-5 and Claude Opus 4.6 both handle hundreds of thousands of tokens cleanly. Chunking a fifty-page contract into a vector store to ask one question about clause seven is engineering theatre, not architecture.
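One way to make the character-offset citation checkable is to ask the model to quote its supporting passage verbatim and then locate that quote in the document yourself. The sketch below covers only that verification step; the model call is omitted, and `locate_quote`, the sample contract text and the quote are all invented for the example.

```python
def locate_quote(document: str, quote: str):
    """Return (start, end) character offsets of quote in document, or None."""
    start = document.find(quote)
    if start == -1:
        # The quoted passage is not in the document, so the citation is rejected.
        return None
    return start, start + len(quote)

# Hypothetical document and a quote assumed to come back from the model.
contract = (
    "Clause 6. Either party may assign this agreement with written consent. "
    "Clause 7. Either party may terminate with thirty days of written notice."
)
quote = "terminate with thirty days of written notice"
span = locate_quote(contract, quote)
```

If `span` is `None`, the quote does not appear in the source and the answer should not be shown as cited. Otherwise `contract[span[0]:span[1]]` is exactly the passage to highlight for the user.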

True RAG has a specific shape. An open question, asked over a corpus that changes faster than you can fine-tune, where the answer must cite a source the user can verify, and where the corpus is too large to fit in context. Internal knowledge bases, legal precedent, product documentation across thousands of pages. When the shape matches, RAG earns its complexity. When it does not, every layer you add is a tax the user pays in latency, the company pays in tokens, and the on-call pays at three in the morning.
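The four conditions above can be written down as an explicit checklist. The sketch below is only a way of making the test concrete before an architecture review; the `Problem` fields and the function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    open_ended_question: bool
    corpus_changes_faster_than_fine_tuning: bool
    answer_must_cite_source: bool
    corpus_fits_in_context: bool

def is_rag_shaped(p: Problem) -> bool:
    """True only when all four conditions from the paragraph above hold."""
    return (
        p.open_ended_question
        and p.corpus_changes_faster_than_fine_tuning
        and p.answer_must_cite_source
        and not p.corpus_fits_in_context
    )

# An internal knowledge base versus one question about a fifty-page contract.
knowledge_base = Problem(True, True, True, False)
single_contract = Problem(True, False, True, True)
```

Here `is_rag_shaped(knowledge_base)` is true and `is_rag_shaped(single_contract)` is false, because the contract fits in context.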

Tags

  • #rag
  • #applied-ai
  • #architecture
