RAG Is a Toy
Why Retrieval-Augmented Generation Won’t Scale to Real Enterprise AI
The most widely adopted AI architecture for enterprise knowledge is built on a probabilistic foundation that breaks in production. A growing body of research — from language drift to embedding theory to mechanistic interpretability — reveals why, and points to what comes next.
What Is RAG (and vectors)?
Retrieval-Augmented Generation (“RAG”) is the architecture that most enterprises have adopted to connect AI models to their own data. How it works is straightforward: instead of relying solely on what an LLM¹ has been trained on, you give it access to your documents in real time to expand its context with knowledge of you and your business.
To understand RAG you need to understand vectors. I was introduced to vectorization by a Computational Ontologist named Jefferson (unicorn-grade skills and one of the most important jobs of this era - more on that in another post. Also an amazing human), who offered the perfect analogy:
Vectors represent the mathematical distance between concepts in a knowledge space. Take the word “house.” There’s the physical dwelling with walls, a roof, somewhere you put furniture and call home. There’s also “apartment”. The vector distance between these two concepts is small, because they share most of the same properties. You live in both, they have rooms, you furnish them.
Now consider the television show House — about a cranky but brilliant doctor. The word is the same, but the vector distance from the physical dwelling is large. The concepts share almost nothing beyond the label. An embedding model has to compress both meanings into the same mathematical space, and when your RAG system retrieves documents about “house,” it’s relying on that compression to know which one you mean.
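The analogy can be made concrete with a toy calculation. The sketch below uses invented three-dimensional vectors (real embedding models use hundreds or thousands of dimensions, and the numbers here are made up purely for illustration) to show how cosine similarity scores the "distance" between concepts:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 means close concepts, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional embeddings; the axes loosely stand for
# "is a dwelling", "has rooms you furnish", "is a TV show".
house_dwelling = [0.9, 0.8, 0.0]
apartment      = [0.8, 0.9, 0.1]
house_tv_show  = [0.1, 0.0, 0.95]

print(cosine_similarity(house_dwelling, apartment))      # high: small vector distance
print(cosine_similarity(house_dwelling, house_tv_show))  # low: same word, distant concept
```

The two dwellings score close to 1.0; the word “house” and the show House score close to 0.0. Retrieval quality lives and dies on that separation.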
In practice, RAG means converting your documents into numerical vectors (called embeddings) and storing them in a vector database. When a user asks a question, the system searches for the most semantically similar chunks and feeds them to the LLM alongside the query. The LLM then generates an answer “grounded” in your data rather than in what it was originally trained on. You get the power of the LLM’s ‘thinking’ ability, but it’s pointed at your world of knowledge.
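Stripped to its essentials, that pipeline fits in a few lines. This is a minimal sketch, not any particular vendor’s implementation: `embed` and `llm_generate` are hypothetical placeholders for a real embedding model and a real LLM API call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve_and_generate(query, chunks, embed, llm_generate, top_k=3):
    """Minimal RAG loop: embed the query, rank stored chunks by similarity,
    and hand the best matches to the LLM as grounding context."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```

In production the chunk embeddings would be precomputed and stored in a vector database rather than recomputed per query, but the loop is the same: embed, rank, stuff the prompt, generate.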
It’s an elegant idea, and it works well enough for demos. The problem is that the vector distance between “well enough for demos” and “reliable enough for enterprise” is very large (pardon the pun). A growing body of peer-reviewed research is revealing that RAG’s limitations aren’t engineering problems to be optimised away. They are structural.
New Research
A Peking University team recently published a study examining what happens when RAG systems retrieve documents in a different language from the user’s query, a routine occurrence in any global business. They found that models consistently generate answers in the wrong language, not because they misunderstood the query, but because English dominates the training data so heavily that probabilistic decoding defaults to English regardless of instruction.
What this means is that the LLM understood the question perfectly. It simply couldn’t control its own output (Li, Xu & Xie, 2025).
You’re probably thinking this is an edge case around translation and language. Translations have errors all the time. It isn’t. An ICLR 2025 paper showed that even when retrieval returns the right answer, models still hallucinate — because the model's own learned knowledge physically overwrites the retrieved information during generation (Sun et al., 2025). A separate paper found the same thing: the model's built-in knowledge progressively erases the retrieved evidence the further it gets into generating a response (CoRect, 2026). These experiments show that the model doesn’t fail to retrieve the right answer. It retrieves it, then ignores it.
Retrieval itself has limits. A Google DeepMind team proved mathematically that as a document corpus grows, the vector distances that RAG relies on to find the right information converge: everything starts looking equally relevant. At scale, the system can't tell the difference between the document you need and thousands of near-identical matches (Weller et al., 2025). The gap between "house" and House that seemed clear at small scale starts to collapse — and so does every other distinction your business depends on.
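You can watch a toy version of this convergence happen. The simulation below is not the DeepMind proof (it uses random vectors as a crude stand-in for a corpus, and all parameters are invented), but it illustrates the phenomenon: as the number of documents grows, the similarity gap between the best match and the 50th-best match for a query shrinks, so "the right document" becomes harder to distinguish from its neighbours.

```python
import math
import random

def random_unit_vector(dim, rng):
    """A random direction in `dim`-dimensional space, normalised to length 1."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_similarity_margin(corpus_size, dim=64, n_queries=5, seed=0):
    """Average gap between the best and the 50th-best match for random queries:
    a rough proxy for how distinguishable the 'right' document is."""
    rng = random.Random(seed)
    docs = [random_unit_vector(dim, rng) for _ in range(corpus_size)]
    margins = []
    for _ in range(n_queries):
        query = random_unit_vector(dim, rng)
        sims = sorted(
            (sum(q * d for q, d in zip(query, doc)) for doc in docs),
            reverse=True,
        )
        margins.append(sims[0] - sims[49])
    return sum(margins) / len(margins)

for n in (100, 1_000, 10_000):
    print(f"{n:>6} docs: margin {top_similarity_margin(n):.3f}")  # margin shrinks as n grows
```

The margin falls steadily as the corpus grows: the crowd of near-matches closes in on the best match.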
The Scale Problem
The RAG ecosystem has coalesced around demos involving 5,000 to 50,000 documents and calls that enterprise-grade. It is a toy problem.
A mid-market company generates tens-to-hundreds of thousands of documents per year. A single workflow in an ERP system produces millions of transactional records. And that’s before AI agents start generating their own outputs like logs, decisions, audit trails, and exception reports at machine speed. A single agent processing invoices across three subsidiaries will generate more structured data in a month than most RAG demos use as their entire corpus.
The DeepMind research explains why this matters: RAG doesn’t degrade linearly with scale. It breaks combinatorially. A system that works over 10,000 documents may not work over 100,000. RAG’s retrieval mechanism was designed for a world where the knowledge base is static and small.
Market Ahead of Product
Why are we only now discovering the limitations of RAG? My take is that it’s because of what Jeremy Clarkson from Top Gear loves…POWER! The hardware story over the past decade is extraordinary: context windows have grown from 4,000 to over one million tokens, and inference costs have fallen 90% in two years. Research comparing long-context models directly against RAG found that simply feeding entire documents into these larger context windows (think the prompt window in Gemini) often outperforms RAG approaches (Li et al., 2025). The hardware is already solving problems the RAG software architecture was designed to address, and we have arguably unlimited amounts of it.
The AI software story is less impressive. The dominant probabilistic architecture was adequate as a first-generation experiment. It demonstrated that LLMs could work with private data. But it was never designed for deterministic, auditable execution. There is a growing trend in research (article coming) showing that schema and structure amplify this new intelligence.
Industry data suggests 80% of enterprise RAG projects experience critical failures, with AI project failure rates hitting 42% in 2025, a 2.5x increase from the prior year (Bruckhaus, 2024). We have the compute to run extraordinarily powerful AI systems. We don’t yet have the software architecture to make them reliably do what we tell them to do.
Orchestration and Determinism
The antidote isn’t more RAG. It’s orchestration.
Every significant fix in the research literature points in the same direction. The language drift paper forces the model to stay in the right language. The interpretability papers stop the model from overriding what it retrieved. Each of these solutions is a deterministic constraint imposed on a probabilistic system to force predictable behaviour. A February 2026 paper makes the argument explicit:
“no training-only procedure can provide a deterministic guarantee of command-data separation under adversarial conditions” (arXiv:2602.09947).
Probabilistic learned behaviour cannot substitute for deterministic structural enforcement.
Instead of feeding documents into probabilistic LLMs and hoping the output is correct, you build an orchestration layer that decomposes tasks into structured steps, routes each step to the appropriate capability, and enforces deterministic constraints on the output. The LLM provides intelligence. The orchestration layer provides reliability.
RAG asks: “given these documents, what answer would you generate?” An orchestrated system asks: “given this task specification, execute these steps in this order with these constraints and produce output in this format.” The difference is the difference between an experiment and infrastructure.
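A minimal sketch of what that task-specification style could look like. Everything here is invented for illustration (the `Step` type, the names, the constraints); the point is the shape: in a real system a step’s `run` might call an LLM, while its `validate` enforces a deterministic schema, language, or policy check.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]       # the work: may be probabilistic (e.g. an LLM call)
    validate: Callable[[dict], bool]  # deterministic constraint on the output

def orchestrate(task: dict, steps: list[Step]) -> dict:
    """Execute steps in a fixed order; refuse to pass a constraint-violating
    output downstream instead of hoping the next step copes with it."""
    state = dict(task)
    for step in steps:
        output = step.run(state)
        if not step.validate(output):
            raise ValueError(f"step '{step.name}' violated its constraint")
        state.update(output)
    return state
```

An invoice workflow, for example, could declare an “extract amount” step whose constraint rejects non-positive totals, so a hallucinated value fails loudly at that step instead of flowing silently into a report.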
The Counterargument, and Why It Proves the Point
The strongest counterargument comes from within the RAG community: “Agentic RAG.” A growing body of work proposes solving RAG’s limitations by wrapping classic RAG approaches in multi-agent orchestration: agents that decide when to retrieve, how to validate content, and when to loop back for more.
This is presented as RAG’s evolution. It is actually RAG’s concession.
When the solution to RAG’s unreliability is deterministic orchestration, structured workflows, and validation agents, you haven’t improved RAG. You’ve built an orchestration system that happens to use RAG as one input among many. The retrieval is no longer the architecture. The orchestration is. Every serious attempt to make RAG production-grade converges on the same place: orchestration, determinism, and structured task decomposition.
What Comes Next
Foundation model providers have strong incentives to position RAG as the enterprise solution as it keeps customers consuming API calls and storage. Being token-wasteful drives revenue, revenue drives valuation, valuation allows for more capital raising, capital builds more data centres/compute for more tokens - and so spins the flywheel. The “upload your docs and ask questions” story is simple to sell. It is also, as the research increasingly shows, a ceiling rather than a foundation.
The next generation of enterprise AI won’t retrieve and generate. It will orchestrate and execute — decomposing business processes into task-level operations, applying AI within deterministic workflow constraints, and producing auditable outputs at a scale that makes current RAG benchmarks look like a science fair project. The product needs to catch up to the market: not with more parameters or larger context windows, but with architecture that treats reliability as a design requirement, not a debugging exercise.
References
Li, B., Xu, Z. & Xie, R. (2025). “Language Drift in Multilingual Retrieval-Augmented Generation.” arXiv:2511.09984.
Weller, O., Boratko, M., Naim, I. & Lee, J. (2025). “On the Theoretical Limitations of Embedding-Based Retrieval.” arXiv:2508.21038. Google DeepMind.
Sun, Z. et al. (2025). “ReDeEP: Detecting Hallucination in RAG via Mechanistic Interpretability.” arXiv:2410.11414. ICLR 2025 Spotlight.
CoRect (2026). “Context-Aware Logit Contrast for Hidden State Rectification.” arXiv:2602.08221.
Bruckhaus, T. (2024). “RAG Does Not Work for Enterprises.” arXiv:2406.04369.
“Trustworthy Agentic AI Requires Deterministic Architectural Boundaries.” (2026). arXiv:2602.09947.
Li, X., Cao, Y., Ma, Y. & Sun, A. (2025). “Long Context vs. RAG for LLMs.” arXiv:2501.01880.
Bandara, E. et al. (2025). “A Practical Guide for Production-Grade Agentic AI Workflows.” arXiv:2512.08769.
¹ RAG is used across multiple types of pre-trained models. I am simplifying here for the purposes of storytelling.

