Last Updated: March 14, 2026
LLMs are powerful, but they have a major limitation: they only know what was included in their training data. They cannot access your private documents, internal knowledge bases, or the latest information unless you explicitly provide it. This is where Retrieval-Augmented Generation (RAG) comes in.
RAG is a technique that combines information retrieval with language generation. Instead of asking the model to rely solely on what it memorized during training, we first retrieve relevant information from external data sources such as documents, databases, or knowledge bases.
That retrieved context is then provided to the model as part of the prompt, allowing it to generate responses that are more accurate, grounded, and up-to-date.
In this chapter, we will explore how RAG works, why it has become a foundational pattern in AI engineering, and how it enables you to build applications that are both intelligent and grounded in real data.
Imagine you are building a customer support chatbot for an e-commerce company. The chatbot needs to answer questions about the store's return policy, a customer's order status, this week's promotions, and obscure product details.
The first question might be answerable if the return policy happened to appear in the model's training data. But the other three require information that is either private (order status), constantly changing (promotions), or too specific to appear in any public dataset.
This is what we call the knowledge problem. It has three dimensions.
The model's training data has a cutoff date. Anything that happened after that date is invisible. A model trained in March 2025 knows nothing about events in April 2025. It cannot tell you about new product launches, updated regulations, or recent security vulnerabilities.
Your internal knowledge (company policies, engineering docs, customer data, meeting notes) was never in the training data to begin with. The model has zero knowledge of your organization's specific context.
Even within its training data, the model's knowledge is uneven. It knows a lot about popular topics and very little about niche domains. If your application deals with specialized medical procedures, obscure legal regulations, or proprietary engineering standards, the model's coverage is thin at best.
The worst part is not that the model does not know. It is that it does not know it does not know. LLMs generate text by predicting the most likely next token. When they lack real information, they fill the gap with plausible-sounding text. This is hallucination, and it makes the knowledge problem dangerous rather than just inconvenient.
So how do we fix this? There are three main approaches, and understanding the trade-offs between them is essential before you write a single line of code.
Fine-tuning takes a pre-trained model and continues training it on your specific data. You feed it examples of your documentation, your Q&A pairs, your domain knowledge, and the model updates its weights to incorporate this information.
This sounds like the obvious solution. The model does not know your data? Train it on your data. But in practice, fine-tuning solves a different problem than most people think.
Fine-tuning is excellent for teaching a model how to respond: the tone, format, style, and type of reasoning you want. It is not great for teaching a model what to know. Here is why.
When you fine-tune a model on your documentation, the knowledge gets baked into the model's weights. It becomes part of the model itself. That means every time your documentation changes, you need to fine-tune again. New product feature? Retrain. Updated policy? Retrain. Fixed a typo in your API docs? Technically, retrain.
Each fine-tuning run takes hours, costs real money (GPU compute is not cheap), and requires careful data preparation. For a company that updates its docs weekly, this cycle becomes unsustainable fast.
There is also a subtler problem: fine-tuning can degrade the model's general capabilities. This is called catastrophic forgetting. The model gets better at your specific domain but worse at everything else. Finding the right balance requires careful experimentation.
RAG takes a completely different approach. Instead of baking knowledge into the model, you keep it separate and inject it at query time.
When a user asks a question, you first search your knowledge base for relevant documents. Then you paste those documents into the LLM's prompt alongside the question. The model reads the provided context and generates an answer based on it.
The model itself never changes. What changes is the input you construct for each query.
This is a powerful separation of concerns. The model handles language understanding and generation. Your knowledge base handles information storage and retrieval. Update a document, and the very next query sees the change. No retraining, no GPU costs, no waiting.
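The query-time injection step is just string assembly. A minimal sketch (the instruction wording and the numbered-chunk format are illustrative choices, not a fixed standard):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What is the return policy?",
    ["Returns are accepted within 30 days of delivery."],
)
print(prompt)
```

The model never sees your knowledge base directly; it sees only whatever this function puts in front of it for the current query.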
Modern LLMs support increasingly large context windows. GPT-4o handles 128K tokens. Claude supports 200K tokens. Gemini can process over 1 million tokens. Why not just stuff all your documents into the prompt?
For small knowledge bases, this actually works. If your entire documentation fits in 50K tokens (roughly 40 pages), long context is simpler than building a RAG pipeline. No chunking, no embeddings, no vector database. Just paste everything in and ask your question.
But this approach hits practical limits quickly. Cost scales linearly with context length: processing 100K tokens costs roughly 50 times more per query than processing 2K tokens. Latency increases too, since longer prompts mean slower responses. And there is the well-documented "lost in the middle" problem: models struggle to use information buried deep within very long contexts, paying more attention to the beginning and end.
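The cost argument is back-of-envelope arithmetic. A quick sketch (the per-token price is a placeholder, not a real quote; substitute your provider's actual rate):

```python
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # placeholder USD rate, not a real price quote

def daily_input_cost(prompt_tokens: int, queries_per_day: int) -> float:
    """Daily input-token cost for a given prompt size and traffic level."""
    return prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * queries_per_day

rag_cost = daily_input_cost(2_000, 10_000)       # a few retrieved chunks per query
stuffed_cost = daily_input_cost(100_000, 10_000)  # entire corpus in every prompt
print(f"RAG: ${rag_cost:.2f}/day, long context: ${stuffed_cost:.2f}/day")
```

Because cost is linear in prompt length, the 100K-token prompt costs exactly 50 times more than the 2K-token one at any traffic level.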
Here is how the three approaches stack up across the dimensions that matter most in practice.
For most real-world applications where you need an LLM to answer questions about your data, RAG is the right starting point. It handles large knowledge bases, supports instant updates, keeps costs reasonable, and does not require expensive retraining cycles.
That said, these approaches are not mutually exclusive. Many production systems combine them. You might fine-tune for domain-specific reasoning and use RAG for knowledge retrieval. Or use long context for a small set of critical documents and RAG for the rest.
RAG has two phases: an offline phase where you prepare your knowledge base, and an online phase where you answer questions.
Before you can retrieve anything, you need to turn your documents into something searchable. This happens once upfront, with updates whenever your content changes.
You load your documents, split them into chunks (smaller pieces that can be independently retrieved), convert each chunk into a vector embedding that captures its semantic meaning, and store everything in a vector database. Chunking strategy has a huge impact on retrieval quality, but we will keep things simple for now and build more sophisticated pipelines in the next chapter.
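The offline phase can be sketched end to end in plain Python. Here the "embedding" is a toy bag-of-words counter and the "vector database" is a plain list, so the sketch runs without any external services; a real pipeline would swap in an embedding model and a vector store:

```python
import math
from collections import Counter

def chunk(text: str, max_words: int = 50) -> list[str]:
    """Naive fixed-size chunking by word count; real pipelines do better."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding', a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline phase: load -> chunk -> embed -> store.
documents = [
    "Acme ships orders within 2 business days. Express shipping is available.",
    "Returns are accepted within 30 days of delivery for a full refund.",
]
index = []  # stand-in for a vector database
for doc in documents:
    for c in chunk(doc):
        index.append({"text": c, "vector": embed(c)})

# Online-phase preview: retrieve the chunk most similar to a query.
query_vec = embed("how do returns work")
best = max(index, key=lambda e: cosine(e["vector"], query_vec))
print(best["text"])
```

The shape is what matters: every chunk is stored alongside its vector, and retrieval is nothing more than a nearest-neighbor search over those vectors.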
When a user asks a question, here is what happens step by step. The question is embedded using the same model that embedded the documents. The vector store is searched for the chunks most similar to that embedding. The retrieved chunks are assembled into a context string, which is passed to the LLM alongside the original question. Finally, the model generates an answer grounded in that context.
The beauty of this pattern is its simplicity. The model does not need to "know" anything about your domain. It just needs to read the provided context and generate a coherent answer. That is something LLMs are already very good at.
RAG was introduced in a 2020 paper by Lewis et al. at Facebook AI Research. Since then, it has become the dominant approach for building knowledge-grounded LLM applications. Here is why it won over the alternatives for this specific use case.
RAG cleanly separates knowledge (stored in the vector database) from reasoning (handled by the LLM). This means you can update knowledge without touching the model, and upgrade the model without rebuilding your knowledge base.
When a document changes, you re-embed the affected chunks and update the vector store. The next query immediately sees the new information. Compare this to fine-tuning, where an update means hours of retraining.
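The update path is a targeted write, not a rebuild. A minimal sketch, where the store is a plain dict standing in for a vector database and `reembed` is a stub for a real embedding call (both names are illustrative):

```python
def reembed(text: str) -> list[float]:
    """Stub for a real embedding-model call."""
    return [float(len(text))]  # placeholder vector

store = {
    "policy-chunk-1": {
        "text": "Returns accepted within 14 days.",
        "vector": reembed("Returns accepted within 14 days."),
    },
}

def update_chunk(chunk_id: str, new_text: str) -> None:
    # Only the changed chunk is re-embedded; the rest of the store is untouched.
    store[chunk_id] = {"text": new_text, "vector": reembed(new_text)}

update_chunk("policy-chunk-1", "Returns accepted within 30 days.")
```

The very next query that retrieves this chunk sees the new policy, with no retraining involved.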
You only pay to process the retrieved chunks, not your entire knowledge base. A query that retrieves 5 chunks of 200 tokens each costs the same whether your knowledge base has 100 documents or 100,000.
You know exactly which documents the model used to generate its answer. This enables citations, source attribution, and auditing: capabilities that are nearly impossible with fine-tuning, where knowledge is opaquely encoded in model weights.
The model stays exactly as capable as it was before. You are augmenting it with context, not modifying it. A RAG-powered customer support bot can still handle general conversation because the underlying model has not been altered.
RAG is model-agnostic. You can swap GPT-4o for Claude or an open-source model without rebuilding your knowledge pipeline. The retrieval layer stays the same.
RAG is powerful, but it is not a universal solution. There are scenarios where other approaches work better.
If you want the model to respond in a specific tone, follow a strict format, or reason in a domain-specific way, fine-tuning is more appropriate. RAG gives the model information. Fine-tuning changes how it processes information.
If your entire documentation fits in 20-30 pages, long context is simpler. You skip the complexity of building a retrieval pipeline and just paste everything into the prompt. The cost difference is minimal at this scale.
Some domains have documents that are so similar to each other that semantic search struggles to distinguish between them. Legal contracts with subtle but critical differences, medical protocols that share 90% of their language, code snippets that look alike but behave differently. In these cases, RAG may retrieve the wrong context, leading to wrong answers with high confidence.
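You can see the failure mode with a toy similarity measure. Bag-of-words vectors exaggerate the effect, but real embeddings show the same pattern: near-identical wording yields near-identical vectors, even when the difference is the whole point:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

a = embed("payment is due within 30 days of the invoice date")
b = embed("payment is due within 60 days of the invoice date")
print(round(cosine(a, b), 3))  # high similarity despite a contractually critical difference
```

When the chunks your retriever must distinguish differ by one number or one negation, similarity search alone is a weak signal, and the model will confidently answer from the wrong context.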
RAG works with a pre-indexed knowledge base. If the answer depends on live data (current stock prices, real-time sensor readings, live API responses), you need tool use or function calling instead of, or in addition to, RAG.
Standard RAG retrieves a handful of chunks and asks the model to synthesize an answer. If the answer requires combining information from dozens of documents with multi-step reasoning, basic RAG falls short. You need more advanced patterns like GraphRAG or multi-hop retrieval, which we cover later in this module.
Enough theory. Let's build a working RAG system from scratch. This implementation is deliberately minimal: no frameworks, no abstractions, just the core pattern.
We will use ChromaDB as our vector store, OpenAI's text-embedding-3-small for embeddings, and gpt-4o-mini for generation.
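A minimal implementation along these lines is sketched below. The document texts, collection name, and helper names are illustrative; running the demo requires an OPENAI_API_KEY in the environment plus `pip install chromadb openai`:

```python
import os

DOCUMENTS = [
    "Acme ships standard orders within 2 business days.",
    "Express shipping delivers within 24 hours for a $15 fee.",
    "Returns are accepted within 30 days of delivery for a full refund.",
    "Refunds are processed to the original payment method in 5-7 days.",
    "Acme's loyalty program gives 1 point per dollar spent.",
    "Points can be redeemed for discounts at checkout.",
    "Customer support is available 24/7 via chat and email.",
    "Orders over $50 qualify for free standard shipping.",
    "Gift cards never expire and can be used on any product.",
    "Damaged items are replaced free of charge within 14 days.",
]

def build_prompt(question: str, context: str) -> str:
    """Instruct the model to answer only from the retrieved context."""
    return (
        "Answer the question using only the context below. If the answer "
        "is not in the context, say you don't have that information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def main() -> None:
    # Third-party imports kept local so the module loads without them installed.
    import chromadb
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(texts: list[str]) -> list[list[float]]:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in resp.data]

    # Offline phase: embed the documents and store them in ChromaDB.
    collection = chromadb.Client().create_collection("acme_docs")
    collection.add(
        ids=[str(i) for i in range(len(DOCUMENTS))],
        documents=DOCUMENTS,
        embeddings=embed(DOCUMENTS),
    )

    def ask(question: str) -> str:
        # Online phase: embed the question, retrieve the 3 closest chunks,
        # and generate an answer grounded in them.
        results = collection.query(query_embeddings=embed([question]), n_results=3)
        context = "\n".join(results["documents"][0])
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": build_prompt(question, context)}],
        )
        return response.choices[0].message.content

    for q in [
        "What is the return policy?",
        "How fast is express shipping?",
        "What programming languages does Acme support?",
    ]:
        print(f"Q: {q}\nA: {ask(q)}\n")

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    main()  # run the demo only when an API key is configured
```

The important design choice is the system-style instruction in `build_prompt`: by telling the model to refuse when the answer is absent, you trade a little helpfulness for a large reduction in hallucination.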
Let's walk through what this code does.
First, we create a ChromaDB collection and add 10 documents. Each document gets embedded using text-embedding-3-small and stored alongside its text. In a real system, these would be chunks from your actual documentation, not handwritten sentences.
The ask function implements the core RAG loop. It embeds the user's question using the same model, queries ChromaDB for the 3 most similar documents, assembles them into a context string, and passes everything to GPT-4o-mini with instructions to only use the provided context.
Notice the last query: "What programming languages does Acme support?" None of our documents mention programming languages, so the model correctly responds that it does not have that information. This is exactly the behavior we want. Without RAG, the model might have hallucinated an answer. With RAG, it knows the boundaries of its knowledge because those boundaries are defined by the context you provide.
This 50-line system captures the core RAG pattern, but it cuts many corners that production systems cannot afford to cut.
We hardcoded 10 sentences. Real systems ingest PDFs, Markdown files, HTML pages, and database records, each requiring different parsing logic.
Our "documents" are single sentences. Real documents are pages or chapters that need intelligent splitting to balance context preservation with retrieval precision. Chunking strategy dramatically affects retrieval quality.
We do not track where each chunk came from, what page it was on, or when it was last updated. Without metadata, you cannot provide citations or filter by source.
API calls fail. Embeddings take time. Vector stores can be unavailable. Production pipelines need retry logic, timeouts, and graceful degradation.
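The retry pattern is worth sketching, since every call in the pipeline needs it. A minimal exponential-backoff wrapper (which exception types count as transient depends on your client library; the `flaky` function below just simulates one):

```python
import random
import time

def with_retries(fn, *, attempts=4, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # succeeds on the third attempt
```

In production you would scope `retry_on` to genuinely transient errors (timeouts, rate limits) so that permanent failures like authentication errors fail fast instead of burning retries.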
Users ask vague, misspelled, or multi-part questions. Production systems transform queries before retrieval to improve results.
We have no way to measure whether our system is actually giving good answers. Without evaluation, you are flying blind.
These are not nice-to-haves. They are the difference between a demo that impresses in a meeting and a system that actually serves users reliably. The next chapter dives deep into each of these gaps and shows you how to build a production-grade RAG pipeline.