Last Updated: March 14, 2026
LLMs are powerful, but they have a major limitation: they only know what was included in their training data. They cannot access your private documents, internal knowledge bases, or the latest information unless you explicitly provide it. This is where Retrieval-Augmented Generation (RAG) comes in.
RAG is a technique that combines information retrieval with language generation. Instead of asking the model to rely solely on what it memorized during training, we first retrieve relevant information from external data sources such as documents, databases, or knowledge bases.
That retrieved context is then provided to the model as part of the prompt, allowing it to generate responses that are more accurate, grounded, and up-to-date.
In this chapter, we will explore how RAG works, why it has become a foundational pattern in AI engineering, and how it enables you to build applications that are both intelligent and grounded in real data.
Imagine you are building a customer support chatbot for an e-commerce company. The chatbot needs to answer questions about the store's return policy, a customer's order status, this week's promotions, and obscure product details.
The first question might be answerable if the return policy happened to appear in the model's training data. But the other three require information that is either private (order status), constantly changing (promotions), or too specific to appear in any public dataset.
This is what we call the knowledge problem. It has three dimensions.
The model's training data has a cutoff date. Anything that happened after that date is invisible. A model trained in March 2025 knows nothing about events in April 2025. It cannot tell you about new product launches, updated regulations, or recent security vulnerabilities.
Your internal knowledge (company policies, engineering docs, customer data, meeting notes) was never in the training data to begin with. The model has zero knowledge of your organization's specific context.
Even within its training data, the model's knowledge is uneven. It knows a lot about popular topics and very little about niche domains. If your application deals with specialized medical procedures, obscure legal regulations, or proprietary engineering standards, the model's coverage is thin at best.
The worst part is not that the model does not know. It is that it does not know it does not know. LLMs generate text by predicting the most likely next token. When they lack real information, they fill the gap with plausible-sounding text. This is hallucination, and it makes the knowledge problem dangerous rather than just inconvenient.
So how do we fix this? There are three main approaches, and understanding the trade-offs between them is essential before you write a single line of code.
Fine-tuning takes a pre-trained model and continues training it on your specific data. You feed it examples of your documentation, your Q&A pairs, your domain knowledge, and the model updates its weights to incorporate this information.
This sounds like the obvious solution. The model does not know your data? Train it on your data. But in practice, fine-tuning solves a different problem than most people think.
Fine-tuning is excellent for teaching a model how to respond: the tone, format, style, and type of reasoning you want. It is not great for teaching a model what to know. Here is why.
When you fine-tune a model on your documentation, the knowledge gets baked into the model's weights. It becomes part of the model itself. That means every time your documentation changes, you need to fine-tune again. New product feature? Retrain. Updated policy? Retrain. Fixed a typo in your API docs? Technically, retrain.
Each fine-tuning run takes hours, costs real money (GPU compute is not cheap), and requires careful data preparation. For a company that updates its docs weekly, this cycle becomes unsustainable fast.
There is also a subtler problem: fine-tuning can degrade the model's general capabilities. This is called catastrophic forgetting. The model gets better at your specific domain but worse at everything else. Finding the right balance requires careful experimentation.
RAG takes a completely different approach. Instead of baking knowledge into the model, you keep it separate and inject it at query time.
When a user asks a question, you first search your knowledge base for relevant documents. Then you paste those documents into the LLM's prompt alongside the question. The model reads the provided context and generates an answer based on it.
The model itself never changes. What changes is the input you construct for each query.
This is a powerful separation of concerns. The model handles language understanding and generation. Your knowledge base handles information storage and retrieval. Update a document, and the very next query sees the change. No retraining, no GPU costs, no waiting.
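The query-time injection step is just string assembly. A minimal sketch (the instruction wording and the numbered-chunk format are illustrative choices, not a fixed standard):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What is the return policy?",
    ["Returns are accepted within 30 days of delivery."],
)
print(prompt)
```

The model never sees your knowledge base directly; it sees only whatever this function puts in front of it for the current query.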
Modern LLMs support increasingly large context windows. GPT-4o handles 128K tokens. Claude supports 200K tokens. Gemini can process over 1 million tokens. Why not just stuff all your documents into the prompt?
For small knowledge bases, this actually works. If your entire documentation fits in 50K tokens (roughly 40 pages), long context is simpler than building a RAG pipeline. No chunking, no embeddings, no vector database. Just paste everything in and ask your question.
But this approach hits practical limits quickly. Cost scales linearly with context length: processing 100K tokens costs roughly 50 times more per query than processing 2K tokens. Latency increases too, since longer prompts mean slower responses. And there is the well-documented "lost in the middle" problem: models struggle to use information buried deep within very long contexts, paying more attention to the beginning and end.
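The cost argument is back-of-envelope arithmetic. A quick sketch (the per-token price is a placeholder, not a real quote; substitute your provider's actual rate):

```python
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # placeholder USD rate, not a real price quote

def daily_input_cost(prompt_tokens: int, queries_per_day: int) -> float:
    """Daily input-token cost for a given prompt size and traffic level."""
    return prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * queries_per_day

rag_cost = daily_input_cost(2_000, 10_000)       # a few retrieved chunks per query
stuffed_cost = daily_input_cost(100_000, 10_000)  # entire corpus in every prompt
print(f"RAG: ${rag_cost:.2f}/day, long context: ${stuffed_cost:.2f}/day")
```

Because cost is linear in prompt length, the 100K-token prompt costs exactly 50 times more than the 2K-token one at any traffic level.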
Here is how the three approaches stack up across the dimensions that matter most in practice.
For most real-world applications where you need an LLM to answer questions about your data, RAG is the right starting point. It handles large knowledge bases, supports instant updates, keeps costs reasonable, and does not require expensive retraining cycles.
That said, these approaches are not mutually exclusive. Many production systems combine them. You might fine-tune for domain-specific reasoning and use RAG for knowledge retrieval. Or use long context for a small set of critical documents and RAG for the rest.
RAG has two phases: an offline phase where you prepare your knowledge base, and an online phase where you answer questions.
Before you can retrieve anything, you need to turn your documents into something searchable. This happens once upfront, with updates whenever your content changes.
You load your documents, split them into chunks (smaller pieces that can be independently retrieved), convert each chunk into a vector embedding that captures its semantic meaning, and store everything in a vector database. Chunking strategy has a huge impact on retrieval quality, but we will keep things simple for now and build more sophisticated pipelines in the next chapter.
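The offline phase can be sketched end to end in plain Python. Here the "embedding" is a toy bag-of-words counter and the "vector database" is a plain list, so the sketch runs without any external services; a real pipeline would swap in an embedding model and a vector store:

```python
import math
from collections import Counter

def chunk(text: str, max_words: int = 50) -> list[str]:
    """Naive fixed-size chunking by word count; real pipelines do better."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding', a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline phase: load -> chunk -> embed -> store.
documents = [
    "Acme ships orders within 2 business days. Express shipping is available.",
    "Returns are accepted within 30 days of delivery for a full refund.",
]
index = []  # stand-in for a vector database
for doc in documents:
    for c in chunk(doc):
        index.append({"text": c, "vector": embed(c)})

# Online-phase preview: retrieve the chunk most similar to a query.
query_vec = embed("how do returns work")
best = max(index, key=lambda e: cosine(e["vector"], query_vec))
print(best["text"])
```

The shape is what matters: every chunk is stored alongside its vector, and retrieval is nothing more than a nearest-neighbor search over those vectors.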
When a user asks a question, here is what happens step by step. The question is embedded using the same model that embedded the documents. The vector store is searched for the chunks most similar to that embedding. The retrieved chunks are assembled into a context string, which is passed to the LLM alongside the original question. Finally, the model generates an answer grounded in that context.
The beauty of this pattern is its simplicity. The model does not need to "know" anything about your domain. It just needs to read the provided context and generate a coherent answer. That is something LLMs are already very good at.
RAG was introduced in a 2020 paper by Lewis et al. at Facebook AI Research. Since then, it has become the dominant approach for building knowledge-grounded LLM applications. Here is why it won over the alternatives for this specific use case.
RAG cleanly separates knowledge (stored in the vector database) from reasoning (handled by the LLM). This means you can update knowledge without touching the model, and upgrade the model without rebuilding your knowledge base.
When a document changes, you re-embed the affected chunks and update the vector store. The next query immediately sees the new information. Compare this to fine-tuning, where an update means hours of retraining.
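The update path is a targeted write, not a rebuild. A minimal sketch, where the store is a plain dict standing in for a vector database and `reembed` is a stub for a real embedding call (both names are illustrative):

```python
def reembed(text: str) -> list[float]:
    """Stub for a real embedding-model call."""
    return [float(len(text))]  # placeholder vector

store = {
    "policy-chunk-1": {
        "text": "Returns accepted within 14 days.",
        "vector": reembed("Returns accepted within 14 days."),
    },
}

def update_chunk(chunk_id: str, new_text: str) -> None:
    # Only the changed chunk is re-embedded; the rest of the store is untouched.
    store[chunk_id] = {"text": new_text, "vector": reembed(new_text)}

update_chunk("policy-chunk-1", "Returns accepted within 30 days.")
```

The very next query that retrieves this chunk sees the new policy, with no retraining involved.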
You only pay to process the retrieved chunks, not your entire knowledge base. A query that retrieves 5 chunks of 200 tokens each costs the same whether your knowledge base has 100 documents or 100,000.
You know exactly which documents the model used to generate its answer. This enables citations, source attribution, and auditing: capabilities that are nearly impossible with fine-tuning, where knowledge is opaquely encoded in model weights.
The model stays exactly as capable as it was before. You are augmenting it with context, not modifying it. A RAG-powered customer support bot can still handle general conversation because the underlying model has not been altered.
RAG is model-agnostic. You can swap GPT-4o for Claude or an open-source model without rebuilding your knowledge pipeline. The retrieval layer stays the same.
RAG is powerful, but it is not a universal solution. There are scenarios where other approaches work better.
If you want the model to respond in a specific tone, follow a strict format, or reason in a domain-specific way, fine-tuning is more appropriate. RAG gives the model information. Fine-tuning changes how it processes information.
If your entire documentation fits in 20-30 pages, long context is simpler. You skip the complexity of building a retrieval pipeline and just paste everything into the prompt. The cost difference is minimal at this scale.
Some domains have documents that are so similar to each other that semantic search struggles to distinguish between them. Legal contracts with subtle but critical differences, medical protocols that share 90% of their language, code snippets that look alike but behave differently. In these cases, RAG may retrieve the wrong context, leading to wrong answers with high confidence.
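You can see the failure mode with a toy similarity measure. Bag-of-words vectors exaggerate the effect, but real embeddings show the same pattern: near-identical wording yields near-identical vectors, even when the difference is the whole point:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

a = embed("payment is due within 30 days of the invoice date")
b = embed("payment is due within 60 days of the invoice date")
print(round(cosine(a, b), 3))  # high similarity despite a contractually critical difference
```

When the chunks your retriever must distinguish differ by one number or one negation, similarity search alone is a weak signal, and the model will confidently answer from the wrong context.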
RAG works with a pre-indexed knowledge base. If the answer depends on live data (current stock prices, real-time sensor readings, live API responses), you need tool use or function calling instead of, or in addition to, RAG.
Standard RAG retrieves a handful of chunks and asks the model to synthesize an answer. If the answer requires combining information from dozens of documents with multi-step reasoning, basic RAG falls short. You need more advanced patterns like GraphRAG or multi-hop retrieval, which we cover later in this module.
Enough theory. Let's build a working RAG system from scratch. This implementation is deliberately minimal: no frameworks, no abstractions, just the core pattern.
We will use ChromaDB as our vector store, OpenAI's text-embedding-3-small for embeddings, and gpt-4o-mini for generation.
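A minimal implementation along these lines is sketched below. The document texts, collection name, and helper names are illustrative; running the demo requires an OPENAI_API_KEY in the environment plus `pip install chromadb openai`:

```python
import os

DOCUMENTS = [
    "Acme ships standard orders within 2 business days.",
    "Express shipping delivers within 24 hours for a $15 fee.",
    "Returns are accepted within 30 days of delivery for a full refund.",
    "Refunds are processed to the original payment method in 5-7 days.",
    "Acme's loyalty program gives 1 point per dollar spent.",
    "Points can be redeemed for discounts at checkout.",
    "Customer support is available 24/7 via chat and email.",
    "Orders over $50 qualify for free standard shipping.",
    "Gift cards never expire and can be used on any product.",
    "Damaged items are replaced free of charge within 14 days.",
]

def build_prompt(question: str, context: str) -> str:
    """Instruct the model to answer only from the retrieved context."""
    return (
        "Answer the question using only the context below. If the answer "
        "is not in the context, say you don't have that information.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def main() -> None:
    # Third-party imports kept local so the module loads without them installed.
    import chromadb
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(texts: list[str]) -> list[list[float]]:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in resp.data]

    # Offline phase: embed the documents and store them in ChromaDB.
    collection = chromadb.Client().create_collection("acme_docs")
    collection.add(
        ids=[str(i) for i in range(len(DOCUMENTS))],
        documents=DOCUMENTS,
        embeddings=embed(DOCUMENTS),
    )

    def ask(question: str) -> str:
        # Online phase: embed the question, retrieve the 3 closest chunks,
        # and generate an answer grounded in them.
        results = collection.query(query_embeddings=embed([question]), n_results=3)
        context = "\n".join(results["documents"][0])
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": build_prompt(question, context)}],
        )
        return response.choices[0].message.content

    for q in [
        "What is the return policy?",
        "How fast is express shipping?",
        "What programming languages does Acme support?",
    ]:
        print(f"Q: {q}\nA: {ask(q)}\n")

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    main()  # run the demo only when an API key is configured
```

The important design choice is the system-style instruction in `build_prompt`: by telling the model to refuse when the answer is absent, you trade a little helpfulness for a large reduction in hallucination.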
Let's walk through what this code does.
First, we create a ChromaDB collection and add 10 documents. Each document gets embedded using text-embedding-3-small and stored alongside its text. In a real system, these would be chunks from your actual documentation, not handwritten sentences.
The ask function implements the core RAG loop. It embeds the user's question using the same model, queries ChromaDB for the 3 most similar documents, assembles them into a context string, and passes everything to GPT-4o-mini with instructions to only use the provided context.
Notice the last query: "What programming languages does Acme support?" None of our documents mention programming languages, so the model correctly responds that it does not have that information. This is exactly the behavior we want. Without RAG, the model might have hallucinated an answer. With RAG, it knows the boundaries of its knowledge because those boundaries are defined by the context you provide.
This 50-line system captures the core RAG pattern, but it cuts many corners that production systems cannot afford to cut.
We hardcoded 10 sentences. Real systems ingest PDFs, Markdown files, HTML pages, and database records, each requiring different parsing logic.
Our "documents" are single sentences. Real documents are pages or chapters that need intelligent splitting to balance context preservation with retrieval precision. Chunking strategy dramatically affects retrieval quality.
We do not track where each chunk came from, what page it was on, or when it was last updated. Without metadata, you cannot provide citations or filter by source.
API calls fail. Embeddings take time. Vector stores can be unavailable. Production pipelines need retry logic, timeouts, and graceful degradation.
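The retry pattern is worth sketching, since every call in the pipeline needs it. A minimal exponential-backoff wrapper (which exception types count as transient depends on your client library; the `flaky` function below just simulates one):

```python
import random
import time

def with_retries(fn, *, attempts=4, base_delay=0.5, retry_on=(Exception,)):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # succeeds on the third attempt
```

In production you would scope `retry_on` to genuinely transient errors (timeouts, rate limits) so that permanent failures like authentication errors fail fast instead of burning retries.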
Users ask vague, misspelled, or multi-part questions. Production systems transform queries before retrieval to improve results.
We have no way to measure whether our system is actually giving good answers. Without evaluation, you are flying blind.
These are not nice-to-haves. They are the difference between a demo that impresses in a meeting and a system that actually serves users reliably. The next chapter dives deep into each of these gaps and shows you how to build a production-grade RAG pipeline.