Building a RAG system is only half the work. The harder question is: how do you know it works, and how do you know it still works after the next change?
RAG quality comes from the whole pipeline. Parsing, chunking, embeddings, metadata filters, retrieval, reranking, prompt construction, model choice, and abstention logic can each fail independently. If the final answer is poor, saying "the model made an unsupported claim" is usually not specific enough to diagnose the problem.
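One way to make a diagnosis more specific is to record, for each evaluated question, which chunks were retrieved and which actually reached the prompt, then assign each failure to the earliest stage that lost the evidence. The sketch below assumes you have human-labeled gold chunk IDs and a correctness verdict per answer. The `EvalRecord` fields, the `attribute_failure` function, and the stage labels are hypothetical names chosen for illustration, not part of any library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question. All field names are illustrative assumptions."""
    question: str
    gold_chunk_ids: set[str]        # chunks labeled as containing the answer
    retrieved_chunk_ids: list[str]  # chunks returned by the retriever
    context_chunk_ids: list[str]    # chunks that actually reached the prompt
    answer_correct: bool            # verdict from a human or automated judge


def attribute_failure(record: EvalRecord) -> str:
    """Return a coarse label for the earliest stage that lost the evidence."""
    if record.answer_correct:
        return "ok"
    gold = record.gold_chunk_ids
    if not gold & set(record.retrieved_chunk_ids):
        # Evidence never surfaced: parsing, chunking, embeddings,
        # metadata filters, or retrieval itself may be responsible.
        return "retrieval_or_upstream"
    if not gold & set(record.context_chunk_ids):
        # Evidence was retrieved but dropped before the prompt was built.
        return "rerank_or_truncation"
    # Evidence was in the prompt, yet the answer was still wrong.
    return "generation"


records = [
    EvalRecord("q1", {"c1"}, ["c7", "c9"], ["c7"], False),
    EvalRecord("q2", {"c2"}, ["c2", "c5"], ["c5"], False),
    EvalRecord("q3", {"c3"}, ["c3"], ["c3"], False),
    EvalRecord("q4", {"c4"}, ["c4"], ["c4"], True),
]
counts = Counter(attribute_failure(r) for r in records)
```

The labels are deliberately coarse. A tally like `counts` only indicates where to look first; separating a chunking problem from an embedding problem still requires inspecting the individual records.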
This chapter covers practical evaluation: retrieval metrics, generation metrics, fixed evaluation datasets, automated judges, and regression workflows.