Most RAG systems start with text: documents, paragraphs, tickets, or code snippets. Real knowledge bases are messier. Useful evidence often lives in tables, charts, diagrams, screenshots, scanned PDFs, audio, and video.
Multimodal RAG extends retrieval beyond plain text. The system extracts, describes, embeds, retrieves, and cites information from different content types, then passes the relevant evidence to a model.
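The flow described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Evidence` record, the `embed`, `retrieve`, and `build_prompt` helpers, and the sample sources are all hypothetical names introduced here, and the bag-of-words `embed` function is a toy stand-in for a real embedding model. The idea it shows is that every modality is reduced to a retrievable text representation that keeps a pointer back to its source, so the final answer can cite it.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str    # where the content came from, kept for citation
    modality: str  # e.g. "table", "image", "audio"
    text: str      # extracted text or a generated description


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase term counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[Evidence], k: int = 2) -> list[Evidence]:
    # Rank all evidence by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, embed(e.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[Evidence]) -> str:
    # Number each piece of evidence so the model can cite it by index.
    lines = [f"[{i}] ({e.modality}, {e.source}) {e.text}"
             for i, e in enumerate(hits, start=1)]
    return "Evidence:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


# Hypothetical sample data: each item was "extracted" or "described" upstream.
index = [
    Evidence("report.pdf#page=3", "table",
             "Quarterly revenue table: Q1 revenue 1.2M, Q2 revenue 1.5M"),
    Evidence("arch.png", "image",
             "Diagram showing the API gateway routing requests to the auth service"),
    Evidence("standup.mp3@00:04:10", "audio",
             "Transcript: the deploy was rolled back after the database migration failed"),
]

hits = retrieve("What was Q2 revenue?", index, k=1)
print(build_prompt("What was Q2 revenue?", hits))
```

In a real system the extraction and description steps would be done by parsers, OCR, speech-to-text, or a vision model, and the embedding step by a dedicated model; the record shape and the citation pointer are the parts worth carrying over.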
This chapter covers practical strategies for PDFs, tables, images, audio, and video-derived content, then turns to multimodal embeddings and the cost trade-offs that determine how much multimodal processing is worth building.