Most AI products handle more than plain text. Users upload screenshots, photos, PDFs, recordings, tables, and videos. A good multimodal system has to understand each input, keep track of the evidence behind its answer, and choose a suitable model or tool for the job.
Multimodal applications support workflows such as screenshot troubleshooting, product search by photo, document analysis across text and diagrams, meeting summarization, video search, and voice agents with visual context.
This chapter covers the main building blocks: multimodal embeddings, cross-modal search, multimodal RAG, routing, and practical architecture choices.
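As a rough preview of two of those responsibilities, choosing a handler for each input and keeping track of the evidence behind an answer, here is a minimal sketch. Everything in it is an illustrative assumption rather than a real library API: the extension table, the handler names (`vision_model`, `document_parser`, and so on), and the `Evidence` and `Answer` types are hypothetical. Production systems usually inspect file contents instead of trusting extensions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import PurePath

# Hypothetical mapping from file extension to modality (illustrative only).
MODALITY_BY_EXTENSION = {
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".pdf": "document",
    ".wav": "audio", ".mp3": "audio",
    ".csv": "table",
    ".mp4": "video",
    ".txt": "text",
}

# Hypothetical handler names; a real system would map to models or tools.
HANDLER_BY_MODALITY = {
    "image": "vision_model",
    "document": "document_parser",
    "audio": "speech_to_text",
    "table": "table_reader",
    "video": "video_indexer",
    "text": "text_model",
}


@dataclass(frozen=True)
class Evidence:
    """Pointer to the part of an input that supports an answer."""
    source: str    # file the evidence came from
    modality: str  # e.g. "image", "document"
    locator: str   # e.g. a page number, timestamp, or image region


@dataclass
class Answer:
    """Answer text plus the evidence it is based on."""
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def detect_modality(filename: str) -> str:
    """Guess the modality from the file extension (case-insensitive)."""
    suffix = PurePath(filename).suffix.lower()
    return MODALITY_BY_EXTENSION.get(suffix, "unknown")


def route(filename: str) -> str:
    """Pick a handler name for the input, with a fallback for unknown types."""
    return HANDLER_BY_MODALITY.get(detect_modality(filename), "fallback_handler")


if __name__ == "__main__":
    for name in ["error.PNG", "report.pdf", "standup.mp3", "notes.xyz"]:
        print(name, "->", route(name))
```

Keeping `Evidence` as a separate record, instead of folding sources into the answer text, is one way to let later stages cite or re-check the page, timestamp, or region an answer relied on.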