{"title":"What are Multimodal LLMs?","description":"","content":"ChatGPT can look at a photo of your fridge and suggest recipes. Claude can analyze charts and diagrams. GPT-4 can explain memes.\n\nA few years ago, this was science fiction. Language models could only process text. If you wanted image understanding, you needed a completely separate system.\n\nToday, a single model handles both. You can upload a screenshot of an error, paste your question, and get a coherent answer that references specific parts of the image.\n\nThese are Multimodal LLMs, and they represent one of the most significant advances in AI since the original Transformer.\n\nIn this article, we will break down what multimodal LLMs are, how they combine vision and language, and the two main approaches used to build them.\n\n---\n\n# What is a Multimodal LLM?\n\nA multimodal LLM is a large language model that can process multiple types of input (called \"modalities\") beyond just text.\n\nThe most common modalities are:\n\n- **Text** (the original LLM input)\n- **Images** (photos, screenshots, diagrams)\n- **Audio** (speech, music)\n- **Video** (sequences of images with optional audio)\n\nMost multimodal LLMs today focus on image + text. You provide an image and a text prompt, and the model generates a text response.\n\n\n[Embed: https://link.excalidraw.com/readonly/sxqJmAcKFsv2caHMrKFh](https://link.excalidraw.com/readonly/sxqJmAcKFsv2caHMrKFh)\n\n\n#### Example:\n\n- Input Image: [Photo of a cat on a laptop]\n- Input Text: \"What is happening in this image?\"\n- Output: \"A tabby cat is sitting on a laptop keyboard, looking at the screen.\"\n\nThe key insight is that the model does not just describe what it sees. 
It understands the relationship between the image and your question, allowing it to answer specific queries, extract information, or even explain complex diagrams.\n\n---\n\n# Why Do We Need Multimodal LLMs?\n\nText-only LLMs are powerful, but they are blind.\n\nConsider these tasks that text-only models cannot do:\n\n1. **Analyzing a chart:** \"What was the revenue in Q3?\"\n2. **Debugging from a screenshot:** \"Why is my CSS layout broken?\"\n3. **Reading a receipt:** \"How much did I spend on groceries?\"\n4. **Understanding memes:** \"Why is this funny?\"\n5. **Processing documents:** \"Extract the data from this PDF table\"\n\nBefore multimodal LLMs, solving these required separate specialized models: an OCR system for text extraction, a computer vision model for object detection, and an LLM for reasoning. You had to stitch them together yourself.\n\nMultimodal LLMs collapse this entire pipeline into a single model that can see, read, and reason in one step.\n\n---\n\n# The Core Challenge: Bridging Vision and Language\n\nHere is the fundamental problem: LLMs process sequences of tokens. They understand text because text is already discrete, made of words and characters.\n\nImages are different. An image is a grid of pixels, continuous values representing colors. There is no natural \"vocabulary\" for images like there is for text.\n\n\n```shell\nText: \"The cat sat on the mat\"\n [Token1][Token2][Token3][Token4][Token5][Token6]\n\nImage: [224 x 224 grid of RGB pixels]\n = 150,528 individual color values\n```\n\n\nSo how do you feed an image into a model designed for tokens?\n\nThe answer involves converting images into a sequence of \"visual tokens\" that the LLM can process alongside text tokens. 
The two main approaches differ in how they combine these visual tokens with the language model.\n\n---\n\n# How Images Are Converted to Tokens\n\nBefore we dive into the two main architectures, let us understand how images become tokens.\n\n### Step 1: Divide the Image into Patches\n\nJust like tokenizers split text into subwords, we split images into patches.\n\nA typical approach divides a 224x224 pixel image into 16x16 pixel patches, which gives a 14x14 grid of 196 patches:\n\n\n```shell\nImage size:        224 x 224 pixels\nPatch size:        16 x 16 pixels\nGrid size:         14 x 14 patches\nTotal patches:     196\nValues per patch:  16 x 16 x 3 (RGB) = 768\n```\n\n\nEach 16x16 patch contains 256 pixels, and each pixel has 3 color channels (RGB), giving us 768 values per patch. These patches become the \"tokens\" that a Vision Transformer processes, just like words in text.\n\n### Step 2: Encode Patches with a Vision Encoder\n\nEach patch is then processed by a Vision Transformer (ViT), a neural network designed to understand image content.\n\nThe most common choice is CLIP (Contrastive Language-Image Pretraining), a model trained by OpenAI to connect images and text. CLIP was trained on roughly 400 million image-text pairs, so it already \"understands\" images in a way that relates to language.\n\n\n```shell\nImage Patches (196 patches)\n          |\n          v\n+-------------------+\n|  Vision Encoder   |\n|    (CLIP ViT)     |\n+-------------------+\n          |\n          v\nImage Embeddings (196 vectors of size 768-1024)\n```\n\n\nAfter the vision encoder, each patch becomes a dense vector (typically 768 to 1024 dimensions) that captures the visual content of that region.\n\n### Step 3: Project to Match LLM Dimensions\n\nThe vision encoder outputs might not match the LLM's embedding dimension. A projector (usually a simple linear layer or small MLP) maps the image embeddings to the LLM's dimension.\n\n\n```shell\nImage Embeddings       Projector       LLM-Compatible\n(768 dimensions)  -->  (Linear)   -->  (4096 dimensions)\n```\n\n\nThink of the projector as a translator. It takes the \"visual language\" from the vision encoder and translates it into the \"language\" the LLM understands.\n\nNow we have a sequence of visual tokens that can be combined with text tokens. But how we combine them determines the architecture.\n\n---\n\n# Two Main Approaches\n\nThere are two dominant approaches to building multimodal LLMs:\n\n1. **Unified Embedding Decoder Architecture** (also called \"decoder-only\")\n2. **Cross-Modality Attention Architecture** (also called \"cross-attention based\")\n\nLet us explore each.\n\n---\n\n# Approach 1: Unified Embedding Decoder Architecture\n\nThis is the simpler and more common approach. 
The idea is straightforward: concatenate image tokens with text tokens and feed them all into a standard LLM.\n\n\n[Embed: https://link.excalidraw.com/readonly/7aiK6kJCRZHeHUjGI6c9](https://link.excalidraw.com/readonly/7aiK6kJCRZHeHUjGI6c9)\n\n\n### How It Works\n\n1. The image goes through a vision encoder (like CLIP) to produce patch embeddings\n2. A projector maps these embeddings to the LLM's dimension\n3. Text is tokenized and embedded normally\n4. Image embeddings and text embeddings are concatenated into one sequence\n5. The combined sequence is fed into a standard decoder-only LLM\n6. The LLM processes everything with self-attention and generates output\n\n### Why This Works\n\nThe LLM's self-attention mechanism can attend to both image and text tokens. When predicting the next token, the model can \"look at\" relevant parts of the image.\n\nFor example, if you ask \"What color is the car?\", the attention mechanism learns to focus on the image patches containing the car when generating the answer.\n\n### Advantages\n\n- **Simple to implement:** No changes to the LLM architecture needed\n- **Leverages existing LLMs:** You can use any pretrained decoder model\n- **Unified training:** All modalities trained together with the same objective\n\n### Disadvantages\n\n- **Context length consumption:** Image tokens use up your context window. 
196 patches means 196 tokens just for one image.\n- **Computational cost:** Self-attention is O(n^2), so adding image tokens increases compute.\n\n### Models Using This Approach\n\n- **LLaVA** (Large Language and Vision Assistant)\n- **Molmo** (by Allen AI)\n- **Qwen-VL**\n- **Pixtral** (by Mistral)\n- **Fuyu** (by Adept)\n\n---\n\n# Approach 2: Cross-Modality Attention Architecture\n\nThe second approach keeps image and text more separate, connecting them through cross-attention layers.\n\n\n[Embed: https://link.excalidraw.com/readonly/dhzYYFC4GsWQceW1bDxG](https://link.excalidraw.com/readonly/dhzYYFC4GsWQceW1bDxG)\n\n\n### What is Cross-Attention?\n\nIn self-attention, each token attends to all other tokens in the same sequence. In cross-attention, tokens from one sequence attend to tokens from a different sequence.\n\nHere is the difference:\n\n\n```shell\nSelf-Attention (within text):\n Text Token: \"What\"\n Attends to: \"What\", \"color\", \"is\", \"the\", \"car\"\n\nCross-Attention (text attends to image):\n Text Token: \"car\"\n Attends to: [Image Patch 45], [Image Patch 46], [Image Patch 72]...\n (the patches that contain the car)\n```\n\n\nCross-attention was part of the original Transformer architecture for machine translation. The decoder (generating output) would cross-attend to the encoder (processing input). The same idea applies here, but instead of translating between languages, we are translating between vision and language.\n\n### How It Works\n\n1. Image patches go through the vision encoder and projector\n2. Text is tokenized and embedded normally\n3. Special cross-attention layers are added to the LLM\n4. In these layers, text tokens (queries) attend to image tokens (keys and values)\n5. 
This allows text generation to be informed by visual content\n\n### Advantages\n\n- **Context efficiency:** Image tokens do not consume the text context window\n- **Preserves text performance:** The base LLM weights can stay frozen, maintaining text-only capabilities\n- **Scales better:** Adding more images does not quadratically increase attention cost\n\n### Disadvantages\n\n- **Requires architecture changes:** You need to modify the LLM to add cross-attention layers\n- **More complex training:** Different components may require different training strategies\n\n### Models Using This Approach\n\n- **Llama 3.2 Vision** (11B and 90B)\n- **Flamingo** (by DeepMind)\n- **NVLM-X** (by NVIDIA)\n\n---\n\n# Comparing the Two Approaches\n\n\n| Aspect | Unified Decoder | Cross-Attention |\n| --- | --- | --- |\n| **Architecture Changes** | None (standard LLM) | Requires new cross-attention layers |\n| **Context Usage** | Images consume context | Images separate from context |\n| **Text Performance** | May degrade slightly | Preserved if LLM is frozen |\n| **Implementation** | Simpler | More complex |\n| **High-Resolution Images** | Expensive (many tokens) | More efficient |\n| **Popular Models** | LLaVA, Qwen-VL, Pixtral | Llama 3.2, Flamingo |\n\n\nNVIDIA's NVLM paper actually tested both approaches and a hybrid (NVLM-H). Their findings:\n\n- **Unified Decoder (NVLM-D):** Better at OCR tasks requiring detailed text extraction\n- **Cross-Attention (NVLM-X):** More computationally efficient for high-resolution images\n- **Hybrid (NVLM-H):** Combines benefits of both\n\n---\n\n# Training a Multimodal LLM\n\nTraining typically happens in stages:\n\n### Stage 1: Pretraining (Alignment)\n\nThe goal is to align visual and text representations. 
Only the projector is trained while both the vision encoder and LLM stay frozen.\n\n\n```shell\nFrozen: Vision Encoder, LLM\nTrained: Projector only\n\nData: Large dataset of image-caption pairs\nGoal: Learn to map image features to LLM's embedding space\n```\n\n\nThis is a \"warm-up\" phase. The projector learns to translate visual features into a language the LLM already understands.\n\n### Stage 2: Instruction Finetuning\n\nNow we unfreeze the LLM and train on visual question-answering tasks.\n\n\n```shell\nFrozen: Vision Encoder (usually)\nTrained: Projector, LLM\n\nData: Visual QA, image captioning, visual reasoning\nGoal: Teach the model to follow visual instructions\n```\n\n\nThis is similar to how text-only LLMs are instruction-tuned. The model learns to answer questions about images, describe visual content, and reason about what it sees.\n\n### Stage 3: RLHF (Optional)\n\nSome models add reinforcement learning from human feedback to improve response quality and safety.\n\n\n[Embed: https://link.excalidraw.com/readonly/1DxObFHKnTrKYXBBhNBs](https://link.excalidraw.com/readonly/1DxObFHKnTrKYXBBhNBs)\n\n\n---\n\n# A Simpler Alternative: Patch-Only Models\n\nSome models skip the pretrained vision encoder entirely. Fuyu by Adept is a notable example.\n\nInstead of using CLIP:\n\n1. Split the image into patches\n2. Flatten each patch into a vector\n3. Project directly into the LLM's embedding space\n\n\n```shell\nImage Patches --> Linear Projection --> LLM\n\nNo separate vision encoder!\n```\n\n\nThe LLM itself learns to understand images from scratch during training. 
This simplifies the architecture but requires more training data and compute.\n\n---\n\n# Handling Different Image Resolutions\n\nA practical challenge: images come in different sizes, but models are typically trained on fixed resolutions (like 224x224 or 336x336).\n\nModern multimodal LLMs handle this in several ways:\n\n**Dynamic Resolution (Qwen2-VL, Pixtral):** Accept images at their native resolution, creating more or fewer patches as needed.\n\n**Image Tiling:** Split large images into multiple tiles, process each separately, and combine the results.\n\n**Thumbnail + Details (NVLM-H):** Use a low-resolution thumbnail for global understanding, with high-resolution patches for fine details.\n\n---\n\n# What Multimodal LLMs Can Do\n\nWith vision and language combined, these models can:\n\n1. **Image Captioning:** Describe what is in an image\n2. **Visual Question Answering:** Answer specific questions about images\n3. **Document Understanding:** Extract information from PDFs, receipts, forms\n4. **Code Screenshot Analysis:** Debug UI issues from screenshots\n5. **Chart and Graph Interpretation:** Answer questions about visualized data\n6. **Meme and Comic Understanding:** Explain visual humor\n7. **Medical Image Analysis:** Assist in analyzing X-rays, scans (specialized models)\n8. 
**Diagram Explanation:** Describe flowcharts, architecture diagrams\n\n---\n\n# Current Limitations\n\nDespite rapid progress, multimodal LLMs still struggle with:\n\n- **Counting:** \"How many people are in this image?\" often produces wrong answers\n- **Spatial reasoning:** Precise location questions (\"What is to the left of X?\")\n- **Small text:** Reading tiny text in images remains challenging\n- **Hallucination:** Making up details that are not present in the image\n- **Complex reasoning:** Multi-step visual reasoning can fail\n\nThese limitations are active areas of research.\n\n---\n\n# Key Models to Know\n\n\n| Model | Approach | Notable Features |\n| --- | --- | --- |\n| GPT-4V | Unknown (proprietary) | First major commercial multimodal LLM |\n| Claude 3.5 | Unknown (proprietary) | Strong document and code understanding |\n| Llama 3.2 Vision | Cross-attention | Open weights, 11B and 90B sizes |\n| Qwen2-VL | Unified decoder | Native resolution support |\n| LLaVA | Unified decoder | Open source pioneer |\n| Pixtral | Unified decoder | Native resolution, by Mistral |\n| Gemini | Unknown | Natively multimodal from the start |\n\n\n---\n\n# Summary\n\nMultimodal LLMs extend language models to understand images and other modalities. The core challenge is bridging the gap between continuous visual data and discrete text tokens.\n\n#### **Key components:**\n\n1. **Vision Encoder:** Converts image patches to embeddings (often CLIP)\n2. **Projector:** Maps visual embeddings to LLM dimension\n3. **LLM:** Processes combined visual and text tokens\n\n#### **Two main approaches:**\n\n1. **Unified Decoder:** Concatenate image and text tokens, feed to standard LLM\n2. **Cross-Attention:** Add cross-attention layers where text attends to image tokens\n\n#### **Training stages:**\n\n1. Pretrain projector (alignment)\n2. Instruction finetune (visual QA)\n3. Optional RLHF\n\nThe field is advancing rapidly. 
Models are getting better at understanding complex images, handling multiple images, and reasoning about visual content. As these models improve, the line between \"seeing\" and \"understanding\" continues to blur.\n\n---\n\n# Further Reading\n\n- Understanding Multimodal LLMs, by Sebastian Raschka\n- An Image is Worth 16x16 Words (ViT paper)\n- CLIP: Learning Transferable Visual Models\n- LLaVA: Visual Instruction Tuning\n- Llama 3 Herd of Models (includes multimodal details)\n- NVLM: Open Frontier-Class Multimodal LLMs","pageType":"ai-engineering"}


Ashish Pratap Singh
