
Vision Models and Image Understanding

Last Updated: March 15, 2026

Ashish Pratap Singh

Vision models allow machines to interpret and reason about visual information such as images, screenshots, diagrams, and video frames. Instead of working only with text, these models can detect objects, read text inside images, understand layouts, and answer questions about visual content.

In modern AI applications, vision capabilities power features like document analysis, product recognition, medical image interpretation, UI understanding, and multimodal assistants.

In this chapter, you will learn how vision models work, what tasks they enable, and how to integrate image understanding into real-world AI systems.

How Vision-Language Models Work

Let's develop a mental model of what happens when you send an image to an LLM. You do not need to understand the deep math, but knowing the general flow will help you make better engineering decisions.

Traditional LLMs process a sequence of text tokens. Vision-language models (VLMs) extend this by adding a visual encoder that converts an image into a sequence of "visual tokens" that the language model can reason about alongside text tokens.

Here is what happens step by step:

  1. Image encoding. The image is resized and split into patches (small square regions, typically 14x14 or 16x16 pixels). A visual encoder, usually a Vision Transformer (ViT) or a similar architecture like SigLIP, processes these patches and produces a set of embedding vectors. Each vector represents a region of the image.
  2. Projection. These visual embeddings are projected into the same dimensional space as text token embeddings. This is the key trick. Once visual information lives in the same embedding space as text, the language model can attend to both using the same attention mechanism.
  3. Joint reasoning. The language model receives a combined sequence of visual tokens and text tokens. It processes them together using standard transformer attention. This means the model can relate specific parts of the image to specific parts of your text prompt, like connecting the word "total" in your prompt to the bottom of a receipt image.
  4. Text generation. The output is generated as normal text tokens. The model can describe what it sees, answer questions about the image, or extract structured data, all based on the combined visual and textual context.

The important takeaway for engineering: images are converted to tokens, just like text. More image detail means more tokens, which means higher cost and higher latency. A high-resolution image might consume 1,000+ tokens, while a low-resolution thumbnail might use only 85. This directly affects your API bill.

Sending Images to Vision APIs

Now let's get practical. Every major LLM provider supports image input, but the syntax and capabilities differ.

OpenAI (GPT-4o)

OpenAI accepts images in two ways: as a URL or as a base64-encoded string. The image goes inside the content array of a message, alongside text.

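Here is a minimal sketch using the official `openai` Python SDK. The model name (`gpt-4o`), the prompt, and the example URL are placeholders; adapt them to your setup.

```python
# Sketch: image input by URL with the openai SDK. The model name ("gpt-4o"),
# prompt, and URL are placeholders; OPENAI_API_KEY must be set for the call.

def build_image_message(prompt: str, image_url: str, detail: str = "auto") -> dict:
    """A user message whose content mixes a text block and an image block."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
        ],
    }

def ask_about_image(prompt: str, image_url: str) -> str:
    from openai import OpenAI  # imported lazily so the builder works offline
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[build_image_message(prompt, image_url)],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask_about_image("What is in this image?", "https://example.com/photo.jpg"))
```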

Notice that the content field is no longer a simple string. It is an array of content blocks, each with a type field. You can mix text and images freely within a single message. You can even send multiple images in the same message.

For local images, you base64-encode them:

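A sketch of the encoding step: read the file, base64-encode it, and wrap it in a `data:` URL that goes into the same `image_url` field. The MIME-type handling and model name are illustrative.

```python
# Sketch: base64-encoding a local image into a data URL for the image_url field.
import base64
from pathlib import Path

def encode_image_as_data_url(path: str) -> str:
    """Read a local image and return a data: URL the API accepts in image_url."""
    suffix = Path(path).suffix.lstrip(".").lower() or "png"
    if suffix == "jpg":
        suffix = "jpeg"  # the MIME type is image/jpeg
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:image/{suffix};base64,{encoded}"

def ask_about_local_image(prompt: str, path: str) -> str:
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": encode_image_as_data_url(path)}},
            ],
        }],
    )
    return response.choices[0].message.content
```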

The base64 approach is what you will use in production. Users upload images to your server, you encode them, and send them to the API.

Resolution, Tokens, and Cost

Vision models do not process your raw image at full resolution. They resize it, and the resize strategy directly determines how many tokens the image consumes, which determines your cost.

How OpenAI Handles Image Resolution

OpenAI offers a detail parameter with three levels:

  • `low`: The image is resized to 512x512 pixels. Always uses 85 tokens. Cheapest option.
  • `high`: The image is first resized to fit within a 2048x2048 square, then scaled so its shortest side is 768 pixels, and finally split into 512x512 tiles. Each tile costs 170 tokens, plus a base cost of 85 tokens.
  • `auto` (default): The API chooses based on the image size.
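In code, `detail` is just another field on the image block. The task-to-detail mapping below is an illustrative heuristic of our own, not part of the API:

```python
# Sketch: choosing the detail level per request. detail_for_task is our own
# heuristic; only "low" | "high" | "auto" are valid values for the API field.

def image_block(url: str, detail: str = "auto") -> dict:
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"invalid detail: {detail}")
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}

def detail_for_task(task: str) -> str:
    """Coarse classification tolerates low detail; reading text needs high."""
    needs_high = {"ocr", "receipt", "fine_print", "diagram"}
    return "high" if task in needs_high else "low"
```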

Here is a breakdown of how token cost scales with resolution in high-detail mode:

Image Size | Tiles      | Token Cost       | Relative Cost
512x512    | 1          | 255 (170 + 85)   | 1x
1024x1024  | 4          | 765 (680 + 85)   | 3x
2048x1024  | 6          | 1105 (1020 + 85) | ~4.3x
4096x2048  | 6 (capped) | 1105 (1020 + 85) | ~4.3x

Note that the cost plateaus: once the resize rules have shrunk the shortest side to 768 pixels, a very large image costs the same as a moderately large one. What you lose with the oversized original is detail, not money.

The cost difference between low and high detail is significant. For a task like "Is this a cat or a dog?", low detail is more than enough. For reading small text on a receipt, you need high detail. Choosing the right level per use case is one of the easiest optimizations you can make.

Cost Estimation Helper

Here is a utility function that estimates the token cost of an image before you send it:

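One possible implementation, following OpenAI's published sizing rules (flat 85 tokens for low detail; for high detail, fit within 2048x2048, shrink the shortest side to 768, then charge 170 tokens per 512x512 tile plus an 85-token base). Treat the exact constants as subject to change by the provider.

```python
# Sketch: estimate image token cost before sending. Constants follow OpenAI's
# documented high-detail tiling rules and may change; verify against the docs.
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    if detail == "low":
        return 85  # flat cost regardless of size

    # Step 1: scale down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Step 2: scale down so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # Step 3: count 512x512 tiles, 170 tokens each, plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```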

Practical Use Cases

Vision models can do far more than describe what is in a photo. Here are the four categories of tasks that come up most often in production applications, with working code for each.

1. Image Description and Captioning

The most straightforward use case. You send an image, the model describes it. This is useful for accessibility (alt text generation), content moderation, and cataloging.

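A sketch for alt-text generation. Low detail keeps this cheap (a flat 85 tokens per image); the prompt wording and the `max_tokens` cap are illustrative choices.

```python
# Sketch: one-sentence alt text. Low detail is enough for a coarse description.

def build_caption_request(image_url: str) -> dict:
    return {
        "model": "gpt-4o",
        "max_tokens": 100,  # a short caption does not need more
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Write one concise sentence of alt text for this image."},
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": "low"}},
            ],
        }],
    }

def caption(image_url: str) -> str:
    from openai import OpenAI
    response = OpenAI().chat.completions.create(**build_caption_request(image_url))
    return response.choices[0].message.content.strip()
```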

2. OCR and Text Extraction

Vision models are surprisingly good at reading text from images, often matching or exceeding traditional OCR tools like Tesseract. They handle messy handwriting, rotated text, unusual fonts, and text embedded in complex backgrounds.

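A sketch of a transcription call. High detail matters here, and the explicit "[illegible]" instruction discourages the model from guessing at text it cannot read. The prompt wording is illustrative.

```python
# Sketch: OCR-style transcription. High detail for small text; the prompt asks
# the model to mark unreadable regions rather than invent them.

OCR_PROMPT = (
    "Transcribe all text in this image exactly as written, preserving line "
    "breaks. If a region is unreadable, write [illegible] instead of guessing."
)

def build_ocr_request(image_url: str) -> dict:
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": OCR_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": "high"}},
            ],
        }],
    }

def transcribe(image_url: str) -> str:
    from openai import OpenAI
    response = OpenAI().chat.completions.create(**build_ocr_request(image_url))
    return response.choices[0].message.content
```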

3. Structured Data Extraction

This is where vision models become really powerful. Instead of asking for a free-form description, you define a schema and ask the model to fill it: the vendor, date, and line items from an invoice, the fields of a form, the rows of a table. The output is machine-readable JSON that flows directly into your application.

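A sketch of invoice extraction. The field list is illustrative; `response_format={"type": "json_object"}` asks the model for syntactically valid JSON, but note the prompt must still mention JSON explicitly for that mode to be accepted.

```python
# Sketch: structured extraction from an invoice image. Field names are
# illustrative; json_object mode guarantees valid JSON, not a valid schema.
import json

def build_invoice_request(image_url: str) -> dict:
    prompt = (
        "Extract the following fields from this invoice as JSON: "
        "vendor (string), date (YYYY-MM-DD), total (number), "
        "line_items (list of {description, amount}). "
        "Use null for anything you cannot read."
    )
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": image_url, "detail": "high"}},
            ],
        }],
    }

def extract_invoice(image_url: str) -> dict:
    from openai import OpenAI
    response = OpenAI().chat.completions.create(**build_invoice_request(image_url))
    return json.loads(response.choices[0].message.content)
```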

4. Diagram and UI Interpretation

This is one of the more surprising capabilities. Vision models can interpret architecture diagrams, flowcharts, wireframes, and UI screenshots. You can send a screenshot of an error dialog and ask the model to explain the error and suggest fixes.

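A sketch for the error-dialog case; the prompt wording is illustrative, and the same request shape works for architecture diagrams or wireframes with a different prompt.

```python
# Sketch: explain an error dialog from a screenshot. High detail so the model
# can read the dialog text; the prompt is an illustrative example.

def build_ui_debug_request(screenshot_url: str) -> dict:
    prompt = (
        "This is a screenshot of an error dialog. Explain what the error "
        "means, the most likely cause, and two concrete fixes to try first."
    )
    return {
        "model": "gpt-4o",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": screenshot_url, "detail": "high"}},
            ],
        }],
    }
```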

What Vision Models Get Wrong

Vision models are impressive, but they have consistent failure modes that you need to know about before shipping anything to production. Understanding these limitations will help you build appropriate guardrails.

Counting

Ask a vision model "How many apples are in this image?" when there are 13 apples, and you might get 11, 14, or 12. Vision models are unreliable at counting objects, especially when there are more than 5-6 items or when objects overlap. If your application requires accurate counts, do not trust the model's answer without a fallback mechanism.

Spatial Reasoning

Models struggle with precise spatial relationships. "Is the red box to the left or right of the blue box?" works for simple layouts. "Which of these 10 items is closest to the top-right corner?" often fails. The model has a rough sense of spatial layout, but not pixel-level precision.

Small Text

Even in high-detail mode, very small text (fine print at the bottom of a document, tiny labels on a complex diagram) can be missed or misread. If you need to read small text, crop and enlarge the relevant region before sending it to the model.

Hallucination

This is the most dangerous failure mode. The model might confidently read text that is not there, especially with low-quality images. A blurry receipt might produce a plausible-looking but completely wrong total. A partially obscured serial number might be "completed" by the model with invented digits.

Math from Images

If an image contains a math equation or a table of numbers, the model might read the numbers correctly but calculate incorrectly when asked to add them up. Always do arithmetic in your application code, not in the model's response.

Defensive Coding for Vision

Given these limitations, here are practical patterns for production use:

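One such pattern, sketched below: ask for a model-reported confidence with every extraction and route anything below a threshold to human review. The field names, confidence levels, and threshold are our own conventions, not API features.

```python
# Sketch: defensive extraction. The JSON shape and confidence levels are
# conventions we define in the prompt, then enforce in application code.

EXTRACTION_PROMPT = (
    "Extract the serial number from this image. Respond as JSON: "
    '{"serial_number": string or null, '
    '"confidence": "high" | "medium" | "low", '
    '"notes": string}. '
    "If any character is unclear, lower the confidence and explain in notes."
)

_CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}

def needs_human_review(result: dict, minimum: str = "high") -> bool:
    """Treat missing or unrecognized confidence values as low."""
    reported = _CONFIDENCE_ORDER.get(result.get("confidence"), 0)
    return reported < _CONFIDENCE_ORDER[minimum]
```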

By including a confidence field and asking the model to flag uncertain extractions, you give your application a signal to request human review when needed. This is much better than blindly trusting every output.

Putting It All Together: Receipt Processor

Let's combine everything into a complete receipt processing pipeline. This system takes a photo of a receipt and extracts structured data, handling poor image quality and validating the results.

Step 1: Define the Schema


A few design decisions are worth noting. The model_validator cross-checks the total against the sum of the line items. It does not reject the extraction; it flags the inconsistency. This is the right approach because the model might have read the total correctly but missed or misread a line item, and rejecting the whole extraction would lose useful data. The image_quality_notes field lets the model communicate uncertainty, which your application can use to decide whether to request human review.

Step 2: Build the Extractor

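A sketch of the extraction call. It requests JSON output and parses it; in the full pipeline the parsed dict would then be validated against the receipt schema from Step 1. The prompt wording and key names are illustrative.

```python
# Sketch: the extraction step. json_object mode guarantees parseable JSON;
# schema validation happens downstream.
import json

RECEIPT_PROMPT = (
    "Extract this receipt as JSON with keys: merchant, date (YYYY-MM-DD), "
    "line_items (list of {description, amount}), total, image_quality_notes. "
    "Use null for anything unreadable and describe any quality problems in "
    "image_quality_notes."
)

def build_receipt_request(image_data_url: str) -> dict:
    return {
        "model": "gpt-4o",
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": RECEIPT_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": image_data_url, "detail": "high"}},
            ],
        }],
    }

def extract_receipt_raw(image_data_url: str) -> dict:
    from openai import OpenAI
    response = OpenAI().chat.completions.create(**build_receipt_request(image_data_url))
    return json.loads(response.choices[0].message.content)
```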

Step 3: Add Preprocessing and Error Handling

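A sketch of cheap input checks plus a local downscale pass (via Pillow, assumed installed) so oversized photos do not burn tokens. The size limit and allowed formats are illustrative; check your provider's actual limits.

```python
# Sketch: validate before sending, and shrink locally what the API would
# shrink anyway. Limits and formats are illustrative assumptions.
from pathlib import Path

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}
MAX_BYTES = 20 * 1024 * 1024  # provider limits vary; verify in the docs

def validate_image_file(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks sendable."""
    p = Path(path)
    if not p.exists():
        return [f"file not found: {path}"]
    problems = []
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {p.suffix}")
    if p.stat().st_size > MAX_BYTES:
        problems.append("file larger than 20 MB")
    return problems

def downscale_for_high_detail(path: str, out_path: str, max_long_side: int = 2048) -> None:
    """Anything above 2048 px gets resized by the API anyway, so shrink it here."""
    from PIL import Image  # assumed dependency (pip install pillow)
    with Image.open(path) as img:
        img.thumbnail((max_long_side, max_long_side))  # preserves aspect ratio
        img.save(out_path)
```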

Step 4: Run It

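A sketch of the top-level driver. The validate, encode, and extract helpers are assumed to come from the earlier steps; here they are injected as callables so the error-handling flow is visible on its own.

```python
# Sketch: orchestration with graceful failure. Each stage either succeeds or
# returns a structured error the caller can act on; nothing raises upward.

def process_receipt(path: str, validate, encode, extract) -> dict:
    problems = validate(path)
    if problems:
        return {"ok": False, "stage": "validation", "errors": problems}
    try:
        data_url = encode(path)
        receipt = extract(data_url)
    except Exception as exc:  # surface API/parse failures instead of crashing
        return {"ok": False, "stage": "extraction", "errors": [str(exc)]}
    return {"ok": True, "receipt": receipt}
```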

This pipeline handles the full lifecycle: input validation, image preprocessing to optimize token usage, structured extraction with schema validation, cross-field consistency checks, and graceful error reporting. It is production-ready in the sense that it fails gracefully and communicates uncertainty, two properties that most prototypes lack.
