Last Updated: March 14, 2026
Embeddings are the foundation of many modern AI systems, from semantic search and recommendation engines to retrieval-augmented generation (RAG). But not all embedding models are the same. Different models are optimized for different tasks, languages, and performance constraints.
Choosing the right embedding model can have a significant impact on the quality of your application. A model that works well for short search queries may perform poorly on long documents. A model optimized for English text may struggle with multilingual data. Some models prioritize accuracy, while others are designed to be smaller, faster, and cheaper to run.
In this chapter, we will explore how to choose the right embedding model for your use case.
Most searches for "best embedding model" lead to the MTEB leaderboard. MTEB stands for Massive Text Embedding Benchmark, and it is the closest thing the embedding world has to a standardized test. Understanding how to read it (and where it falls short) is the first skill you need.
MTEB evaluates embedding models across multiple task categories, each testing a different capability: retrieval, classification, clustering, reranking, pair classification, semantic textual similarity (STS), summarization, and bitext mining.
The most important category for AI engineering work is Retrieval. This is what matters when you are building RAG systems, search engines, or recommendation systems. It measures how well a model can find relevant documents given a query. The other categories matter too, but if you are choosing an embedding model for a search or RAG application, retrieval performance should be your primary filter.
Here is how to interpret a typical MTEB leaderboard entry.
Do not sort by overall score and pick the top model. The overall score averages across all task categories, including ones you probably do not care about. A model might rank first overall because it dominates at clustering and classification, while being mediocre at retrieval.
Instead, follow this process: sort by the Retrieval category score, discard models whose dimensions, license, or cost do not fit your constraints, and shortlist the handful that remain for testing on your own data.
Here is where things get tricky. MTEB is useful, but it has blind spots that can lead you to a bad choice if you rely on it exclusively.
MTEB benchmarks use general-purpose datasets like MS MARCO (web search queries) and NQ (Wikipedia questions). If your data is medical records, legal contracts, or source code, a model that scores well on MTEB might perform poorly on your specific domain. There is no substitute for testing on your own data.
MTEB does not measure how fast a model generates embeddings, nor the downstream cost of vector size. A model that outputs 3072-dimensional vectors is slower to search and more expensive to store than one that outputs 384 dimensions, but the leaderboard does not capture that.
Some top-performing models are API-based and cost money per token. Others are open-source and free to run (if you have the hardware). MTEB treats them equally.
Most MTEB benchmarks are English-focused. If your application serves multiple languages, you need to look at MTEB's multilingual subsets separately.
The bottom line: use MTEB as a starting point to narrow your options from hundreds to five or six candidates. Then benchmark those candidates on your own data. We will cover exactly how to do that later in this chapter.
Every embedding model outputs vectors of a fixed size, and this number matters more than most people realize. It affects storage costs, search speed, memory usage, and retrieval quality. Here is what the common dimension sizes mean in practice.
Those per-vector sizes look tiny, but they add up fast. If you have 10 million documents, here is what your vector storage looks like:
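As a back-of-the-envelope calculation (assuming 4-byte float32 values and ignoring any index overhead, which real vector databases add on top):

```python
BYTES_PER_FLOAT32 = 4
NUM_DOCS = 10_000_000

# Raw vector storage for 10 million documents at common dimension sizes.
for dims in (384, 768, 1536, 3072):
    gb = dims * BYTES_PER_FLOAT32 * NUM_DOCS / 1e9
    print(f"{dims} dims: {gb:.1f} GB")

# 384 dims: 15.4 GB
# 768 dims: 30.7 GB
# 1536 dims: 61.4 GB
# 3072 dims: 122.9 GB
```

Since approximate-nearest-neighbor indexes typically keep vectors in RAM, these numbers translate fairly directly into memory requirements as well.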
Going from 384 to 3072 dimensions means 8x more storage and 8x more RAM. For a startup with 100,000 documents, the difference is negligible. For a company with 100 million documents, the difference is thousands of dollars per month in infrastructure costs.
More dimensions generally mean better retrieval quality, but the relationship is not linear. The jump from 384 to 768 dimensions gives a significant quality boost. The jump from 1536 to 3072 gives a much smaller improvement, often just 1-2% on benchmarks.
These quality improvements are approximate and vary by dataset, but the pattern holds: diminishing returns as dimensions increase. For most applications, 768 or 1024 dimensions hit the sweet spot between quality and efficiency.
Some newer models support a technique called Matryoshka Representation Learning (MRL). The idea is clever: the model is trained so that the first N dimensions of a larger embedding are themselves a valid, useful embedding. You can take a 3072-dimensional vector, truncate it to 1024 dimensions, and still get good retrieval quality.
OpenAI's text-embedding-3-large supports this. You can request any dimension size up to 3072:
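A sketch using the OpenAI Python SDK; the `dimensions` parameter is part of the official embeddings endpoint, and the call requires an `OPENAI_API_KEY`. The `truncate_embedding` helper shows what MRL truncation does under the hood, in case you ever need to shorten already-stored vectors yourself:

```python
import numpy as np

def truncate_embedding(vec, dims):
    """MRL-style truncation: keep the first `dims` values, then
    re-normalize to unit length so cosine similarity still behaves."""
    v = np.asarray(vec, dtype=np.float64)[:dims]
    return v / np.linalg.norm(v)

def embed(text, dims=1024):
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dims,  # any value up to 3072
    )
    return resp.data[0].embedding
```

Calling `embed("some text", dims=1024)` returns a 1024-dimensional vector directly from the API; `truncate_embedding` produces the same effect locally from a full 3072-dimensional vector.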
This gives you flexibility. You can start with 1024 dimensions to keep costs low, and if you need better quality later, re-embed with 3072 dimensions without switching models.
If you do not want to manage GPU infrastructure, API-based models are the simplest path. You send text, you get vectors back. Here are the three main providers worth considering.
OpenAI offers two embedding models: text-embedding-3-small and text-embedding-3-large.
OpenAI's biggest advantage is simplicity. The API is clean, the documentation is solid, and you are probably already using OpenAI for your LLM calls. The downside is that your embeddings are tied to their infrastructure, and you cannot run these models locally.
Cohere's embed-v3 models, available in English and multilingual variants, are specifically designed for search and retrieval.
What makes Cohere unique is the input_type parameter. You tell the model whether you are embedding a search query or a document, and it adjusts the embedding accordingly:
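A minimal sketch using Cohere's classic Python client (model name embed-english-v3.0; the client reads your API key from the environment). The valid `input_type` values below are the ones Cohere documents for v3 models:

```python
VALID_INPUT_TYPES = {"search_query", "search_document", "classification", "clustering"}

def cohere_embed(texts, input_type):
    """Embed texts asymmetrically: queries and documents get different treatment."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"input_type must be one of {sorted(VALID_INPUT_TYPES)}")
    import cohere  # pip install cohere
    co = cohere.Client()
    resp = co.embed(texts=texts, model="embed-english-v3.0", input_type=input_type)
    return resp.embeddings
```

At index time you call `cohere_embed(docs, input_type="search_document")`; at query time, `cohere_embed([query], input_type="search_query")`. That split is the whole trick.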
This asymmetric embedding approach often produces better search results because queries and documents are fundamentally different. Queries are short and intent-driven. Documents are long and information-dense. Treating them differently makes sense.
Voyage AI is the specialist. They build domain-specific embedding models for code, law, and finance that outperform general-purpose ones in their target domain.
If you are building a code search engine or a legal document retrieval system, Voyage's domain-specific models are worth testing. They often beat larger, more expensive general-purpose models on in-domain tasks.
API models are convenient, but they come with trade-offs: ongoing costs, network latency, data privacy concerns, and vendor lock-in. Open-source models solve all of these. You download the model, run it on your own hardware, and pay nothing per request.
The families that consistently appear at the top of the MTEB leaderboard include BAAI's BGE models, Microsoft's E5 series, and, at the larger end, 7B-parameter instruction-tuned embedders built on LLM backbones.
A clear pattern emerges: smaller models (137M-568M parameters) can run on a CPU and still deliver good results. The 7B-parameter models need a GPU but rival API-based models in quality.
The sentence-transformers library makes running these models straightforward.
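A minimal sketch, assuming BAAI/bge-large-en-v1.5 as the model (any sentence-transformers checkpoint works the same way) and `pip install sentence-transformers`; the model weights download on first use:

```python
def embed_texts(texts, model_name="BAAI/bge-large-en-v1.5"):
    """Return unit-normalized embeddings for a list of strings."""
    if not texts:
        raise ValueError("texts must be a non-empty list of strings")
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    # normalize_embeddings=True gives unit-length vectors, so cosine
    # similarity reduces to a plain dot product.
    return model.encode(texts, normalize_embeddings=True)
```

For this model, `embed_texts(["The cat sat on the mat."])` should return a numpy array of shape (1, 1024), since bge-large produces 1024-dimensional vectors.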
A few things to note. The normalize_embeddings=True parameter ensures all vectors have unit length, which makes cosine similarity equivalent to a simple dot product. This is a small optimization that matters at scale. Also, BGE models expect a specific instruction prefix on queries (but not on documents): "Represent this sentence for searching relevant passages: ".
This is similar to Cohere's asymmetric approach, just implemented through text prefixes rather than a separate parameter.
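The normalization claim is easy to verify with a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Cosine similarity of the raw vectors...
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals the plain dot product of their unit-normalized versions,
# which is what normalize_embeddings=True produces.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

print(np.isclose(cosine, dot))  # True
```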
The decision often comes down to three factors. If data privacy is non-negotiable (healthcare, finance, government), open-source models that run on your infrastructure are the only option. If you are processing millions of documents, the cost savings of open-source models compound quickly. A model that costs $0.10 per million tokens adds up to thousands of dollars when you are embedding 100 million tokens. Running the same model locally costs only the electricity.
On the other hand, if you need the absolute highest quality and your volume is moderate (under a few million embeddings), API models are hard to beat. You get strong performance with zero infrastructure overhead.
Everything we have discussed so far deals with text. But what if your application involves images, or a mix of text and images?
Text-only embeddings work when both your queries and your documents are text. But some applications need to bridge modalities: searching a photo library with text queries, finding a product from a picture of it, or retrieving the diagram that matches a written description.
For these use cases, you need a model that maps both text and images into the same embedding space. When a text query and a relevant image end up near each other in this shared space, you can do cross-modal retrieval.
OpenAI's CLIP (Contrastive Language-Image Pre-training) was the model that made multimodal embeddings mainstream. It was trained on 400 million image-text pairs from the internet, learning to place matching text and images close together in a shared 512-dimensional space.
Today, several multimodal embedding options exist, from open-source successors to CLIP such as OpenCLIP and SigLIP to commercial multimodal embedding APIs.
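As a sketch of cross-modal scoring with the original CLIP via Hugging Face transformers (the checkpoint name openai/clip-vit-base-patch32 is one public option; requires `pip install transformers pillow torch`):

```python
def clip_scores(texts, image_path, model_name="openai/clip-vit-base-patch32"):
    """Score each caption against one image in CLIP's shared embedding space."""
    if not texts:
        raise ValueError("texts must be a non-empty list of strings")
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=texts, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    # logits_per_image holds the image-text similarity for each caption.
    return model(**inputs).logits_per_image[0].tolist()
```

Calling `clip_scores(["a dog on a beach", "a spreadsheet"], "photo.jpg")` (with photo.jpg being whatever image you have on disk) returns one score per caption, and the matching caption should score higher.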
For most AI engineering work, text-only embeddings are what you need. RAG systems, semantic search over documents, chatbot knowledge bases: these are all text-to-text. Multimodal models come with trade-offs: they tend to score lower on text-only retrieval tasks than specialized text embedding models, and they are more expensive to run.
Use multimodal embeddings only when you actually need cross-modal search. If your documents are pure text and your queries are pure text, stick with a text-only model. You will get better results.
A general-purpose model trained on web text, Wikipedia, and news articles will struggle with legal jargon, medical terminology, or source code. The words mean different things. "Injection" in a medical context has nothing to do with "SQL injection." "Cell" in biology is unrelated to "cell" in a spreadsheet. General models learned from broad data, and they will conflate these meanings.
The failure mode is subtle. Your search system will return results, they just will not be the right results. A developer searching for "memory leak in connection pool" in a codebase might get results about "memory" in a general sense instead of the specific pattern of unreleased database connections. The embeddings are close enough to return something, but not precise enough to return the right thing.
You have three paths, listed from least effort to most:
Option 1: Check if someone has already trained a model for your domain. Voyage AI has models for code, law, and finance. Hugging Face hosts medical models like BiomedBERT (formerly PubMedBERT). If a domain-specific model exists, it will almost certainly outperform a general-purpose model on your data.
Option 2 is fine-tuning. You take a strong general-purpose model like BGE-large or E5-large and continue training it on your domain data. This requires labeled pairs (query, relevant document), and typically 1,000 to 10,000 examples to see meaningful improvement. The sentence-transformers library makes this relatively straightforward.
Option 3, training from scratch, is almost never the right answer for most teams. It requires millions of training examples and significant compute. Unless you are working in an extremely niche domain with its own language (think molecular biology notation or ancient language scripts), fine-tuning an existing model will get you 90% of the way there.
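To make option 2 concrete, here is a minimal fine-tuning sketch using sentence-transformers' classic fit API with MultipleNegativesRankingLoss, which treats each (query, relevant document) pair as a positive and every other document in the batch as a negative. Model and path names are placeholders:

```python
def finetune_embedder(pairs, base_model="BAAI/bge-large-en-v1.5",
                      output_dir="./finetuned-embedder", epochs=1):
    """pairs: list of (query, relevant_document) string tuples."""
    if not pairs:
        raise ValueError("need at least one (query, relevant_document) pair")
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=32)
    # In-batch negatives: other documents in each batch act as negatives.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs,
              output_path=output_dir)
```

This loss works well precisely because it only needs positive pairs, which matches the labeled data described above.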
The only reliable way to know if a model works for your domain is to test it on your domain. Here is a simple evaluation protocol: collect 50 to 100 real queries from your application, manually label the document that best answers each one, embed your corpus with each candidate model, and measure how often each model ranks the labeled document near the top.
This takes a few hours of manual work to build the evaluation set, but it will save you from deploying a model that looks good on MTEB but fails on your data.
Theory is useful, but eventually you need numbers. This section walks you through benchmarking embedding models on your own data, measuring the three things that matter: retrieval quality, latency, and cost.
Two metrics dominate embedding evaluation:
Mean Reciprocal Rank (MRR) measures where the first relevant document appears in your ranked results. If the correct document is ranked first, the reciprocal rank is 1. If it is ranked third, the reciprocal rank is 1/3. MRR is the average of these across all queries.
Recall@K measures how often the correct document appears anywhere in the top K results. Recall@10 asks: "Is the right answer somewhere in the top 10?" This is particularly important for RAG systems, where you feed the top K documents to an LLM. If the right document is not in that set, the LLM cannot use it.
Here is a benchmarking harness you can adapt to compare multiple embedding models:
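The following is a sketch of such a harness. The metric functions implement MRR and Recall@K exactly as defined above; `embed_fn` is any function that maps a list of strings to unit-normalized vectors (wrap sentence-transformers or an API client), so the harness itself stays model-agnostic:

```python
import numpy as np

def mrr(ranks):
    """Mean Reciprocal Rank from 0-based ranks of the relevant document
    (use None when the relevant document never appears)."""
    return float(np.mean([0.0 if r is None else 1.0 / (r + 1) for r in ranks]))

def recall_at_k(ranks, k):
    """Fraction of queries whose relevant document lands in the top k."""
    return float(np.mean([0.0 if r is None else float(r < k) for r in ranks]))

def benchmark(embed_fn, queries, docs, relevant_idx, k=10):
    """relevant_idx[i] is the index into `docs` of the document that
    answers queries[i]."""
    q = np.asarray(embed_fn(queries))
    d = np.asarray(embed_fn(docs))
    sims = q @ d.T                      # cosine similarity (unit vectors)
    order = np.argsort(-sims, axis=1)   # best-first doc ranking per query
    ranks = [int(np.where(order[i] == rel)[0][0])
             for i, rel in enumerate(relevant_idx)]
    return {"mrr": mrr(ranks), f"recall@{k}": recall_at_k(ranks, k)}
```

Plugging in a wrapper per candidate model and printing the resulting dictionaries side by side gives you the comparison on your own data.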
With only a few examples, most models will get perfect scores. The differences become visible when you test on hundreds of query-document pairs with many distractor documents.