Last Updated: March 14, 2026
Embeddings are the foundation of many modern AI systems, from semantic search and recommendation engines to retrieval-augmented generation (RAG). But not all embedding models are the same. Different models are optimized for different tasks, languages, and performance constraints.
Choosing the right embedding model can have a significant impact on the quality of your application. A model that works well for short search queries may perform poorly on long documents. A model optimized for English text may struggle with multilingual data. Some models prioritize accuracy, while others are designed to be smaller, faster, and cheaper to run.
In this chapter, we will explore how to choose the right embedding model for your use case.
Most searches for "best embedding model" lead to the MTEB leaderboard. MTEB stands for Massive Text Embedding Benchmark, and it is the closest thing the embedding world has to a standardized test. Understanding how to read it (and where it falls short) is the first skill you need.
MTEB evaluates embedding models across multiple task categories, each testing a different capability: retrieval, classification, clustering, reranking, pair classification, semantic textual similarity (STS), summarization, and bitext mining.
The most important category for AI engineering work is Retrieval. This is what matters when you are building RAG systems, search engines, or recommendation systems. It measures how well a model can find relevant documents given a query. The other categories matter too, but if you are choosing an embedding model for a search or RAG application, retrieval performance should be your primary filter.
Here is how to interpret a typical MTEB leaderboard entry.
Do not sort by overall score and pick the top model. The overall score averages across all task categories, including ones you probably do not care about. A model might rank first overall because it dominates at clustering and classification, while being mediocre at retrieval.
Instead, follow this process: sort by the Retrieval category score, discard models whose dimensions, license, or cost do not fit your constraints, and shortlist the handful that remain for testing on your own data.
Here is where things get tricky. MTEB is useful, but it has blind spots that can lead you to a bad choice if you rely on it exclusively.
MTEB benchmarks use general-purpose datasets like MS MARCO (web search queries) and NQ (Wikipedia questions). If your data is medical records, legal contracts, or source code, a model that scores well on MTEB might perform poorly on your specific domain. There is no substitute for testing on your own data.
MTEB does not measure how fast a model generates embeddings, nor the downstream cost of vector size. A model that outputs 3072-dimensional vectors is slower to search and more expensive to store than one that outputs 384 dimensions, but the leaderboard does not capture that.
Some top-performing models are API-based and cost money per token. Others are open-source and free to run (if you have the hardware). MTEB treats them equally.
Most MTEB benchmarks are English-focused. If your application serves multiple languages, you need to look at MTEB's multilingual subsets separately.
The bottom line: use MTEB as a starting point to narrow your options from hundreds to five or six candidates. Then benchmark those candidates on your own data. We will cover exactly how to do that later in this chapter.
Every embedding model outputs vectors of a fixed size, and this number matters more than most people realize. It affects storage costs, search speed, memory usage, and retrieval quality. Here is what the common dimension sizes mean in practice.
Those per-vector sizes look tiny, but they add up fast. If you have 10 million documents, here is what your vector storage looks like:
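As a back-of-the-envelope calculation (assuming 4-byte float32 values and ignoring any index overhead, which real vector databases add on top):

```python
BYTES_PER_FLOAT32 = 4
NUM_DOCS = 10_000_000

# Raw vector storage for 10 million documents at common dimension sizes.
for dims in (384, 768, 1536, 3072):
    gb = dims * BYTES_PER_FLOAT32 * NUM_DOCS / 1e9
    print(f"{dims} dims: {gb:.1f} GB")

# 384 dims: 15.4 GB
# 768 dims: 30.7 GB
# 1536 dims: 61.4 GB
# 3072 dims: 122.9 GB
```

Since approximate-nearest-neighbor indexes typically keep vectors in RAM, these numbers translate fairly directly into memory requirements as well.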
Going from 384 to 3072 dimensions means 8x more storage and 8x more RAM. For a startup with 100,000 documents, the difference is negligible. For a company with 100 million documents, the difference is thousands of dollars per month in infrastructure costs.
More dimensions generally mean better retrieval quality, but the relationship is not linear. The jump from 384 to 768 dimensions gives a significant quality boost. The jump from 1536 to 3072 gives a much smaller improvement, often just 1-2% on benchmarks.
These quality improvements are approximate and vary by dataset, but the pattern holds: diminishing returns as dimensions increase. For most applications, 768 or 1024 dimensions hit the sweet spot between quality and efficiency.
Some newer models support a technique called Matryoshka Representation Learning (MRL). The idea is clever: the model is trained so that the first N dimensions of a larger embedding are themselves a valid, useful embedding. You can take a 3072-dimensional vector, truncate it to 1024 dimensions, and still get good retrieval quality.
OpenAI's text-embedding-3-large supports this. You can request any dimension size up to 3072:
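A sketch using the OpenAI Python SDK; the `dimensions` parameter is part of the official embeddings endpoint, and the call requires an `OPENAI_API_KEY`. The `truncate_embedding` helper shows what MRL truncation does under the hood, in case you ever need to shorten already-stored vectors yourself:

```python
import numpy as np

def truncate_embedding(vec, dims):
    """MRL-style truncation: keep the first `dims` values, then
    re-normalize to unit length so cosine similarity still behaves."""
    v = np.asarray(vec, dtype=np.float64)[:dims]
    return v / np.linalg.norm(v)

def embed(text, dims=1024):
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dims,  # any value up to 3072
    )
    return resp.data[0].embedding
```

Calling `embed("some text", dims=1024)` returns a 1024-dimensional vector directly from the API; `truncate_embedding` produces the same effect locally from a full 3072-dimensional vector.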
This gives you flexibility. You can start with 1024 dimensions to keep costs low, and if you need better quality later, re-embed with 3072 dimensions without switching models.
If you do not want to manage GPU infrastructure, API-based models are the simplest path. You send text, you get vectors back. Here are the three main providers worth considering.
OpenAI offers two embedding models: text-embedding-3-small and text-embedding-3-large.
OpenAI's biggest advantage is simplicity. The API is clean, the documentation is solid, and you are probably already using OpenAI for your LLM calls. The downside is that your embeddings are tied to their infrastructure, and you cannot run these models locally.
Cohere's embed-v3 models, available in English and multilingual variants, are specifically designed for search and retrieval.
What makes Cohere unique is the input_type parameter. You tell the model whether you are embedding a search query or a document, and it adjusts the embedding accordingly:
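A minimal sketch using Cohere's classic Python client (model name embed-english-v3.0; the client reads your API key from the environment). The valid `input_type` values below are the ones Cohere documents for v3 models:

```python
VALID_INPUT_TYPES = {"search_query", "search_document", "classification", "clustering"}

def cohere_embed(texts, input_type):
    """Embed texts asymmetrically: queries and documents get different treatment."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(f"input_type must be one of {sorted(VALID_INPUT_TYPES)}")
    import cohere  # pip install cohere
    co = cohere.Client()
    resp = co.embed(texts=texts, model="embed-english-v3.0", input_type=input_type)
    return resp.embeddings
```

At index time you call `cohere_embed(docs, input_type="search_document")`; at query time, `cohere_embed([query], input_type="search_query")`. That split is the whole trick.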
This asymmetric embedding approach often produces better search results because queries and documents are fundamentally different. Queries are short and intent-driven. Documents are long and information-dense. Treating them differently makes sense.
Voyage AI is the specialist. They build domain-specific embedding models for code, law, and finance that outperform general-purpose ones in their target domain.
If you are building a code search engine or a legal document retrieval system, Voyage's domain-specific models are worth testing. They often beat larger, more expensive general-purpose models on in-domain tasks.
API models are convenient, but they come with trade-offs: ongoing costs, network latency, data privacy concerns, and vendor lock-in. Open-source models solve all of these. You download the model, run it on your own hardware, and pay nothing per request.
The families that consistently appear at the top of the MTEB leaderboard include BAAI's BGE models, Microsoft's E5 series, and, at the larger end, 7B-parameter instruction-tuned embedders built on LLM backbones.
A clear pattern emerges: smaller models (137M-568M parameters) can run on a CPU and still deliver good results. The 7B-parameter models need a GPU but rival API-based models in quality.
The sentence-transformers library makes running these models straightforward.
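A minimal sketch, assuming BAAI/bge-large-en-v1.5 as the model (any sentence-transformers checkpoint works the same way) and `pip install sentence-transformers`; the model weights download on first use:

```python
def embed_texts(texts, model_name="BAAI/bge-large-en-v1.5"):
    """Return unit-normalized embeddings for a list of strings."""
    if not texts:
        raise ValueError("texts must be a non-empty list of strings")
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    # normalize_embeddings=True gives unit-length vectors, so cosine
    # similarity reduces to a plain dot product.
    return model.encode(texts, normalize_embeddings=True)
```

For this model, `embed_texts(["The cat sat on the mat."])` should return a numpy array of shape (1, 1024), since bge-large produces 1024-dimensional vectors.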
A few things to note. The normalize_embeddings=True parameter ensures all vectors have unit length, which makes cosine similarity equivalent to a simple dot product. This is a small optimization that matters at scale. Also, BGE models expect a specific instruction prefix on queries (but not on documents): "Represent this sentence for searching relevant passages: ".
This is similar to Cohere's asymmetric approach, just implemented through text prefixes rather than a separate parameter.
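The normalization claim is easy to verify with a few lines of numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Cosine similarity of the raw vectors...
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals the plain dot product of their unit-normalized versions,
# which is what normalize_embeddings=True produces.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

print(np.isclose(cosine, dot))  # True
```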
The decision often comes down to three factors. If data privacy is non-negotiable (healthcare, finance, government), open-source models that run on your infrastructure are the only option. If you are processing millions of documents, the cost savings of open-source models compound quickly. A model that costs $0.10 per million tokens adds up to thousands of dollars when you are embedding 100 million tokens. Running the same model locally costs only the electricity.
On the other hand, if you need the absolute highest quality and your volume is moderate (under a few million embeddings), API models are hard to beat. You get strong performance with zero infrastructure overhead.
Everything we have discussed so far deals with text. But what if your application involves images, or a mix of text and images?
Text-only embeddings work when both your queries and your documents are text. But some applications need to bridge modalities: searching a photo library with text queries, finding a product from a picture of it, or retrieving the diagram that matches a written description.
For these use cases, you need a model that maps both text and images into the same embedding space. When a text query and a relevant image end up near each other in this shared space, you can do cross-modal retrieval.
OpenAI's CLIP (Contrastive Language-Image Pre-training) was the model that made multimodal embeddings mainstream. It was trained on 400 million image-text pairs from the internet, learning to place matching text and images close together in a shared 512-dimensional space.
Today, several multimodal embedding options exist, from open-source successors to CLIP such as OpenCLIP and SigLIP to commercial multimodal embedding APIs.
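As a sketch of cross-modal scoring with the original CLIP via Hugging Face transformers (the checkpoint name openai/clip-vit-base-patch32 is one public option; requires `pip install transformers pillow torch`):

```python
def clip_scores(texts, image_path, model_name="openai/clip-vit-base-patch32"):
    """Score each caption against one image in CLIP's shared embedding space."""
    if not texts:
        raise ValueError("texts must be a non-empty list of strings")
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=texts, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    # logits_per_image holds the image-text similarity for each caption.
    return model(**inputs).logits_per_image[0].tolist()
```

Calling `clip_scores(["a dog on a beach", "a spreadsheet"], "photo.jpg")` (with photo.jpg being whatever image you have on disk) returns one score per caption, and the matching caption should score higher.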
For most AI engineering work, text-only embeddings are what you need. RAG systems, semantic search over documents, chatbot knowledge bases: these are all text-to-text. Multimodal models come with trade-offs: they tend to score lower on text-only retrieval tasks than specialized text embedding models, and they are more expensive to run.
Use multimodal embeddings only when you actually need cross-modal search. If your documents are pure text and your queries are pure text, stick with a text-only model. You will get better results.
A general-purpose model trained on web text, Wikipedia, and news articles will struggle with legal jargon, medical terminology, or source code. The words mean different things. "Injection" in a medical context has nothing to do with "SQL injection." "Cell" in biology is unrelated to "cell" in a spreadsheet. General models learned from broad data, and they will conflate these meanings.
The failure mode is subtle. Your search system will return results, they just will not be the right results. A developer searching for "memory leak in connection pool" in a codebase might get results about "memory" in a general sense instead of the specific pattern of unreleased database connections. The embeddings are close enough to return something, but not precise enough to return the right thing.
You have three paths, listed from least effort to most:
Option 1: Check if someone has already trained a model for your domain. Voyage AI has models for code, law, and finance. Hugging Face hosts medical models like BiomedBERT (formerly PubMedBERT). If a domain-specific model exists, it will almost certainly outperform a general-purpose model on your data.
Option 2 is fine-tuning. You take a strong general-purpose model like BGE-large or E5-large and continue training it on your domain data. This requires labeled pairs (query, relevant document), and typically 1,000 to 10,000 examples to see meaningful improvement. The sentence-transformers library makes this relatively straightforward.
Option 3, training from scratch, is almost never the right answer for most teams. It requires millions of training examples and significant compute. Unless you are working in an extremely niche domain with its own language (think molecular biology notation or ancient language scripts), fine-tuning an existing model will get you 90% of the way there.
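To make option 2 concrete, here is a minimal fine-tuning sketch using sentence-transformers' classic fit API with MultipleNegativesRankingLoss, which treats each (query, relevant document) pair as a positive and every other document in the batch as a negative. Model and path names are placeholders:

```python
def finetune_embedder(pairs, base_model="BAAI/bge-large-en-v1.5",
                      output_dir="./finetuned-embedder", epochs=1):
    """pairs: list of (query, relevant_document) string tuples."""
    if not pairs:
        raise ValueError("need at least one (query, relevant_document) pair")
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=32)
    # In-batch negatives: other documents in each batch act as negatives.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs,
              output_path=output_dir)
```

This loss works well precisely because it only needs positive pairs, which matches the labeled data described above.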
The only reliable way to know if a model works for your domain is to test it on your domain. Here is a simple evaluation protocol: collect 50 to 100 real queries from your application, manually label the document that best answers each one, embed your corpus with each candidate model, and measure how often each model ranks the labeled document near the top.
This takes a few hours of manual work to build the evaluation set, but it will save you from deploying a model that looks good on MTEB but fails on your data.
Theory is useful, but eventually you need numbers. This section walks you through benchmarking embedding models on your own data, measuring the three things that matter: retrieval quality, latency, and cost.
Two metrics dominate embedding evaluation:
Mean Reciprocal Rank (MRR) measures where the first relevant document appears in your ranked results. If the correct document is ranked first, the reciprocal rank is 1. If it is ranked third, the reciprocal rank is 1/3. MRR is the average of these across all queries.
Recall@K measures how often the correct document appears anywhere in the top K results. Recall@10 asks: "Is the right answer somewhere in the top 10?" This is particularly important for RAG systems, where you feed the top K documents to an LLM. If the right document is not in that set, the LLM cannot use it.
Here is a benchmarking harness you can adapt to compare multiple embedding models:
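The following is a sketch of such a harness. The metric functions implement MRR and Recall@K exactly as defined above; `embed_fn` is any function that maps a list of strings to unit-normalized vectors (wrap sentence-transformers or an API client), so the harness itself stays model-agnostic:

```python
import numpy as np

def mrr(ranks):
    """Mean Reciprocal Rank from 0-based ranks of the relevant document
    (use None when the relevant document never appears)."""
    return float(np.mean([0.0 if r is None else 1.0 / (r + 1) for r in ranks]))

def recall_at_k(ranks, k):
    """Fraction of queries whose relevant document lands in the top k."""
    return float(np.mean([0.0 if r is None else float(r < k) for r in ranks]))

def benchmark(embed_fn, queries, docs, relevant_idx, k=10):
    """relevant_idx[i] is the index into `docs` of the document that
    answers queries[i]."""
    q = np.asarray(embed_fn(queries))
    d = np.asarray(embed_fn(docs))
    sims = q @ d.T                      # cosine similarity (unit vectors)
    order = np.argsort(-sims, axis=1)   # best-first doc ranking per query
    ranks = [int(np.where(order[i] == rel)[0][0])
             for i, rel in enumerate(relevant_idx)]
    return {"mrr": mrr(ranks), f"recall@{k}": recall_at_k(ranks, k)}
```

Plugging in a wrapper per candidate model and printing the resulting dictionaries side by side gives you the comparison on your own data.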
With only a few examples, most models will get perfect scores. The differences become visible when you test on hundreds of query-document pairs with many distractor documents.