Embedding Model Leaderboard

Reference: comparison data from the MTEB Leaderboard

Choosing the right embedding model is one of the most important decisions in your RAG pipeline. A bad model means bad retrieval. Bad retrieval means bad answers. No amount of prompt engineering fixes that.

This page gives you the data to make that decision. No fluff, just numbers and recommendations.


All embedding models do the same thing: turn text into a list of numbers (a vector) that captures meaning. But they do it with wildly different quality.

A good model puts “How do I reset my password?” and “Steps to change your login credentials” close together in vector space. A bad model puts them far apart because the words are different — even though the meaning is the same.

The difference between 60% and 85% retrieval accuracy often comes down to the embedding model. All else being equal, a better model retrieves better chunks, and better chunks produce better answers.
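"Close together in vector space" is usually measured with cosine similarity. Here is a minimal sketch with hand-made toy 3-dimensional vectors (real embeddings have 384-3072 dimensions, and these numbers are invented for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: ~1.0 = same direction, ~0.0 = unrelated."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" for three sentences:
reset_password = [0.9, 0.1, 0.2]       # "How do I reset my password?"
change_credentials = [0.8, 0.2, 0.3]   # paraphrase: should score high
pizza_recipe = [0.1, 0.9, 0.1]         # unrelated: should score low

print(cosine_similarity(reset_password, change_credentials))  # high
print(cosine_similarity(reset_password, pizza_recipe))        # low
```

A good embedding model is one that assigns paraphrases high similarity and unrelated text low similarity, exactly as the toy vectors do here.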


These models are ranked by their approximate MTEB (Massive Text Embedding Benchmark) retrieval scores. MTEB is the industry-standard benchmark for comparing embedding models across tasks like search, classification, and clustering.

Scores below are approximate and represent the retrieval subset of MTEB. Check the MTEB Leaderboard for the latest numbers, as new models are released frequently.

| Model | Provider | Dimensions | MTEB Retrieval (approx) | Speed | Cost | Best For |
|---|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | ~55 | Fast (API) | $0.13/1M tokens | Production systems with budget |
| text-embedding-3-small | OpenAI | 1536 | ~51 | Fast (API) | $0.02/1M tokens | Cost-effective production |
| E5 Large V2 | Microsoft (intfloat) | 1024 | ~50 | Medium (local) | Free (open-source) | General-purpose. Requires "query:"/"passage:" prefix |
| Multilingual E5 Large | Microsoft (intfloat) | 1024 | ~49 | Medium (local) | Free (open-source) | 100+ languages, cross-lingual retrieval |
| voyage-large-2 | Voyage AI | 1536 | ~55 | Fast (API) | $0.12/1M tokens | Code search, technical docs |
| Cohere embed-v3 | Cohere | 1024 | ~54 | Fast (API) | Free tier available | Multilingual, production |
| CLIP (ViT-B/32) | OpenAI | 512 | N/A (multi-modal) | Fast (local) | Free (open-source) | Multi-modal: text + images in same space |
| Salesforce SFR V2 Small | Salesforce | 256 | ~48 | Fast (API) | Salesforce Data Cloud | Compact, optimised for Salesforce RAG |
| bge-large-en-v1.5 | BAAI | 1024 | ~54 | Medium (local) | Free (open-source) | Self-hosted production |
| gte-large-en-v1.5 | Alibaba | 1024 | ~52 | Medium (local) | Free (open-source) | Self-hosted, general purpose |
| nomic-embed-text-v1.5 | Nomic AI | 768 | ~53 | Medium (local) | Free (open-source) | Long documents (8192 token context) |
| all-MiniLM-L6-v2 | Sentence Transformers | 384 | ~41 | Very fast (local) | Free (open-source) | Learning, prototyping, in-browser (Playground) |
| bge-small-en-v1.5 | BAAI | 384 | ~46 | Very fast (local) | Free (open-source) | In-browser via Transformers.js (Playground) |
| gte-small | Alibaba | 384 | ~44 | Very fast (local) | Free (open-source) | In-browser, multilingual (Playground) |
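A note on the E5 prefix requirement mentioned in the table: per the intfloat model cards, E5 models are trained with role prefixes, and embedding raw text without them degrades retrieval quality. A minimal sketch of the convention (the helper names here are ours, not part of any library):

```python
def e5_query(text: str) -> str:
    """Prefix for the search-query side, per the E5 model card convention."""
    return f"query: {text}"

def e5_passage(text: str) -> str:
    """Prefix for the document/chunk side, per the E5 model card convention."""
    return f"passage: {text}"

# Embed e5_query(...) for user questions and e5_passage(...) for your chunks,
# e.g. model.encode(e5_query("How do I reset my password?"))
print(e5_query("How do I reset my password?"))
```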

Dimensions — The length of the vector each model produces. Higher dimensions can capture more nuance but use more memory and storage. 384 dimensions is compact. 3072 is rich but expensive to store at scale.

MTEB Retrieval — A score from the retrieval subset of the MTEB benchmark. Higher is better. This measures how well the model ranks relevant documents above irrelevant ones. Scores above 50 are strong. Above 54 is excellent.

Speed — How fast the model converts text to vectors. API-based models are fast because they run on powerful servers. Local models depend on your hardware. “Very fast” means even a laptop CPU handles it well.

Cost — API models charge per token. Open-source models are free to run but use your own compute. “Free” means the model weights are open and you can run them anywhere.
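The per-token pricing in the table translates into a simple estimate. A hedged sketch (prices are the approximate figures listed above and change over time):

```python
def embedding_cost_usd(n_tokens: int, price_per_million_usd: float) -> float:
    """API embedding cost: providers bill per input token."""
    return n_tokens / 1_000_000 * price_per_million_usd

# Embedding a 10-million-token corpus (roughly 7-8 million English words):
print(embedding_cost_usd(10_000_000, 0.02))  # text-embedding-3-small
print(embedding_cost_usd(10_000_000, 0.13))  # text-embedding-3-large
```

Even the most expensive API model in the table costs only a few dollars to embed a sizeable corpus; the recurring cost is re-embedding documents as they change and embedding every incoming query.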


Just learning or prototyping? Use all-MiniLM-L6-v2.

It is small (90MB), fast on CPU, free, and produces 384-dimensional vectors. The quality is lower than the big models, but it is more than enough to learn the concepts and build working demos. This is the model we use in Lab 1 and the Playground.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["How do I reset my password?"])  # shape (1, 384)
```

Building a production system on a budget? Use text-embedding-3-small from OpenAI.

At $0.02 per million tokens, it is extremely cheap. The quality is solid. For most business use cases — internal knowledge bases, customer support bots, document search — this is the sweet spot.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your text here",
)
embedding = response.data[0].embedding  # list of 1536 floats
```

Need maximum retrieval quality? Use text-embedding-3-large from OpenAI or voyage-large-2 from Voyage AI.

These models score highest on retrieval benchmarks. Use them when retrieval quality directly impacts your product (legal search, medical Q&A, compliance systems). The cost difference between small and large is meaningful at scale, so only upgrade when you have measured that it actually improves your results.

Self-hosting? Use bge-large-en-v1.5 or nomic-embed-text-v1.5.

Both are open-source, free to run, and competitive with commercial APIs. bge-large-en-v1.5 has excellent benchmark scores. nomic-embed-text-v1.5 supports up to 8192 tokens per input, which is useful for long documents where you want larger chunks.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
```

You will need a machine with at least 4GB of RAM. A GPU helps but is not required.

Working across languages? Use Cohere embed-v3.

Cohere’s model supports 100+ languages and performs well across all of them. If your documents are in multiple languages, or if users query in one language and documents are in another, this is the best choice. Cohere also offers a generous free tier.


More dimensions mean more information per vector, but they also bring trade-offs:

  • More storage. A million float32 vectors at 384 dimensions take ~1.5GB; at 3072 dimensions, ~12.3GB (4 bytes per dimension).
  • Slower search. Comparing longer vectors takes more computation. The difference is small for thousands of vectors, noticeable for millions.
  • Marginal returns. Going from 384 to 1024 dimensions is a big quality jump. Going from 1024 to 3072 is a smaller one.

For most projects under 100,000 documents, dimensions do not matter much for performance. Choose based on quality and cost, not dimensions.
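The storage figures above are straightforward float32 arithmetic. A minimal sketch, ignoring metadata and index overhead:

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of a flat float32 vector index (excludes metadata and index overhead)."""
    return n_vectors * dims * bytes_per_value / 1e9

print(index_size_gb(1_000_000, 384))   # ~1.5 GB
print(index_size_gb(1_000_000, 3072))  # ~12.3 GB
```

Run the numbers for your own corpus size before worrying about dimensions; below a few hundred thousand vectors, even 3072 dimensions fits comfortably in memory.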


The MTEB leaderboard tells you how models perform on standardised benchmarks. Your data is not standardised. The model that scores highest on MTEB might not be the best for your specific documents and queries.

Here is how to test:

  1. Create 20 test queries with known correct answers from your documents.
  2. Run each query through your pipeline with Model A. Record which chunks are retrieved.
  3. Swap to Model B. Run the same queries. Record results.
  4. Compare. For each query, did the model retrieve the chunks that contain the correct answer? Count how many queries each model gets right.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def test_model(model_name, chunks, queries, expected_chunk_indices):
    """Simple benchmark: how often does the model retrieve the right chunk?"""
    model = SentenceTransformer(model_name)
    # Normalise embeddings so the dot product below is cosine similarity
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    correct = 0
    for query, expected_idx in zip(queries, expected_chunk_indices):
        query_embedding = model.encode(query, normalize_embeddings=True)
        similarities = np.dot(chunk_embeddings, query_embedding)
        top_3 = np.argsort(similarities)[-3:][::-1]  # indices of the 3 closest chunks
        if expected_idx in top_3:
            correct += 1
    accuracy = correct / len(queries)
    print(f"{model_name}: {accuracy:.0%} ({correct}/{len(queries)} queries correct)")
    return accuracy

# Example usage:
# test_model("all-MiniLM-L6-v2", my_chunks, my_queries, my_expected_indices)
# test_model("BAAI/bge-large-en-v1.5", my_chunks, my_queries, my_expected_indices)
```

This takes 30 minutes to set up and gives you a definitive answer for your data. It is always better than trusting leaderboard scores alone.


The embedding model landscape changes fast. New models appear on the MTEB leaderboard regularly. The models listed here were strong choices as of early 2026, but check the MTEB Leaderboard before committing to a model for a production system.

The evaluation process described above does not change, regardless of which models are available. Pick 2-3 candidates from the leaderboard, test them on your data, and go with the winner.


All benchmark data referenced on this page is sourced from the MTEB Leaderboard on Hugging Face. Individual model documentation is linked from each provider’s official site.