rshah4 t1_jbtfzig wrote

Two quick tips for finding the best embedding models:

Sentence Transformers documentation compares models: https://www.sbert.net/docs/pretrained_models.html

Massive Text Embedding Benchmark (MTEB) Leaderboard has 47 different models: https://huggingface.co/spaces/mteb/leaderboard

These will help you compare different models across a lot of benchmark datasets so you can figure out the best one for your use case.

rshah4 t1_jbtsl7o wrote

Also, not sure about a recent comparison, but Nils Reimers also tried to empirically analyze OpenAI's embeddings here: https://twitter.com/Nils_Reimers/status/1487014195568775173

He found that across 14 datasets the OpenAI 175B model is actually worse than a tiny 22M-parameter MiniLM model that can run in your browser.

Non-jabroni_redditor t1_jbu2shx wrote

That’s to be expected, no? No model is going to be perfect regardless of how it performs across a collection of datasets as a whole.

JClub t1_jbwu3lx wrote

More than that, GPT is unidirectional, which really isn't great for a sentence embedder.

phys_user t1_jbw7i59 wrote

Looks like text-embedding-ada-002 is already on the MTEB leaderboard! It comes in at #4 overall, and has the highest performance for clustering.

You might also want to look into SentEval, which can help you test the embedding performance on a variety of tasks: https://github.com/facebookresearch/SentEval

vintage2019 t1_jbzzadd wrote

Has anyone ranked models with that and published the results?
