rshah4 t1_jbtfzig wrote

Two quick tips for finding the best embedding models:

Sentence Transformers documentation compares models: https://www.sbert.net/docs/pretrained_models.html

Massive Text Embedding Benchmark (MTEB) Leaderboard has 47 different models: https://huggingface.co/spaces/mteb/leaderboard

These will help you compare different models across a lot of benchmark datasets so you can figure out the best one for your use case.

rshah4 t1_jbtsl7o wrote

Also, not sure about a recent comparison, but Nils Reimers also tried to empirically analyze OpenAI's embeddings here: https://twitter.com/Nils_Reimers/status/1487014195568775173

He found that across 14 datasets the OpenAI 175B model is actually worse than a tiny 22M-parameter MiniLM model that can run in your browser.

Non-jabroni_redditor t1_jbu2shx wrote

That’s to be expected, no? No model is going to be perfect regardless of how it performs across a collection of datasets as a whole.

JClub t1_jbwu3lx wrote

More than that, GPT is unidirectional, which really isn't great for a sentence embedder.

phys_user t1_jbw7i59 wrote

Looks like text-embedding-ada-002 is already on the MTEB leaderboard! It comes in at #4 overall, and has the highest performance for clustering.

You might also want to look into SentEval, which can help you test the embedding performance on a variety of tasks: https://github.com/facebookresearch/SentEval

vintage2019 t1_jbzzadd wrote

Has anyone ranked models with that and published the results?
