Submitted by Simusid t3_11okrni in MachineLearning
rshah4 t1_jbtfzig wrote
Two quick tips for finding the best embedding models:
Sentence Transformers documentation compares models: https://www.sbert.net/docs/pretrained_models.html
Massive Text Embedding Benchmark (MTEB) Leaderboard has 47 different models: https://huggingface.co/spaces/mteb/leaderboard
These will help you compare different models across a lot of benchmark datasets so you can figure out the best one for your use case.
rshah4 t1_jbtsl7o wrote
Also, not sure about a recent comparison, but Nils Reimers also tried to empirically analyze OpenAI's embeddings here: https://twitter.com/Nils_Reimers/status/1487014195568775173
He found across 14 datasets that the OpenAI 175B model is actually worse than a tiny MiniLM 22M parameter model that can run in your browser.
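You can reproduce that kind of head-to-head on your own data: the core check is that a good embedder should give paraphrase pairs a higher cosine similarity than unrelated pairs. A toy sketch, with made-up vectors standing in for whatever `model.encode(...)` returns:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; in practice these come from the model under test.
paraphrase_a = np.array([0.9, 0.1, 0.2])
paraphrase_b = np.array([0.8, 0.2, 0.1])
unrelated    = np.array([0.1, 0.9, 0.8])

# A usable embedder should rank the paraphrase pair above the unrelated pair.
assert cosine(paraphrase_a, paraphrase_b) > cosine(paraphrase_a, unrelated)
```

Run the same labeled pairs through each candidate model and count how often the ranking comes out right; that is essentially what the benchmark's retrieval and STS tasks do at scale.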
Non-jabroni_redditor t1_jbu2shx wrote
That’s to be expected, no? No model is going to win on every dataset, regardless of how well it performs across the benchmark suite as a whole.
JClub t1_jbwu3lx wrote
More than that, GPT is unidirectional, which is really not great for a sentence embedder
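For context: bidirectional encoders like MiniLM typically build a sentence vector by mean-pooling token embeddings, where every token's representation has seen full left and right context, whereas in a causal model each token only sees what came before it. A toy sketch of mean pooling with made-up token vectors (the attention mask zeroes out padding):

```python
import numpy as np

# Stand-in per-token embeddings, shape (seq_len, dim); the last row is padding.
token_embeddings = np.array([[1.0, 2.0],
                             [3.0, 4.0],
                             [0.0, 0.0]])
attention_mask = np.array([1, 1, 0])  # 1 for real tokens, 0 for padding

# Mean-pool only over real tokens.
mask = attention_mask[:, None]
sentence_embedding = (token_embeddings * mask).sum(axis=0) / mask.sum()
print(sentence_embedding)  # [2. 3.]
```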
phys_user t1_jbw7i59 wrote
Looks like text-embedding-ada-002 is already on the MTEB leaderboard! It comes in at #4 overall, and has the highest performance for clustering.
You might also want to look into SentEval, which can help you test the embedding performance on a variety of tasks: https://github.com/facebookresearch/SentEval
vintage2019 t1_jbzzadd wrote
Has anyone ranked models with that and published the results?