quitenominal t1_jbtr6g7 wrote
Reply to comment by Simusid in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
FWIW this has also been my finding when comparing these two embeddings for classification tasks. The OpenAI embeddings were better, but not by enough to justify the cost.
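For anyone curious what a comparison like this looks like in practice: you fit the same simple classifier on each set of embeddings and compare accuracy. A rough sketch below - the texts, labels, and model name are illustrative placeholders, not the original experiment:

```python
# Illustrative sketch: use sentence embeddings as features for a classifier.
# The texts/labels are toy placeholders, not real evaluation data.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = ["great product", "terrible service", "loved it", "awful experience"]
labels = np.array([1, 0, 1, 0])

# SentenceTransformer embeddings; the OpenAI side of the comparison would
# swap in vectors from OpenAI's embeddings endpoint instead.
features = SentenceTransformer("all-mpnet-base-v2").encode(texts)

scores = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=2)
print("mean accuracy:", scores.mean())
```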
quitenominal t1_jbtqio0 wrote
Reply to comment by Simusid in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Nice explainer! I think this is good for those with some familiarity with linear algebra. I've added a further explanation below that goes one level simpler.
quitenominal t1_jbtptri wrote
Reply to comment by deliciously_methodic in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
An embedding is a numerical representation of some data. In this case the data is text.
These representations (read: lists of numbers) can be learned with some goal in mind. Usually you want the embeddings of similar data to be close to one another, and the embeddings of disparate data to be far apart.
Often these lists of numbers are very long - I think the ones from the model above are 768 numbers each. So each piece of text is transformed into a list of 768 numbers, and similar texts will get similar lists of numbers.
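To make that concrete, here's a minimal sketch using the sentence-transformers library. The model name is my assumption - all-mpnet-base-v2 is one model that outputs 768-dimensional vectors:

```python
# Minimal sketch: embed a few sentences and compare them.
# all-mpnet-base-v2 is one model that produces 768-dimensional embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten was resting on the rug.",
    "Quarterly earnings beat expectations.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768) -- one list of 768 numbers per sentence

# Similar sentences end up with similar lists of numbers,
# i.e. a high cosine similarity between their vectors.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: similar meaning
print(util.cos_sim(embeddings[0], embeddings[2]))  # lower: unrelated topic
```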
What's being visualized above is a 2-number summary of those 768. This is referred to as a projection, like how a 3D wireframe casts a 2D shadow. It lets us visualize the embeddings and gives a qualitative sense of their 'goodness' - a.k.a. are they grouping things as I expect? (Similar texts close together, disparate texts far apart.)
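If you want to see roughly how such a projection is made, here's a sketch using PCA from scikit-learn - just one common choice, and the plot above may well use something else like UMAP or t-SNE. It continues from the `embeddings` array in the previous snippet:

```python
# Sketch of projecting 768-dimensional embeddings down to 2 numbers each.
# PCA is one common technique; UMAP and t-SNE are popular alternatives.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# `embeddings` is the (n_sentences, 768) array from the previous snippet
coords = PCA(n_components=2).fit_transform(embeddings)  # shape: (n_sentences, 2)

plt.scatter(coords[:, 0], coords[:, 1])
plt.title("2D projection of 768-dim sentence embeddings")
plt.show()
```

If the embeddings are good, texts about the same topic should land near each other in the 2D plot.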
quitenominal t1_jdw15ao wrote
Reply to comment by esquire900 in [D] Instruct Datasets for Commercial Use by JohnyWalkerRed
It's in the terms that you can't use data generated through OpenAI's services to compete with OpenAI - and I believe they'd be able to argue competition if the trained model were used commercially.
See section 2.C.iii of https://openai.com/policies/terms-of-use