
Simusid OP t1_jbsyp5n wrote

Yesterday I set up a paid account at OpenAI. I have been using the free sentence-transformers library and models for many months with good results. I compared the two by encoding 20K texts from this repo: https://github.com/mhjabreel/CharCnn_Keras. I did no preprocessing or cleanup of the input text. The OpenAI model is text-embedding-ada-002 and the SentenceTransformer model is all-mpnet-base-v2. The plots are simple UMAP(), with all defaults.

I also built a very generic classifier with 3 dense layers, nothing fancy. I trained it ten times on each set of embeddings, fitting with EarlyStopping and evaluating on held-out data. The average accuracy was 89% for the SentenceTransformer (HF) embeddings and 91.1% for OpenAI. This is not rigorous or conclusive, but for my purposes I'm happy sticking with SentenceTransformers. If I ever need to chase decimal points of performance, I will use OpenAI.

Edit - The second graph should be titled "SentenceTransformer" not HuggingFace.
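For anyone who wants to reproduce this, here is a minimal sketch of the pipeline. The model names are the ones above; the batching and the OpenAI() client usage are my assumptions about a current openai-python setup, and the text loading is a placeholder.

```python
import numpy as np
import umap
from openai import OpenAI
from sentence_transformers import SentenceTransformer

texts = ["placeholder one", "placeholder two"]  # in practice: the ~20K snippets from CharCnn_Keras

# SentenceTransformer embeddings (768-dim for all-mpnet-base-v2)
st_model = SentenceTransformer("all-mpnet-base-v2")
st_emb = st_model.encode(texts, show_progress_bar=True)

# OpenAI embeddings (1536-dim for text-embedding-ada-002),
# batched to stay under per-request input limits
client = OpenAI()  # reads OPENAI_API_KEY from the environment
oa_emb = []
for i in range(0, len(texts), 1000):
    resp = client.embeddings.create(model="text-embedding-ada-002",
                                    input=texts[i:i + 1000])
    oa_emb.extend(d.embedding for d in resp.data)
oa_emb = np.array(oa_emb)

# 2-D projections with all-default UMAP, one plot per embedding set
st_2d = umap.UMAP().fit_transform(st_emb)
oa_2d = umap.UMAP().fit_transform(oa_emb)
```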

79

ID4gotten t1_jbt63ni wrote

Maybe I'm being "dense", but what task was your network trained to accomplish? That wasn't clear to me from your description.

43

Simusid OP t1_jbt91tb wrote

My main goal was just to visualize the embeddings to see if they are grossly different. They are not; that is just a qualitative view. My second goal was to use the embeddings with a trivial supervised classifier. The dataset is labeled with four classes, so I made a generic network to see if there was any consistency in the training. Regardless of hyperparameters, the OpenAI embeddings seemed to always outperform the SentenceTransformer embeddings, slightly but consistently.

This was not meant to be rigorous. I did this to get a general feel of the quality of the embeddings, plus to get a little experience with the OpenAI API.
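For concreteness, here is a sketch of roughly what such a generic network looks like in Keras; the layer widths, optimizer, and patience are illustrative guesses, not my exact setup.

```python
from tensorflow import keras

def build_classifier(input_dim, num_classes=4):
    """Three dense layers, nothing fancy."""
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# X: embedding matrix (768-dim ST or 1536-dim ada-002); y: integer labels 0-3
# model = build_classifier(X.shape[1])
# early = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1, epochs=100, callbacks=[early])
# model.evaluate(X_test, y_test)
```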

30

quitenominal t1_jbtr6g7 wrote

FWIW, this has also been my finding when comparing these two embeddings for classification tasks: better, but not enough to justify the cost.

8

polandtown t1_jbu2zqe wrote

Learning here, but how are your axes defined? Some kind of factor(s) or component(s) extracted from each individual embedding? Thanks for the visualization, it made me curious! Good work!

6

Simusid OP t1_jbu3q8m wrote

Here is some explanation about UMAP axes and why they should usually be ignored: https://stats.stackexchange.com/questions/527235/how-to-interpret-axis-of-umap

Basically, it's because the projection is nonlinear: distances and directions in a UMAP plot don't correspond to anything meaningful in the original embedding space, so the axes carry no interpretable units.

12

onkus t1_jbwftny wrote

Doesn’t this also make it essentially impossible to compare the two figures you’ve shown?

6

Thog78 t1_jbyh4w1 wrote

What you're looking for when comparing UMAPs is whether the local relationships are the same. Try to recognize clusters and look at their neighbors, and at whether clusters stay distinct or blend together. A much finer-grained clustering, computed on another reduction (typically PCA) and used to color the points, helps with that; without clustering, you can only try to recognize landmarks by their size and shape.
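Something along these lines, for example (the PCA dimensionality and cluster count are arbitrary illustrative choices, and the variable names reuse the embedding/UMAP sketch upthread):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def pca_cluster_labels(embeddings, n_components=50, n_clusters=20):
    """Cluster in a PCA-reduced space, not in the 2-D UMAP plane."""
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)

# Color both UMAP plots by their PCA-space clusters, then compare which
# clusters sit next to which across the two figures.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(st_2d[:, 0], st_2d[:, 1], c=pca_cluster_labels(st_emb), s=2, cmap="tab20")
ax1.set_title("SentenceTransformer")
ax2.scatter(oa_2d[:, 0], oa_2d[:, 1], c=pca_cluster_labels(oa_emb), s=2, cmap="tab20")
ax2.set_title("OpenAI")
plt.show()
```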

2

Geneocrat t1_jbu4law wrote

Thanks for asking the seemingly obvious questions so that I don't have to wonder.

2

imaginethezmell t1_jbszsey wrote

OpenAI's max input is 8k tokens.

How about SentenceTransformer?

9

montcarl t1_jbtexjk wrote

This is an important point. The similar performance suggests that the text lengths in the 20k dataset were mostly within the SentenceTransformer max-length cutoff. It would be nice to confirm this, and also to run another test with longer examples; that test should show a larger performance gap.
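One way to sanity-check it, assuming the dataset is loaded in a list called texts: sentence-transformers exposes the truncation cutoff directly.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
print(model.max_seq_length)  # 384 by default; longer inputs are silently truncated

# Fraction of the dataset that would be truncated
n_tokens = [len(model.tokenizer.tokenize(t)) for t in texts]
frac = sum(n > model.max_seq_length for n in n_tokens) / len(n_tokens)
print(f"{frac:.1%} of texts exceed the cutoff")
```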

10

Simusid OP t1_jbt13iy wrote

8K? I'm not sure what you're referring to.

3