Submitted by Simusid t3_11okrni in MachineLearning
Simusid OP t1_jbtp8wr wrote
Reply to comment by deliciously_methodic in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
Given three sentences:
- Tom went to the bank to make a payment on his mortgage.
- Yesterday my wife went to the credit union and withdrew $500.
- My friend was fishing along the river bank, slipped and fell in the water.
Reading those, you immediately know that the first two are related because they are both about banks/money/finance. You also know that they are unrelated to the third sentence, even though the first and third share the word "bank". A naive, strictly word-based model might incorrectly associate the first and third sentences.
What we want is a model that can represent the "semantic content" or idea behind a sentence in a way that allows valid mathematical comparisons. We want to create a "metric space". In that space, each sentence will be represented by a vector. Then we use standard math operations to compute the distances between the vectors. In other words, the first two sentences will have vectors that point basically in the same direction, and the third vector will point in a very different direction.
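A toy numpy sketch of that idea, with made-up 4-dimensional vectors (real embeddings have hundreds of dimensions, and a real model would choose the numbers):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for the three sentences above.
mortgage   = np.array([0.9, 0.8, 0.1, 0.0])  # "Tom went to the bank..."
withdrawal = np.array([0.8, 0.9, 0.2, 0.1])  # "...credit union and withdrew $500"
river      = np.array([0.1, 0.0, 0.9, 0.8])  # "...fishing along the river bank"

print(cosine_similarity(mortgage, withdrawal))  # high, near 1: same direction
print(cosine_similarity(mortgage, river))       # low: very different direction
```

The first two vectors point roughly the same way, so their cosine similarity is high; the third points elsewhere.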
The job of the language models (BERT, RoBERTa, all-mpnet-base-v2, etc.) is to do the best possible job of turning sentences into vectors. The outputs of these models are very high-dimensional, 768 dimensions and higher. We cannot visualize that, so we use tools like UMAP, t-SNE, PCA, and eigendecomposition to find the 2 or 3 most important components and then display them as pretty 2D or 3D point clouds.
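For instance, PCA can be done with a plain numpy SVD. This is a minimal sketch using random vectors as a stand-in for real model output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for model output: 100 sentence embeddings, 768 dimensions each.
embeddings = rng.normal(size=(100, 768))

# PCA via SVD: center the data, then project onto the top-2 principal directions.
centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ Vt[:2].T

print(points_2d.shape)  # (100, 2) -- ready to scatter-plot
```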
In short, the embedding is the vector that represents the sentence in a (hopefully) valid metric space.
quitenominal t1_jbtqio0 wrote
Nice explainer! I think this is good for those with some linear algebra familiarity. I added a further explanation going one level simpler again.
utopiah t1_jbtx8iv wrote
> What we want is a model that can represent the "semantic content" or idea behind a sentence
We do, but is that what embeddings actually provide, or rather some kind of distance between items, how they might relate to each other? I'm not sure relatedness alone would be sufficient for most people to convey the "idea" behind a sentence. I'm not saying it's not useful, but I'm arguing against the semantic aspect here, at least from my understanding of that explanation.
Simusid OP t1_jbu0bkv wrote
>We do but is it what embedding actually provide or rather some kind of distance between items,
A single embedding is a single vector, encoding a single sentence. To identify a relationship between sentences, you need to compare vectors. Typically this is done with cosine distance between the vectors. The expectation is that if you have a collection of sentences that all talk about cats, the vectors that represent them will exist in a related neighborhood in the metric space.
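As a sketch of that neighborhood idea, here is a tiny nearest-neighbor lookup over hand-made embeddings (the sentences and numbers are invented for illustration):

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance: 0 = same direction, larger = less related."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings for a tiny collection of sentences.
collection = {
    "my cat naps all day":        np.array([0.9, 0.6, 0.1]),
    "the kitten chased a toy":    np.array([0.8, 0.7, 0.2]),
    "interest rates rose today":  np.array([0.1, 0.2, 0.9]),
}
query = np.array([0.85, 0.65, 0.15])  # embedding of another cat sentence

nearest = min(collection, key=lambda s: cosine_dist(query, collection[s]))
print(nearest)  # lands in the "cat" neighborhood
```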
utopiah t1_jbu0qpa wrote
Still says absolutely nothing if you don't know what a cat is.
Simusid OP t1_jbu2n5w wrote
That was not the point at all.
Continuing the cat analogy, I have two different cameras. I take 20,000 pictures of the same cats with both. I have two datasets of 20,000 cats. Is one dataset superior to the other? I will build a model that tries to predict cats and see if the "quality" of one dataset is better than the other.
In this case, the OpenAI dataset appears to be slightly better.
deliciously_methodic t1_jcifdxa wrote
Thanks, very informative. Can we dumb this down further? What would a 3-dimensional embedding table look like for the following sentences? And how do we go from words to numbers, what is the algorithm?
- Bank deposit.
- Bank withdrawal.
- River bank.
Simusid OP t1_jciguq5 wrote
"Words to numbers" is the secret sauce of all the models, including the new GPT-4. Individual words are tokenized (sometimes into "word pieces"), and a vocabulary maps each token to a number. Then the model is trained on pairs of sentences A and B. Sometimes the model is shown a pair where B correctly follows A, and sometimes not. Eventually the model learns to predict what is most likely to come next.
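A toy version of that token-to-number mapping (real tokenizers use subword vocabularies with tens of thousands of entries; this tiny vocabulary is invented for illustration):

```python
# A vocabulary maps each token to an integer id; unknown words get [UNK].
vocab = {"[UNK]": 0, "he": 1, "went": 2, "to": 3, "the": 4, "bank": 5,
         "made": 6, "a": 7, "deposit": 8}

def tokenize(sentence):
    """Whole-word tokenization (real models split into subword pieces)."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]

print(tokenize("he went to the bank"))  # [1, 2, 3, 4, 5]
print(tokenize("he made a deposit"))    # [1, 6, 7, 8]
```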
"he went to the bank", "he made a deposit"
B probably follows A
"he went to the bank", "he bought a duck"
Does not.
That is one type of training to learn valid/invalid text. Another is "leave one out" training. In this case the input is a full sentence minus one word (typically).
"he went to the convenience store and bought a gallon of _____"
and the model should learn that the most common answer will probably be "milk"
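A crude sketch of that "leave one out" objective, using simple counting over an invented three-sentence corpus in place of an actual trained model:

```python
from collections import Counter

# Toy corpus; a real masked-language model learns from billions of sentences.
corpus = [
    "bought a gallon of milk",
    "bought a gallon of milk",
    "bought a gallon of paint",
]
blank_context = "bought a gallon of"

# Count which final word most often fills the blank after this context.
candidates = Counter(
    s.split()[-1] for s in corpus if s.startswith(blank_context)
)
print(candidates.most_common(1)[0][0])  # "milk"
```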
Back to your first question. In 3D, your first two embeddings should be closer together because they are similar, and both should be "far" from the third embedding.
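For example, a hand-made 3D embedding table for those three phrases might look like this (the numbers are invented; a real model learns them):

```python
import numpy as np

# Columns loosely:      [finance, transaction, nature]
bank_deposit    = np.array([0.9, 0.8, 0.0])
bank_withdrawal = np.array([0.9, 0.7, 0.1])
river_bank      = np.array([0.1, 0.0, 0.9])

print(np.linalg.norm(bank_deposit - bank_withdrawal))  # small distance
print(np.linalg.norm(bank_deposit - river_bank))       # much larger distance
```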