Comments


marcingrzegzhik t1_j6aic0m wrote

If you are looking for product-query similarity, you could try using a Word2Vec model. Train one on your dataset, then average the word vectors of each product title and each user query into a single vector per text and compare those. That gives you a concrete similarity score rather than just word-level matches.
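A minimal sketch of that, assuming gensim 4.x; the corpus, tokens, and hyperparameters here are toy placeholders, not your data:

```python
# Train Word2Vec on tokenized titles/queries, then compare averaged vectors.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    ["wireless", "bluetooth", "headphones"],
    ["usb", "c", "charging", "cable"],
    ["noise", "cancelling", "earbuds"],
]
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=50)

def avg_vector(tokens, model):
    # Average the vectors of in-vocabulary tokens into one text vector
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

title = avg_vector(["bluetooth", "headphones"], model)
query = avg_vector(["wireless", "earbuds"], model)

# Cosine similarity between the two averaged vectors
print(np.dot(title, query) / (np.linalg.norm(title) * np.linalg.norm(query)))
```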

You can also try an embedding-based approach, such as an embedding layer in a neural network. This would let you learn more complex relationships between product titles and user queries.
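For instance, a two-tower setup with a shared embedding layer. This is only a sketch in PyTorch; the vocabulary size, dimensions, token ids, and contrastive loss are all illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, dim=64):
        super().__init__()
        # EmbeddingBag averages token embeddings into one vector per text
        self.embed = nn.EmbeddingBag(vocab_size, dim, mode="mean")

    def forward(self, token_ids):
        return self.embed(token_ids)

encoder = TextEncoder()
title_ids = torch.tensor([[1, 42, 7]])   # token ids for a product title
query_ids = torch.tensor([[42, 99, 3]])  # token ids for a user query

title_vec = encoder(title_ids)
query_vec = encoder(query_ids)

# Contrastive objective: push matching title/query pairs together
loss_fn = nn.CosineEmbeddingLoss()
loss = loss_fn(title_vec, query_vec, torch.tensor([1.0]))  # 1 = matching pair
loss.backward()
```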

You could also try a matrix factorization technique such as Singular Value Decomposition (SVD) or Non-Negative Matrix Factorization (NMF). These methods can help you identify latent features in your dataset, which can then be used to generate better recommendations.
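A minimal sketch of the SVD variant with scikit-learn, using a tiny placeholder corpus and an arbitrary number of components:

```python
# Factor a TF-IDF matrix of titles/queries with truncated SVD, then
# compare texts in the latent space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "wireless bluetooth headphones",
    "usb c charging cable",
    "noise cancelling earbuds",
    "wireless earbuds",  # a user query appended to the corpus
]

tfidf = TfidfVectorizer().fit_transform(texts)
latent = TruncatedSVD(n_components=2).fit_transform(tfidf)

# Similarity of the query (last row) to each product title
print(cosine_similarity(latent[-1:], latent[:-1]))
```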

Hope this helps!

2

curiousshortguy t1_j6ajise wrote

> You can also try using an embedding-based approach, such as using an embedding layer in a neural network. This would enable you to learn more complex relationships between product titles and user queries.

He's already doing that with BERT.

2

curiousshortguy t1_j6ak1cj wrote

Why are you using Euclidean distance? Use cosine distance. The former cares about vector magnitude, the latter doesn't. As a general rule of thumb when comparing vector embeddings, you don't care about magnitude; at best it captures document length.
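A quick illustration of the difference, with made-up vectors standing in for embeddings:

```python
# Scaling a vector changes Euclidean distance but not cosine distance.
import numpy as np
from scipy.spatial.distance import cosine, euclidean

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, larger magnitude (e.g. a longer document)

print(euclidean(a, b))  # large: dominated by the magnitude difference
print(cosine(a, b))     # ~0.0: same direction, so treated as the same
```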

Do you have more than product titles, such as product descriptions? Where do the user queries come from? Do you use the default BERT tokenizer?

3