Submitted by itsyourboiirow in r/MachineLearning
DinosParkour wrote
Dense Retrieval (DR) means that you encode your document as a (collection of) dense vector(s)*. These days, this is typically done with the encoder of a pre-trained language model, such as (Distil)BERT or T5 (or even GPT if you're OpenAI [1]). Since you have dense representations, you can no longer use an inverted index, whose efficiency comes from the fact that most words only appear in a few documents. Instead, DR relies on methods such as Approximate Nearest Neighbor (ANN) search, with libraries like FAISS [2], to find the document embeddings closest to that of your query in the high-dimensional embedding space.
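A minimal sketch of that pipeline, assuming the `sentence-transformers` and `faiss` packages (the model checkpoint and toy corpus are placeholder choices, not anything specific to the papers above):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder dense encoder; any bi-encoder checkpoint would do here.
encoder = SentenceTransformer("msmarco-distilbert-base-v4")

docs = [
    "FAISS is a library for efficient similarity search.",
    "BM25 is a classic bag-of-words ranking function.",
    "Kibble brands ranked by protein content for your puppy.",
]

# One dense vector per document (the single-vector, [CLS]-style setup).
doc_vecs = np.asarray(encoder.encode(docs, normalize_embeddings=True), dtype="float32")

# Exact inner-product search for clarity; at scale you'd swap in a real
# ANN index such as faiss.IndexHNSWFlat or faiss.IndexIVFFlat.
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

query_vec = np.asarray(
    encoder.encode(["what should I feed my dog"], normalize_embeddings=True),
    dtype="float32",
)
scores, ids = index.search(query_vec, 2)
print(ids, scores)  # the dog-food doc can match despite zero word overlap
```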
In contrast, Sparse Retrieval (SR) projects the document to a sparse vector -- as the name suggests -- whose dimensions typically align with the vocabulary of the document's language. This can be done with traditional Bag-of-Words methods such as TF-IDF or BM25, but as Transformers have taken over this field as well, you'll see approaches like SPLADE [3], where a neural model infers which vocabulary terms are relevant to a document even if they're not present in it. This addresses the lexical gap, one of the shortcomings of traditional SR: a term can be highly relevant to a document despite never being mentioned verbatim (think of a page that's about dog food without ever mentioning the word "dog").
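To make the lexical gap concrete, here's a toy BM25 sketch using the `rank_bm25` package (the corpus, query, and whitespace tokenization are all simplifying assumptions). The genuinely relevant document scores zero because it never contains the query words, while an unrelated one wins on a verbatim match; SPLADE-style expansion would instead put weight on "dog" in the first document's sparse vector:

```python
from rank_bm25 import BM25Okapi

docs = [
    "kibble brands ranked by protein content for your puppy",  # about dog food
    "the cafeteria food was ranked worst in the city",         # not about dogs
]
tokenized_docs = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_docs)

query = "dog food".lower().split()
print(bm25.get_scores(query))
# -> [0.0, >0.0]: the irrelevant doc wins on an exact match of "food",
#    while the dog-food doc gets no credit -- the lexical gap in action.
```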
* Most common DR setups embed the whole passage/document as a single vector, similar to a [CLS] representation in NLP. However, late-interaction models such as ColBERT [4] or AligneR [5] sidestep the problem of deciding what to cram into a single fixed-size vector by computing one embedding per token instead, and then aggregating them when computing the query-doc similarity (e.g., ColBERT's MaxSim operator matches each query token against its most similar document token).
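A toy numpy sketch of that late-interaction scoring (the random matrices stand in for real per-token encoder outputs):

```python
import numpy as np

def maxsim_score(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """ColBERT-style MaxSim: match each query token embedding to its most
    similar document token embedding, then sum over query tokens."""
    sims = query_toks @ doc_toks.T        # (n_query, n_doc) token similarities
    return float(sims.max(axis=1).sum())  # best doc match per query token

# Stand-ins for per-token embeddings from a BERT-style encoder.
rng = np.random.default_rng(0)
query_toks = rng.standard_normal((4, 128))   # 4 query tokens, dim 128
doc_toks = rng.standard_normal((120, 128))   # 120 document tokens

print(maxsim_score(query_toks, doc_toks))
```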
[1] https://arxiv.org/abs/2201.10005
[2] https://github.com/facebookresearch/faiss/
[3] https://arxiv.org/abs/2107.05720
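[4] https://arxiv.org/abs/2004.12832
[5] https://arxiv.org/abs/2211.01267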
itsyourboiirow OP wrote
Thanks for the in-depth response!