Submitted by itsyourboiirow in r/MachineLearning
DinosParkour wrote
Dense Retrieval (DR) means that you encode your document as a (collection of) dense vector(s)*. These days, this is typically done with the encoder of a pre-trained language model, such as (Distil)BERT or T5 (or even GPT if you're OpenAI [1]). Since you have dense representations, you can no longer use an inverted index, whose efficiency comes from the fact that most words only appear in a few documents. Instead, DR relies on methods such as Approximate Nearest Neighbor (ANN) search, with libraries like FAISS [2], to find the document embeddings closest to that of your query in the high-dimensional embedding space.
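A minimal sketch of that pipeline, assuming the `sentence-transformers` and `faiss` packages (the model checkpoint and toy corpus are placeholder choices, not anything specific to the papers above):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder dense encoder; any bi-encoder checkpoint would do here.
encoder = SentenceTransformer("msmarco-distilbert-base-v4")

docs = [
    "FAISS is a library for efficient similarity search.",
    "BM25 is a classic bag-of-words ranking function.",
    "Kibble brands ranked by protein content for your puppy.",
]

# One dense vector per document (the single-vector, [CLS]-style setup).
doc_vecs = np.asarray(encoder.encode(docs, normalize_embeddings=True), dtype="float32")

# Exact inner-product search for clarity; at scale you'd swap in a real
# ANN index such as faiss.IndexHNSWFlat or faiss.IndexIVFFlat.
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

query_vec = np.asarray(
    encoder.encode(["what should I feed my dog"], normalize_embeddings=True),
    dtype="float32",
)
scores, ids = index.search(query_vec, 2)
print(ids, scores)  # the dog-food doc can match despite zero word overlap
```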
In contrast, Sparse Retrieval (SR) projects the document to a sparse vector -- as the name suggests -- whose dimensions typically align with the vocabulary of the document's language. This can be done with traditional Bag-of-Words methods such as TF-IDF or BM25, but as Transformers have taken over this field as well, you'll see approaches like SPLADE [3], where a neural model infers which vocabulary terms are relevant to a document even if they're not present in it. This addresses the lexical gap, one of the shortcomings of traditional SR: a term can be highly relevant to a document despite never being mentioned verbatim (think of a page that's about dog food without ever mentioning the word "dog").
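To make the lexical gap concrete, here's a toy BM25 sketch using the `rank_bm25` package (the corpus, query, and whitespace tokenization are all simplifying assumptions). The genuinely relevant document scores zero because it never contains the query words, while an unrelated one wins on a verbatim match; SPLADE-style expansion would instead put weight on "dog" in the first document's sparse vector:

```python
from rank_bm25 import BM25Okapi

docs = [
    "kibble brands ranked by protein content for your puppy",  # about dog food
    "the cafeteria food was ranked worst in the city",         # not about dogs
]
tokenized_docs = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized_docs)

query = "dog food".lower().split()
print(bm25.get_scores(query))
# -> [0.0, >0.0]: the irrelevant doc wins on an exact match of "food",
#    while the dog-food doc gets no credit -- the lexical gap in action.
```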
* Most common DR setups embed the whole passage/document as a single vector, similar to a [CLS] representation in NLP. However, late-interaction models such as ColBERT [4] or AligneR [5] sidestep the problem of deciding what to cram into a single fixed-size vector by computing one embedding per token instead, and then aggregating them when computing the query-doc similarity (e.g., ColBERT's MaxSim operator matches each query token against its most similar document token).
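A toy numpy sketch of that late-interaction scoring (the random matrices stand in for real per-token encoder outputs):

```python
import numpy as np

def maxsim_score(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """ColBERT-style MaxSim: match each query token embedding to its most
    similar document token embedding, then sum over query tokens."""
    sims = query_toks @ doc_toks.T        # (n_query, n_doc) token similarities
    return float(sims.max(axis=1).sum())  # best doc match per query token

# Stand-ins for per-token embeddings from a BERT-style encoder.
rng = np.random.default_rng(0)
query_toks = rng.standard_normal((4, 128))   # 4 query tokens, dim 128
doc_toks = rng.standard_normal((120, 128))   # 120 document tokens

print(maxsim_score(query_toks, doc_toks))
```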
[1] https://arxiv.org/abs/2201.10005
[2] https://github.com/facebookresearch/faiss/
[3] https://arxiv.org/abs/2107.05720
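[4] https://arxiv.org/abs/2004.12832
[5] https://arxiv.org/abs/2211.01267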
itsyourboiirow OP wrote
Thanks for the in-depth response!