Submitted by itsyourboiirow t3_z76uel in MachineLearning
I was looking at the BEIR dataset, and the leaderboard has two different pages, one for dense IR and one for sparse IR. I'm curious what the difference is; I googled around but couldn't find anything conclusive. Is anyone familiar with the difference, or is there somewhere I can read about it?
DinosParkour t1_iy7j1hw wrote
Dense Retrieval (DR) means that you encode your document as a (collection of) dense vector(s)*. These days this is typically done with the encoder of a pre-trained language model, such as (Distil)BERT or T5 (or even GPT if you're OpenAI [1]). Since you have dense representations, you can no longer use an inverted index, whose efficiency comes from the fact that most words appear in only a few documents. Instead, DR relies on methods such as Approximate Nearest Neighbor (ANN) search, with frameworks like FAISS [2], to find the high-dimensional document embeddings closest to that of your query.
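Here's a minimal sketch of that pipeline, assuming you already have document and query embeddings (random vectors below stand in for the output of a real PLM encoder):

```python
import numpy as np
import faiss

d = 768                                    # embedding dimensionality
docs = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(docs)                   # unit-norm, so L2 ranking == cosine ranking

index = faiss.IndexHNSWFlat(d, 32)         # approximate NN search via an HNSW graph
index.add(docs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)    # top-5 closest documents
print(ids[0], distances[0])
```

The choice of index type (flat brute force vs. HNSW vs. IVF, etc.) is where you trade recall against speed and memory.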
In contrast, Sparse Retrieval (SR) projects the document to a sparse vector -- as the name suggests -- whose dimensions typically correspond to the vocabulary of the document's language. This can be done with traditional Bag-of-Words methods such as TF-IDF or BM25, but as Transformers have taken over (also) this field, you'll see approaches like SPLADE [3], where a neural model infers which vocabulary terms are relevant to a document even if they don't appear in it. This addresses the lexical gap, one of the shortcomings of traditional SR: a term can be highly relevant to a document despite never being mentioned verbatim (think of a page that's about dog food without ever mentioning the word "dog"). This figure might help you visualize how neural models can be used for SR.
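As a toy illustration of BoW-style SR (sklearn's TfidfVectorizer stands in for a production BM25 index; the documents are made up), note how the dog-food document only matches the query through "food" -- exactly the lexical gap that learned SR models like SPLADE try to close:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "kibble and canned food for puppies",   # about dog food, never says "dog"
    "training your dog to sit",
    "best hiking trails in the alps",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)       # sparse matrix: n_docs x vocab_size

query_vector = vectorizer.transform(["dog food"])  # sparse query in the same space
scores = (doc_vectors @ query_vector.T).toarray().ravel()
print(scores)   # doc 0 matches only via "food"; doc 2 scores exactly 0
```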
* Most common DR setups embed the whole passage/document as a single vector, similar to a [CLS] representation in NLP. However, late-interaction models such as ColBERT [4] or AligneR [5] avoid having to choose what to cram into a fixed-size vector by computing an embedding per token instead, and then aggregating those embeddings when computing the query-doc similarity (e.g. ColBERT's MaxSim operator matches each query token with its most similar document token).
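For concreteness, a minimal sketch of ColBERT-style MaxSim scoring (the per-token embeddings are random stand-ins for real encoder outputs):

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style score: sum over query tokens of the max
    cosine similarity against any document token."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                    # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()     # best-matching doc token per query token

rng = np.random.default_rng(0)
query = rng.standard_normal((8, 128))    # 8 query tokens, 128-dim embeddings
doc = rng.standard_normal((120, 128))    # 120 document tokens
print(maxsim_score(query, doc))
```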
[1] https://arxiv.org/abs/2201.10005
[2] https://github.com/facebookresearch/faiss/
[3] https://arxiv.org/abs/2107.05720
[4] https://arxiv.org/abs/2004.12832
[5] https://arxiv.org/abs/2211.01267