Submitted by itsyourboiirow t3_z76uel in MachineLearning
I was looking at the BEIR dataset, and the leaderboard has two different pages, one for dense IR and one for sparse IR. I'm curious what the difference is; I googled around but couldn't find anything conclusive. Is anyone familiar with the difference, or is there somewhere I can read about it?
DinosParkour t1_iy7j1hw wrote
Dense Retrieval (DR) means that you encode your document as a (collection of) dense vector(s)*. These days this is typically done with the encoder of a pre-trained language model, such as (Distil)BERT or T5 (or even GPT if you're OpenAI [1]). Since you have dense representations, you can no longer use an inverted index, whose efficiency comes from the fact that most words appear in only a few documents. Instead, DR relies on methods such as Approximate Nearest Neighbor (ANN) search, with frameworks like FAISS [2], to find the high-dimensional document embeddings closest to that of your query.
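Here's a minimal sketch of that pipeline, assuming you already have document and query embeddings (random vectors below stand in for the output of a real PLM encoder):

```python
import numpy as np
import faiss

d = 768                                    # embedding dimensionality
docs = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(docs)                   # unit-norm, so L2 ranking == cosine ranking

index = faiss.IndexHNSWFlat(d, 32)         # approximate NN search via an HNSW graph
index.add(docs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
distances, ids = index.search(query, 5)    # top-5 closest documents
print(ids[0], distances[0])
```

The choice of index type (flat brute force vs. HNSW vs. IVF, etc.) is where you trade recall against speed and memory.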
In contrast, Sparse Retrieval (SR) projects the document to a sparse vector -- as the name suggests -- whose dimensions typically correspond to the vocabulary of the document's language. This can be done with traditional Bag-of-Words methods such as TF-IDF or BM25, but as Transformers have taken over (also) this field, you'll see approaches like SPLADE [3], where a neural model infers which vocabulary terms are relevant to a document even if they don't appear in it. This addresses the lexical gap, one of the shortcomings of traditional SR: a term can be highly relevant to a document despite never being mentioned verbatim (think of a page that's about dog food without ever mentioning the word "dog"). This figure might help you visualize how neural models can be used for SR.
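As a toy illustration of BoW-style SR (sklearn's TfidfVectorizer stands in for a production BM25 index; the documents are made up), note how the dog-food document only matches the query through "food" -- exactly the lexical gap that learned SR models like SPLADE try to close:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "kibble and canned food for puppies",   # about dog food, never says "dog"
    "training your dog to sit",
    "best hiking trails in the alps",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)       # sparse matrix: n_docs x vocab_size

query_vector = vectorizer.transform(["dog food"])  # sparse query in the same space
scores = (doc_vectors @ query_vector.T).toarray().ravel()
print(scores)   # doc 0 matches only via "food"; doc 2 scores exactly 0
```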
* Most common DR setups embed the whole passage/document as a single vector, similar to a [CLS] representation in NLP. However, late-interaction models such as ColBERT [4] or AligneR [5] avoid having to choose what to cram into a fixed-size vector by computing an embedding per token instead, and then aggregating those embeddings when computing the query-doc similarity (e.g. ColBERT's MaxSim operator matches each query token with its most similar document token).
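For concreteness, a minimal sketch of ColBERT-style MaxSim scoring (the per-token embeddings are random stand-ins for real encoder outputs):

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style score: sum over query tokens of the max
    cosine similarity against any document token."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                    # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()     # best-matching doc token per query token

rng = np.random.default_rng(0)
query = rng.standard_normal((8, 128))    # 8 query tokens, 128-dim embeddings
doc = rng.standard_normal((120, 128))    # 120 document tokens
print(maxsim_score(query, doc))
```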
[1] https://arxiv.org/abs/2201.10005
[2] https://github.com/facebookresearch/faiss/
[3] https://arxiv.org/abs/2107.05720
[4] https://arxiv.org/abs/2004.12832
[5] https://arxiv.org/abs/2211.01267