Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.264
DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling

Abstract: Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However, as we show here, existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, th…
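The abstract points to knowledge distillation (Hinton et al., 2015) as the standard remedy for BERT's inference cost. As a quick illustration of that baseline technique (not DiPair's specific training objective, which the truncated abstract does not spell out), here is a minimal sketch of a temperature-softened distillation loss; the function name and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss in the style of Hinton et al. (2015).

    Combines a KL term between temperature-softened teacher and student
    distributions with the usual cross-entropy on the gold labels.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence, scaled by T^2 so gradients keep a comparable magnitude.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * (temperature ** 2)

    # Standard supervised loss on the hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce
```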


Cited by 17 publications (17 citation statements). References 23 publications.
“…Late-interaction models. The idea of running several transformer layers for the document and the query independently, and then combining them in the last transformer layers, was developed concurrently by multiple teams: PreTTR [29], EARL [12], DC-BERT [33], DiPair [5], and the Deformer [4]. These works show that only a few layers where the query and document interact are sufficient to achieve results close to the performance of a full BERT ranker at a fraction of the runtime cost.…”
Section: Related Work (mentioning)
confidence: 99%
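The excerpt above describes late-interaction models that run the lower transformer layers on the query and the document independently and let only the last few layers attend across the pair. The sketch below illustrates that split in PyTorch; it is a schematic stand-in, not the published PreTTR/EARL/DC-BERT/DiPair/Deformer code, and the layer counts, hidden size, and [CLS]-style readout are assumptions.

```python
import torch
import torch.nn as nn

class LateInteractionRanker(nn.Module):
    """Minimal late-interaction encoder: the first L transformer layers run
    on the query and the document independently (so document states can be
    precomputed offline), and only the remaining T - L layers attend across
    the concatenated pair."""

    def __init__(self, d_model=384, n_heads=6, total_layers=12, independent_layers=9):
        super().__init__()
        self.lower = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(independent_layers)]
        )
        self.upper = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(total_layers - independent_layers)]
        )
        self.score = nn.Linear(d_model, 1)

    def encode_independently(self, x):
        # x: (batch, seq_len, d_model) token embeddings for a query OR a document.
        for blk in self.lower:
            x = blk(x)
        return x  # document states can be cached offline

    def forward(self, query_states, doc_states):
        # Concatenate the independently encoded sequences and let the
        # remaining layers attend across the pair.
        joint = torch.cat([query_states, doc_states], dim=1)
        for blk in self.upper:
            joint = blk(joint)
        # Score from the first (query) position, akin to a [CLS] readout.
        return self.score(joint[:, 0]).squeeze(-1)
```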
“…Our work is based on the late-interaction architecture [4,5,13,29,33], which separates BERT into L independent layers for the documents and the queries, and T - L interleaving layers, where T is the total number of layers in the original model, e.g., 12 for BERT-Base. Naively storing all document embeddings consumes a huge amount of storage with a total of m · h · 4 bytes per document, where m is the average number of tokens per document and h is the model hidden size (384 for the distilled version we use).…”
Section: Succinct Document Representation (SDR) (mentioning)
confidence: 99%
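The excerpt quotes a per-document storage cost of m · h · 4 bytes for naively cached token embeddings. The snippet below just works that arithmetic through; the hidden size h = 384 comes from the excerpt, while the average document length and corpus size are illustrative assumptions.

```python
# Back-of-the-envelope storage for naively caching per-token document
# embeddings, following the m * h * 4 bytes-per-document figure above.
# The average length and corpus size below are illustrative assumptions.

hidden_size = 384          # h: model hidden size quoted in the excerpt
avg_tokens_per_doc = 128   # m: assumed average tokens per document
bytes_per_float = 4        # float32

per_doc_bytes = avg_tokens_per_doc * hidden_size * bytes_per_float
print(f"per document: {per_doc_bytes / 1024:.1f} KiB")      # 192.0 KiB

num_docs = 10_000_000      # assumed corpus of 10M documents
total_gib = num_docs * per_doc_bytes / (1024 ** 3)
print(f"corpus total: {total_gib:.1f} GiB")                 # ~1831 GiB
```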
“…Several retrieval approaches have been proposed that apply lightweight query-document scoring on last-layer Transformer features. These consist of multi-vector dual encoders (Luan et al., 2020; Khattab and Zaharia, 2020; Li et al., 2020) that emit multiple query and document vectors which interact via dot products, and multi-layer attention architectures (Gao et al., 2020; Chen et al., 2020; …”
Section: Introduction (mentioning)
confidence: 99%
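The excerpt mentions multi-vector dual encoders whose query and document vectors interact via dot products. Below is a minimal ColBERT-style MaxSim scorer as one concrete instance of that kind of interaction; it is a generic sketch, not the exact scoring function of any of the cited models, and the vector dimensions are arbitrary.

```python
import torch

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction via dot products over per-token vectors.

    query_vecs: (num_query_tokens, dim) -- one vector per query token
    doc_vecs:   (num_doc_tokens, dim)   -- one vector per document token

    Each query vector is matched to its best document vector by dot
    product, and the maxima are summed into a single relevance score.
    """
    # (num_query_tokens, num_doc_tokens) matrix of dot products.
    sim = query_vecs @ doc_vecs.T
    # Best-matching document token for each query token, then sum.
    return sim.max(dim=1).values.sum()

# Toy usage with random vectors standing in for encoder outputs.
q = torch.randn(8, 128)
d = torch.randn(200, 128)
print(maxsim_score(q, d))
```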