Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.264
DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling

Abstract: Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However, as we show here, existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, th…
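The abstract points to knowledge distillation (Hinton et al., 2015) as the standard remedy for BERT's inference cost. As a quick illustration of that baseline technique (not DiPair's specific training objective, which the truncated abstract does not spell out), here is a minimal sketch of a temperature-softened distillation loss; the function name and hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss in the style of Hinton et al. (2015).

    Combines a KL term between temperature-softened teacher and student
    distributions with the usual cross-entropy on the gold labels.
    """
    # Soften both distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence, scaled by T^2 so gradients keep a comparable magnitude.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    kd = kd * (temperature ** 2)

    # Standard supervised loss on the hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce
```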


Cited by 17 publications (17 citation statements). References 23 publications.
“…Late-interaction models. The idea of running several transformer layers for the document and the query independently, and then combining them in the last transformer layers, was developed concurrently by multiple teams: PreTTR [29], EARL [12], DC-BERT [33], DiPair [5], and the Deformer [4]. These works show that only a few layers where the query and document interact are sufficient to achieve results close to the performance of a full BERT ranker at a fraction of the runtime cost.…”
Section: Related Work (mentioning)
confidence: 99%
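The excerpt above describes late-interaction models that run the lower transformer layers on the query and the document independently and let only the last few layers attend across the pair. The sketch below illustrates that split in PyTorch; it is a schematic stand-in, not the published PreTTR/EARL/DC-BERT/DiPair/Deformer code, and the layer counts, hidden size, and [CLS]-style readout are assumptions.

```python
import torch
import torch.nn as nn

class LateInteractionRanker(nn.Module):
    """Minimal late-interaction encoder: the first L transformer layers run
    on the query and the document independently (so document states can be
    precomputed offline), and only the remaining T - L layers attend across
    the concatenated pair."""

    def __init__(self, d_model=384, n_heads=6, total_layers=12, independent_layers=9):
        super().__init__()
        self.lower = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(independent_layers)]
        )
        self.upper = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(total_layers - independent_layers)]
        )
        self.score = nn.Linear(d_model, 1)

    def encode_independently(self, x):
        # x: (batch, seq_len, d_model) token embeddings for a query OR a document.
        for blk in self.lower:
            x = blk(x)
        return x  # document states can be cached offline

    def forward(self, query_states, doc_states):
        # Concatenate the independently encoded sequences and let the
        # remaining layers attend across the pair.
        joint = torch.cat([query_states, doc_states], dim=1)
        for blk in self.upper:
            joint = blk(joint)
        # Score from the first (query) position, akin to a [CLS] readout.
        return self.score(joint[:, 0]).squeeze(-1)
```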
“…Our work is based on the late-interaction architecture [4,5,13,29,33], which separates BERT into L independent layers for the documents and the queries, and T - L interleaving layers, where T is the total number of layers in the original model, e.g., 12 for BERT-Base. Naively storing all document embeddings consumes a huge amount of storage with a total of m · h · 4 bytes per document, where m is the average number of tokens per document and h is the model hidden size (384 for the distilled version we use).…”
Section: Succinct Document Representation (SDR) (mentioning)
confidence: 99%
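The excerpt quotes a per-document storage cost of m · h · 4 bytes for naively cached token embeddings. The snippet below just works that arithmetic through; the hidden size h = 384 comes from the excerpt, while the average document length and corpus size are illustrative assumptions.

```python
# Back-of-the-envelope storage for naively caching per-token document
# embeddings, following the m * h * 4 bytes-per-document figure above.
# The average length and corpus size below are illustrative assumptions.

hidden_size = 384          # h: model hidden size quoted in the excerpt
avg_tokens_per_doc = 128   # m: assumed average tokens per document
bytes_per_float = 4        # float32

per_doc_bytes = avg_tokens_per_doc * hidden_size * bytes_per_float
print(f"per document: {per_doc_bytes / 1024:.1f} KiB")      # 192.0 KiB

num_docs = 10_000_000      # assumed corpus of 10M documents
total_gib = num_docs * per_doc_bytes / (1024 ** 3)
print(f"corpus total: {total_gib:.1f} GiB")                 # ~1831 GiB
```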
“…Several retrieval approaches have been proposed that apply lightweight query-document scoring on last-layer Transformer features. These consist of multi-vector dual encoders (Luan et al., 2020; Khattab and Zaharia, 2020; Li et al., 2020) that emit multiple query and document vectors which interact via dot products, and multi-layer attention architectures (Gao et al., 2020; Chen et al., 2020; …”
Section: Introduction (mentioning)
confidence: 99%
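The excerpt mentions multi-vector dual encoders whose query and document vectors interact via dot products. Below is a minimal ColBERT-style MaxSim scorer as one concrete instance of that kind of interaction; it is a generic sketch, not the exact scoring function of any of the cited models, and the vector dimensions are arbitrary.

```python
import torch

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction via dot products over per-token vectors.

    query_vecs: (num_query_tokens, dim) -- one vector per query token
    doc_vecs:   (num_doc_tokens, dim)   -- one vector per document token

    Each query vector is matched to its best document vector by dot
    product, and the maxima are summed into a single relevance score.
    """
    # (num_query_tokens, num_doc_tokens) matrix of dot products.
    sim = query_vecs @ doc_vecs.T
    # Best-matching document token for each query token, then sum.
    return sim.max(dim=1).values.sum()

# Toy usage with random vectors standing in for encoder outputs.
q = torch.randn(8, 128)
d = torch.randn(200, 128)
print(maxsim_score(q, d))
```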