Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.411
DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Abstract: Transformer-based QA models use input-wide self-attention (i.e., across both the question and the input passage) at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text …
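The decomposition described in the abstract can be pictured as a two-stage encoder: lower layers run self-attention separately over the question and the passage, and only the upper layers attend over the concatenated input. The PyTorch sketch below is illustrative only; the class and attribute names (DecomposedEncoder, lower_layers, upper_layers) and the layer counts are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class DecomposedEncoder(nn.Module):
    """Lower layers attend within question/passage only; upper layers attend jointly."""

    def __init__(self, d_model=768, n_heads=12, n_lower=9, n_upper=3):
        super().__init__()
        self.lower_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_lower)
        )
        self.upper_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_upper)
        )

    def forward(self, question, passage):
        # question: (B, Lq, d), passage: (B, Lp, d) token representations
        q, p = question, passage
        for layer in self.lower_layers:
            q = layer(q)  # question-wide self-attention only
            p = layer(p)  # passage-wide self-attention only
        x = torch.cat([q, p], dim=1)
        for layer in self.upper_layers:
            x = layer(x)  # full (input-wide) self-attention
        return x
```

Because the lower layers never mix question and passage tokens, the passage half of this computation does not depend on the question, which is what makes precomputing passage representations possible.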

Cited by 48 publications (45 citation statements). References 28 publications.
“…While knowledge distillation on output logits is most commonly used to train smaller BERT models (Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2020; Zhao et al., 2019b; Cao et al., 2020; Sun et al., 2020b; Song et al., 2020; Mao et al., 2020; Ding and Yang, 2020; Noach and Goldberg, 2020), the student does not need to be a smaller version of BERT or even a Transformer, and can follow a completely different architecture. Below we describe the two commonly used replacements:…”
Section: Knowledge Distillation (mentioning; confidence: 99%)
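For context on the "distillation on output logits" setup this survey statement refers to, here is a minimal sketch of the standard soft-label loss; the function name and temperature value are illustrative assumptions, not taken from any of the cited papers.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student output distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)
```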
“…Attention Decomposition. It has been shown that computing attention over the entire sentence makes a large number of redundant computations (Tay et al., 2020; Cao et al., 2020). Thus, it has been proposed to compute attention in smaller groups, by either binning tokens using spatial locality (Cao et al., 2020), magnitude-based locality (Kitaev et al., 2020), or an adaptive attention span (Tambe et al., 2020).…”
Section: The Reduction In Model Size and Runtime Memory Use Is Sizable If C (mentioning; confidence: 99%)
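As an illustration of the "smaller groups" idea mentioned above, the sketch below computes self-attention only inside fixed-size spatial bins. The function name, block size, and the assumption that the sequence length divides evenly are all illustrative, not drawn from the cited papers.

```python
import torch


def block_local_attention(q, k, v, block_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len assumed divisible by block_size."""
    b, n, d = q.shape
    nb = n // block_size
    # Reshape so attention is computed independently inside each block
    q, k, v = (x.reshape(b, nb, block_size, d) for x in (q, k, v))
    scores = torch.einsum("bnqd,bnkd->bnqk", q, k) / d ** 0.5
    weights = scores.softmax(dim=-1)
    out = torch.einsum("bnqk,bnkd->bnqd", weights, v)
    return out.reshape(b, n, d)
```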
“…DeFormer (Cao et al., 2020) is designed for question answering and encodes questions and passages separately in the lower layers. It precomputes all the passage representations and reuses them to speed up inference.…”
Section: Baselines (mentioning; confidence: 99%)
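The precompute-and-reuse pattern this statement describes can be sketched as follows, reusing the illustrative DecomposedEncoder from above; the caching interface (a dict keyed by passage id) is an assumption for the sketch, not the authors' implementation.

```python
import torch


@torch.no_grad()
def precompute_passage_cache(encoder, passages):
    """Run only the lower (passage-wide) layers once per passage and store the result."""
    cache = {}
    for pid, passage_emb in passages.items():  # passage_emb: (1, Lp, d)
        p = passage_emb
        for layer in encoder.lower_layers:
            p = layer(p)
        cache[pid] = p
    return cache


@torch.no_grad()
def answer(encoder, question_emb, pid, cache):
    """At inference, encode only the question in the lower layers, then join with the cached passage."""
    q = question_emb  # (1, Lq, d)
    for layer in encoder.lower_layers:
        q = layer(q)
    x = torch.cat([q, cache[pid]], dim=1)
    for layer in encoder.upper_layers:
        x = layer(x)
    return x
```

At query time only the (typically short) question passes through the lower layers, so most of the lower-layer compute over the passage is paid once offline rather than per query.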
“…Question answering (QA) is an important natural language processing task in which a model answers questions based on its understanding of them. Several QA tasks such as ARC [1], SQuAD [2], and HotpotQA [3] were recently proposed, and many QA models based on pre-trained language models have been developed to solve these tasks [4][5][6][7]. In these QA tasks, the questions are in general prepared without consideration of difficulty.…”
Section: Introduction (mentioning; confidence: 99%)