Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.411
DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Abstract: Transformer-based QA models use input-wide self-attention (i.e., across both the question and the input passage) at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text …
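The decomposition described in the abstract can be pictured as a two-stage encoder: lower layers run self-attention separately over the question and the passage, and only the upper layers attend over the concatenated input. The PyTorch sketch below is illustrative only; the class and attribute names (DecomposedEncoder, lower_layers, upper_layers) and the layer counts are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class DecomposedEncoder(nn.Module):
    """Lower layers attend within question/passage only; upper layers attend jointly."""

    def __init__(self, d_model=768, n_heads=12, n_lower=9, n_upper=3):
        super().__init__()
        self.lower_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_lower)
        )
        self.upper_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_upper)
        )

    def forward(self, question, passage):
        # question: (B, Lq, d), passage: (B, Lp, d) token representations
        q, p = question, passage
        for layer in self.lower_layers:
            q = layer(q)  # question-wide self-attention only
            p = layer(p)  # passage-wide self-attention only
        x = torch.cat([q, p], dim=1)
        for layer in self.upper_layers:
            x = layer(x)  # full (input-wide) self-attention
        return x
```

Because the lower layers never mix question and passage tokens, the passage half of this computation does not depend on the question, which is what makes precomputing passage representations possible.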

Cited by 48 publications (45 citation statements). References 28 publications.
“…While knowledge distillation on output logits is most commonly used to train smaller BERT models (Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2020; Zhao et al., 2019b; Cao et al., 2020; Sun et al., 2020b; Song et al., 2020; Mao et al., 2020; Ding and Yang, 2020; Noach and Goldberg, 2020), the student does not need to be a smaller version of BERT or even a Transformer, and can follow a completely different architecture. Below we describe the two commonly used replacements:…”
Section: Knowledge Distillation (mentioning; confidence: 99%)
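For context on the "distillation on output logits" setup this survey statement refers to, here is a minimal sketch of the standard soft-label loss; the function name and temperature value are illustrative assumptions, not taken from any of the cited papers.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student output distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t ** 2)
```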
“…Attention Decomposition. It has been shown that computing attention over the entire sentence makes a large number of redundant computations (Tay et al., 2020; Cao et al., 2020). Thus, it has been proposed to compute attention in smaller groups, by either binning tokens using spatial locality (Cao et al., 2020), magnitude-based locality (Kitaev et al., 2020), or an adaptive attention span (Tambe et al., 2020).…”
Section: The Reduction In Model Size and Runtime Memory Use Is Sizable If C (mentioning; confidence: 99%)
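As an illustration of the "smaller groups" idea mentioned above, the sketch below computes self-attention only inside fixed-size spatial bins. The function name, block size, and the assumption that the sequence length divides evenly are all illustrative, not drawn from the cited papers.

```python
import torch


def block_local_attention(q, k, v, block_size=64):
    """q, k, v: (batch, seq_len, dim); seq_len assumed divisible by block_size."""
    b, n, d = q.shape
    nb = n // block_size
    # Reshape so attention is computed independently inside each block
    q, k, v = (x.reshape(b, nb, block_size, d) for x in (q, k, v))
    scores = torch.einsum("bnqd,bnkd->bnqk", q, k) / d ** 0.5
    weights = scores.softmax(dim=-1)
    out = torch.einsum("bnqk,bnkd->bnqd", weights, v)
    return out.reshape(b, n, d)
```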
“…DeFormer (Cao et al., 2020) is designed for question answering and encodes questions and passages separately in the lower layers. It precomputes all the passage representations and reuses them to speed up inference.…”
Section: Baselines (mentioning; confidence: 99%)
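The precompute-and-reuse pattern this statement describes can be sketched as follows, reusing the illustrative DecomposedEncoder from above; the caching interface (a dict keyed by passage id) is an assumption for the sketch, not the authors' implementation.

```python
import torch


@torch.no_grad()
def precompute_passage_cache(encoder, passages):
    """Run only the lower (passage-wide) layers once per passage and store the result."""
    cache = {}
    for pid, passage_emb in passages.items():  # passage_emb: (1, Lp, d)
        p = passage_emb
        for layer in encoder.lower_layers:
            p = layer(p)
        cache[pid] = p
    return cache


@torch.no_grad()
def answer(encoder, question_emb, pid, cache):
    """At inference, encode only the question in the lower layers, then join with the cached passage."""
    q = question_emb  # (1, Lq, d)
    for layer in encoder.lower_layers:
        q = layer(q)
    x = torch.cat([q, cache[pid]], dim=1)
    for layer in encoder.upper_layers:
        x = layer(x)
    return x
```

At query time only the (typically short) question passes through the lower layers, so most of the lower-layer compute over the passage is paid once offline rather than per query.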
“…Question answering (QA) is an important natural language processing task in which a model answers questions based on its understanding of them. Several QA tasks such as ARC [1], SQuAD [2], and HotpotQA [3] were recently proposed, and many QA models based on pre-trained language models have been developed to solve these tasks [4][5][6][7]. In these QA tasks, the questions are in general prepared without consideration of difficulty.…”
Section: Introduction (mentioning; confidence: 99%)