Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume 2021
DOI: 10.18653/v1/2021.eacl-main.8

BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression

Abstract: The slow speed of BERT has motivated much research on accelerating its inference, and the early exiting idea has been proposed to make trade-offs between model quality and efficiency. This paper aims to address two weaknesses of previous work: (1) existing fine-tuning strategies for early exiting models fail to take full advantage of BERT; (2) methods to make exiting decisions are limited to classification tasks. We propose a more advanced fine-tuning strategy and a learning-to-exit module that extends early exiting…
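The classification variant of early exiting that the paper builds on attaches a lightweight classifier to every transformer layer and stops the forward pass once some layer's prediction is confident enough. Below is a minimal PyTorch sketch of that inference loop; the `layers` and `heads` containers, the `[CLS]`-pooling choice, and the 0.9 threshold are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_classify(layers, heads, hidden, threshold=0.9):
    """layers: per-layer transformer blocks (callables on hidden states);
    heads: per-layer classifier heads; hidden: (1, seq_len, dim) embeddings.
    Assumes batch size 1 so a single confidence value decides the exit."""
    label = None
    for layer, head in zip(layers, heads):
        hidden = layer(hidden)                         # run one more layer
        probs = F.softmax(head(hidden[:, 0]), dim=-1)  # classify from [CLS]
        confidence, label = probs.max(dim=-1)
        if confidence.item() >= threshold:             # confident enough: exit
            break
    return label.item()
```

The paper's learning-to-exit module replaces the softmax-confidence test with a learned halting score, which is what lets the mechanism carry over to regression tasks where class probabilities are unavailable.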

Cited by 57 publications (61 citation statements) · References 20 publications
“…Our results have direct implications for the use of BERT as a knowledge base. By effectively choosing layers to query and adopting early exiting strategies (Xin et al., 2020, 2021), knowledge base completion can be improved. The performance of RANK-MSMARCO also warrants further investigation into ranking models with different training objectives: pointwise (regression) vs. pairwise vs. listwise.…”
Section: Discussion
confidence: 99%
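"Choosing layers to query" can be prototyped directly with the HuggingFace transformers library, which exposes every intermediate hidden state. The sketch below illustrates that access pattern only; the model name and the choice of layer 9 are arbitrary assumptions, not the cited papers' setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)

inputs = tok("Paris is the capital of France.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states holds 13 tensors: the embedding output plus one per
# layer. A probe or exit head can read any of them, not only the last.
intermediate = out.hidden_states[9]   # e.g., query layer 9 representations
```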
“…Before-prediction (Elbayad et al., 2020; Xin et al., 2021): take the features as input and generate a label deciding whether to execute the forward process.…”
Section: MLP(ŷ)
confidence: 99%
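As a concrete illustration of such a "before-prediction" decision module: a tiny learned gate reads the current layer's features and emits a halting probability. This is a hedged PyTorch sketch with invented names and sizes, not the cited implementations.

```python
import torch
import torch.nn as nn

class ExitGate(nn.Module):
    """Tiny halting module: reads pooled features from the current layer
    and scores whether the forward pass can stop here."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # one "safe to exit?" logit

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim); returns (batch,) exit probabilities.
        return torch.sigmoid(self.score(pooled)).squeeze(-1)
```

Because the gate scores halting directly instead of reading softmax confidence, the same mechanism applies to regression heads, where class probabilities do not exist.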
“…Mixture-of-experts-style dynamic networks are representative dynamic models (Lepikhin et al., 2021; Lin et al., 2021; Fedus et al., 2021). In those models, a layer contains multiple experts, and only a subset of these experts is activated for each instance.…”
Section: MLP(ŷ)
confidence: 99%
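A minimal sketch of that routing idea, with illustrative names and shapes: a linear router scores the experts for each token and only the top-k are evaluated, so most of the layer's parameters stay inactive for any given input. Production systems such as GShard and Switch Transformer add capacity limits and load-balancing losses that are omitted here.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)  # scores experts per token
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, dim); each token is routed to its top-k experts only.
        gates = torch.softmax(self.router(x), dim=-1)  # (n_tokens, n_experts)
        weights, chosen = gates.topk(self.k, dim=-1)   # (n_tokens, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = chosen[:, slot] == e            # tokens picking expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# usage: y = MoELayer(dim=768)(torch.randn(16, 768))
```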
“…Alternative methods are concerned with strategies for pruning the ensemble during or after the training phase [8,9,11], and with budget-aware learning-to-rank algorithms [1,13]. Furthermore, researchers have investigated early termination heuristics aimed at reducing, on a document- or query-level basis, the cost of the scoring process [3,12,15]. These works studied the impact of the proposed early termination strategies on both latency and ranking accuracy.…”
Section: Introduction
confidence: 99%
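To make the document-level flavor of early termination concrete, here is a hedged sketch (plain Python, names invented, not a specific algorithm from the cited work): an additive ensemble stops scoring a document as soon as an optimistic bound on its final score falls below the current top-k cutoff.

```python
def score_with_early_exit(trees, doc_features, cutoff, suffix_upper_bounds):
    """trees: callables mapping doc_features -> a partial score;
    suffix_upper_bounds[i]: max total contribution of trees[i:];
    cutoff: lowest score currently in the top-k results."""
    score = 0.0
    for i, tree in enumerate(trees):
        # Even if the remaining trees all contribute their maximum,
        # can this document still reach the cutoff? If not, stop now.
        if score + suffix_upper_bounds[i] < cutoff:
            return None   # document cannot make the top-k: exit early
        score += tree(doc_features)
    return score
```

The trade-off studied in these works is visible in the sketch: a looser bound (or lower cutoff) scores more trees and preserves ranking accuracy, while a tighter one terminates earlier and reduces latency.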