2020
DOI: 10.48550/arxiv.2007.12223
Preprint

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

Abstract: In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. In this work, we combine these observations to assess whether such t…

Cited by 44 publications (52 citation statements)
References 21 publications

“…On the other hand, in structured pruning, the best subnetworks of BERT's heads do not quite reach the full model performance. (Chen et al, 2020a) shows for a range of downstream tasks, matching subnetworks at 40% to 90% sparsity exist and they are found at pretrained phase (initialization). This is dissimilar to the prior NLP research where subnetworks emerge only after some amount of training.…”
Section: The Lottery Ticket Hypothesis
Citation type: mentioning, confidence: 99%
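To make the quoted finding concrete, here is a minimal sketch (not the paper's exact procedure) of pruning a pre-trained BERT encoder to a target sparsity with one-shot global magnitude pruning; the surviving subnetwork would then be fine-tuned on a downstream task. The checkpoint name, the 60% sparsity level, and the restriction to the encoder's linear layers are illustrative assumptions.

```python
# Minimal sketch: one-shot global magnitude pruning of a pre-trained BERT
# encoder. Checkpoint name and 60% sparsity are illustrative assumptions.
import torch
import torch.nn.utils.prune as prune
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Collect the weight matrices of the encoder's linear layers.
params_to_prune = [
    (module, "weight")
    for module in model.bert.encoder.modules()
    if isinstance(module, torch.nn.Linear)
]

# Zero out the 60% smallest-magnitude weights across all collected matrices.
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.6,
)

# Fold the masks into the weights; what remains is the sparse subnetwork,
# still at its pre-trained values, ready for downstream fine-tuning.
for module, name in params_to_prune:
    prune.remove(module, name)
```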
“…In view of that, follow-up works reveal that sparsity patterns might emerge at the initialization, the early stage of training (You et al, 2019) and (Chen et al, 2020b), or in dynamic forms throughout training (Evci et al, 2020) by updating model parameters and architecture typologies simultaneously. Some of the recent findings are that the lottery ticket hypothesis holds for BERT models, i.e., largest weights of the original network do form subnetworks that can be retrained alone to reach the performance close to that of the full model (Prasanna et al, 2020; Chen et al, 2020a).…”
Section: The Lottery Ticket Hypothesis
Citation type: mentioning, confidence: 99%
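The quoted statements describe the general iterative magnitude pruning (IMP) recipe behind such lottery-ticket results: train, prune the smallest-magnitude surviving weights, rewind the survivors to the pre-trained values, and repeat. The sketch below assumes a caller-supplied fine_tune(model, masks) routine (hypothetical) that re-applies the masks after each optimizer step; the 20%-per-round schedule and pruning only weight matrices are illustrative choices.

```python
# Minimal sketch of iterative magnitude pruning with rewinding to the
# pre-trained weights. `fine_tune(model, masks)` is a hypothetical,
# caller-supplied training loop assumed to keep masked weights at zero.
import copy
import torch

def imp_with_rewinding(model, fine_tune, rounds=5, prune_per_round=0.2):
    theta_0 = copy.deepcopy(model.state_dict())  # pre-trained initialization
    masks = {
        name: torch.ones_like(param)
        for name, param in model.named_parameters()
        if param.dim() > 1  # prune weight matrices, not biases/LayerNorm
    }

    for _ in range(rounds):
        fine_tune(model, masks)  # train the currently masked subnetwork

        with torch.no_grad():
            for name, param in model.named_parameters():
                if name not in masks:
                    continue
                # Threshold: the prune_per_round quantile of the magnitudes
                # of the weights still alive in this tensor.
                alive = param.abs()[masks[name].bool()]
                threshold = torch.quantile(alive, prune_per_round)
                masks[name] *= (param.abs() > threshold).float()

            # Rewind surviving weights to the pre-trained initialization.
            model.load_state_dict(theta_0)
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])

    return masks  # defines the final subnetwork at the pre-trained init
```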
“…To address this issue, many efforts have been devoted to compressing the cumbersome transformer architectures into a lightweight counterpart, including knowledge distillation (Jiao et al, 2019; Sanh et al, 2019; Sun et al, 2019), pruning (Michel et al, 2019; Chen et al, 2020), and quantization (Zafrir et al, 2019; Bai et al, 2020; Shen et al, 2020). Among all these compression techniques, quantization is a popular solution as it still preserves the original network architecture.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
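As an illustration of why quantization preserves the architecture, the sketch below applies PyTorch's post-training dynamic quantization (a generic stand-in, not the specific methods cited above) to a BERT classifier: linear-layer weights are stored in int8 and activations are quantized on the fly, while the module graph is left unchanged. The model name is illustrative.

```python
# Minimal sketch: post-training dynamic quantization of a BERT classifier.
# Only nn.Linear weights are quantized; the architecture is untouched.
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The linear layers have been swapped in place for dynamically quantized
# counterparts; the surrounding module graph is the same as before.
print(quantized.bert.encoder.layer[0].intermediate.dense)
```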