2020 | Preprint
DOI: 10.48550/arxiv.2006.11316
SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

Abstract: Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as cruc…

Cited by 10 publications (13 citation statements) | References 56 publications
“…Note that similar to NMT and LM, except learning rate and block size, ADAHESSIAN directly uses the same hyperparameters as AdamW. Interestingly, note that these results are better than those reported in SqueezeBERT (Iandola et al 2020), even though we only change the optimizer to ADAHESSIAN instead of AdamW.…”
Section: Natural Language Understanding
Citation type: mentioning
confidence: 85%
“…Comparison of AdamW and ADAHESSIAN for SqueezeBERT on the development set of the GLUE benchmark. The result of AdamW+ is directly from (Iandola et al 2020) and the result of AdamW* is reproduced by us.
AdamW: 35.42 ± .09, 35.66 ± .11, 35.37 ± .07, 35.18 ± .07, 34.79 ± .15, 14.41 ± 13.25, 0.41 ± .32, Diverge
ADAHESSIAN: 35.33 ± .10, 35.79 ± .06, 35.21 ± .14, 34.74 ± .10, 34.19 ± .06, 33.78 ± .14, 32.70 ± .10, 32.48 ± .83…”
Citation type: mentioning
confidence: 87%
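The statements above describe a pure optimizer swap: ADAHESSIAN replaces AdamW for SqueezeBERT fine-tuning while the remaining hyperparameters (apart from learning rate and block size) stay fixed. A minimal sketch of that setup, assuming the HuggingFace transformers SqueezeBERT checkpoint and the Adahessian implementation from the torch-optimizer package (the cited work uses its own implementation); the learning rates shown are illustrative, not the values used in the cited experiments:

```python
# Sketch: fine-tune SqueezeBERT, changing only the optimizer (AdamW vs. ADAHESSIAN).
# Assumes the HuggingFace `transformers` SqueezeBERT checkpoint and the Adahessian
# implementation shipped in the `torch-optimizer` package; hyperparameters are illustrative.
import torch
from transformers import SqueezeBertForSequenceClassification, SqueezeBertTokenizer
import torch_optimizer  # assumption: `torch-optimizer` is installed and provides Adahessian

model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-uncased", num_labels=2)
tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-uncased")

use_adahessian = True
if use_adahessian:
    # Second-order optimizer; everything else in the training loop is unchanged.
    optimizer = torch_optimizer.Adahessian(model.parameters(), lr=1e-4, weight_decay=0.01)
else:
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

batch = tokenizer(["a toy example sentence"], return_tensors="pt")
labels = torch.tensor([1])

model.train()
outputs = model(**batch, labels=labels)
loss = outputs.loss
# Adahessian estimates the Hessian diagonal from the gradients, so the backward pass
# must keep the graph; plain AdamW does not need create_graph=True.
loss.backward(create_graph=use_adahessian)
optimizer.step()
optimizer.zero_grad()
```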
“…Lastly, we generated embeddings of these POS-tagged tokens using the SIFRank and SIFRankplus embedding techniques for short and long documents, respectively. Recognizing the performance of state-of-the-art transformer-based pre-trained language models, we replaced ELMo with the pre-trained SqueezeBERT [14] as a word embedding method in SIFRank and SIFRankplus. The decision to use SqueezeBERT is motivated by its lightweight transformer architecture with higher information flow between the layers; moreover, it is faster than the BERT model [14].…”
Section: Keyphrase Extraction
Citation type: mentioning
confidence: 99%
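A minimal sketch of the substitution described above: contextual token embeddings come from SqueezeBERT (via the HuggingFace transformers checkpoint squeezebert/squeezebert-uncased) rather than ELMo, and are pooled into phrase and document vectors. The mean pooling and cosine ranking below are a simplified stand-in for SIFRank's SIF weighting and POS-based candidate selection, not the cited pipeline itself:

```python
# Sketch: SqueezeBERT contextual embeddings as a drop-in replacement for ELMo in a
# SIFRank-style keyphrase ranker (simplified: mean pooling instead of SIF weighting,
# and a hand-picked candidate list instead of POS-based extraction).
import torch
from transformers import SqueezeBertTokenizer, SqueezeBertModel

tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
model = SqueezeBertModel.from_pretrained("squeezebert/squeezebert-uncased")
model.eval()

document = "Knowledge graphs support adaptive learning in online education platforms."
candidate_phrases = ["knowledge graphs", "adaptive learning", "online education"]

with torch.no_grad():
    doc_inputs = tokenizer(document, return_tensors="pt")
    # Mean-pool token states into one document vector (stand-in for SIF weighting).
    doc_vec = model(**doc_inputs).last_hidden_state.mean(dim=1)

    scores = {}
    for phrase in candidate_phrases:
        phrase_inputs = tokenizer(phrase, return_tensors="pt")
        phrase_vec = model(**phrase_inputs).last_hidden_state.mean(dim=1)
        scores[phrase] = torch.cosine_similarity(doc_vec, phrase_vec).item()

# Candidates closer to the document vector rank higher as keyphrases.
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {phrase}")
```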
“…Through this research, we make the following contributions: (1) we adopt and adapt state-of-the-art word/sentence embedding techniques to automatically construct an EduKG, (2) we enhance the SIFRank keyphrase extraction method proposed in [13] by adopting SqueezeBERT [14], a transformer model for word embedding, (3) we propose an embedding-based concept-weighting strategy using the sentence embedding technique SBERT [15], and (4) we conduct empirical studies on different datasets, demonstrating the effectiveness of the SqueezeBERT-enhanced SIFRank keyphrase extraction method as well as the efficiency of the SBERT-based concept-weighting strategy against several baselines.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
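The SBERT-based concept weighting in contribution (3) can be sketched with the sentence-transformers package as follows; the checkpoint name and the weight-by-similarity-to-a-course-description scheme are illustrative assumptions, not the authors' exact procedure:

```python
# Sketch: embedding-based concept weighting with SBERT (sentence-transformers).
# The checkpoint and the similarity-based weighting scheme are assumptions made for
# illustration, not the cited paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

course_description = (
    "An introductory course on machine learning: regression, classification, "
    "and neural networks."
)
concepts = ["gradient descent", "overfitting", "photosynthesis", "backpropagation"]

desc_emb = model.encode(course_description, convert_to_tensor=True)
concept_embs = model.encode(concepts, convert_to_tensor=True)

# Weight each concept by its cosine similarity to the course description.
weights = util.cos_sim(concept_embs, desc_emb).squeeze(-1)

for concept, weight in sorted(zip(concepts, weights.tolist()), key=lambda kv: -kv[1]):
    print(f"{weight:.3f}  {concept}")
```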
“…Pre-trained transformers can then be effectively fine-tuned for downstream supervised tasks, usually with few or no architectural changes. Among the considerable variety of pre-trained transformer models, BERT [2] and its derivatives [17][18][19][20][21] have become the de facto standard for deep language modeling. The strength of this model comes from the bidirectional pretraining strategy, which leverages a huge amount of unlabeled text in an unsupervised fashion.…”
Section: Transformer Based Language Modeling
Citation type: mentioning
confidence: 99%
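The bidirectional pretraining strategy referred to above is BERT's masked-language-modeling objective; a minimal sketch with HuggingFace utilities follows (the checkpoint and masking probability are default choices for illustration, not tied to the cited work):

```python
# Sketch: the masked-language-modeling objective behind BERT-style bidirectional
# pretraining, shown on a toy batch with HuggingFace utilities.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Randomly mask 15% of tokens; the model learns to reconstruct them from both the
# left and right context, which is what makes the pretraining bidirectional.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

texts = [
    "pre-trained transformers can be fine-tuned for downstream tasks",
    "bidirectional pretraining leverages large amounts of unlabeled text",
]
encodings = [tokenizer(t) for t in texts]
batch = collator(encodings)  # pads, masks inputs, and builds MLM labels

outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
print("MLM loss on the toy batch:", outputs.loss.item())
```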