2020 | Preprint
DOI: 10.48550/arxiv.2006.11316
SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

Abstract: Humans read and write hundreds of billions of messages every day. Further, due to the availability of large datasets, large computing systems, and better neural network models, natural language processing (NLP) technology has made significant strides in understanding, proofreading, and organizing these messages. Thus, there is a significant opportunity to deploy NLP in myriad applications to help web users, social networks, and businesses. In particular, we consider smartphones and other mobile devices as cruc…

Cited by 10 publications (13 citation statements) | References 56 publications
“…Note that similar to NMT and LM, except learning rate and block size, ADAHESSIAN directly uses the same hyperparameters as AdamW. Interestingly, note that these results are better than those reported in SqueezeBERT (Iandola et al 2020), even though we only change the optimizer to ADAHESSIAN instead of AdamW.…”
Section: Natural Language Understanding
Citation type: mentioning
confidence: 85%
“…Comparison of AdamW and ADAHESSIAN for SqueezeBERT on the development set of the GLUE benchmark. The result of AdamW+ is directly from (Iandola et al 2020) and the result of AdamW* is reproduced by us.
AdamW: 35.42 ± .09, 35.66 ± .11, 35.37 ± .07, 35.18 ± .07, 34.79 ± .15, 14.41 ± 13.25, 0.41 ± .32, Diverge
ADAHESSIAN: 35.33 ± .10, 35.79 ± .06, 35.21 ± .14, 34.74 ± .10, 34.19 ± .06, 33.78 ± .14, 32.70 ± .10, 32.48 ± .83…”
Citation type: mentioning
confidence: 87%
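The statements above describe a pure optimizer swap: ADAHESSIAN replaces AdamW for SqueezeBERT fine-tuning while the remaining hyperparameters (apart from learning rate and block size) stay fixed. A minimal sketch of that setup, assuming the HuggingFace transformers SqueezeBERT checkpoint and the Adahessian implementation from the torch-optimizer package (the cited work uses its own implementation); the learning rates shown are illustrative, not the values used in the cited experiments:

```python
# Sketch: fine-tune SqueezeBERT, changing only the optimizer (AdamW vs. ADAHESSIAN).
# Assumes the HuggingFace `transformers` SqueezeBERT checkpoint and the Adahessian
# implementation shipped in the `torch-optimizer` package; hyperparameters are illustrative.
import torch
from transformers import SqueezeBertForSequenceClassification, SqueezeBertTokenizer
import torch_optimizer  # assumption: `torch-optimizer` is installed and provides Adahessian

model = SqueezeBertForSequenceClassification.from_pretrained(
    "squeezebert/squeezebert-uncased", num_labels=2)
tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-uncased")

use_adahessian = True
if use_adahessian:
    # Second-order optimizer; everything else in the training loop is unchanged.
    optimizer = torch_optimizer.Adahessian(model.parameters(), lr=1e-4, weight_decay=0.01)
else:
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)

batch = tokenizer(["a toy example sentence"], return_tensors="pt")
labels = torch.tensor([1])

model.train()
outputs = model(**batch, labels=labels)
loss = outputs.loss
# Adahessian estimates the Hessian diagonal from the gradients, so the backward pass
# must keep the graph; plain AdamW does not need create_graph=True.
loss.backward(create_graph=use_adahessian)
optimizer.step()
optimizer.zero_grad()
```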
“…Lastly, we generated embeddings of these POS-tagged tokens using the SIFRank and SIFRankplus embedding techniques for short and long documents, respectively. Recognizing the performance of state-of-the-art transformer-based pre-trained language models, we replaced ELMo with the pre-trained SqueezeBERT [14] as a word embedding method in SIFRank and SIFRankplus. The decision to use SqueezeBERT is motivated by its lightweight transformer architecture with higher information flow between the layers; moreover, it is faster than the BERT model [14].…”
Section: Keyphrase Extraction
Citation type: mentioning
confidence: 99%
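A minimal sketch of the substitution described above: contextual token embeddings come from SqueezeBERT (via the HuggingFace transformers checkpoint squeezebert/squeezebert-uncased) rather than ELMo, and are pooled into phrase and document vectors. The mean pooling and cosine ranking below are a simplified stand-in for SIFRank's SIF weighting and POS-based candidate selection, not the cited pipeline itself:

```python
# Sketch: SqueezeBERT contextual embeddings as a drop-in replacement for ELMo in a
# SIFRank-style keyphrase ranker (simplified: mean pooling instead of SIF weighting,
# and a hand-picked candidate list instead of POS-based extraction).
import torch
from transformers import SqueezeBertTokenizer, SqueezeBertModel

tokenizer = SqueezeBertTokenizer.from_pretrained("squeezebert/squeezebert-uncased")
model = SqueezeBertModel.from_pretrained("squeezebert/squeezebert-uncased")
model.eval()

document = "Knowledge graphs support adaptive learning in online education platforms."
candidate_phrases = ["knowledge graphs", "adaptive learning", "online education"]

with torch.no_grad():
    doc_inputs = tokenizer(document, return_tensors="pt")
    # Mean-pool token states into one document vector (stand-in for SIF weighting).
    doc_vec = model(**doc_inputs).last_hidden_state.mean(dim=1)

    scores = {}
    for phrase in candidate_phrases:
        phrase_inputs = tokenizer(phrase, return_tensors="pt")
        phrase_vec = model(**phrase_inputs).last_hidden_state.mean(dim=1)
        scores[phrase] = torch.cosine_similarity(doc_vec, phrase_vec).item()

# Candidates closer to the document vector rank higher as keyphrases.
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {phrase}")
```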
“…Through this research, we make the following contributions: (1) we adopt and adapt state-of-the-art word/sentence embedding techniques to automatically construct an EduKG, (2) we enhance the SIFRank keyphrase extraction method proposed in [13] by adopting SqueezeBERT [14], a transformer model for word embedding, (3) we propose an embedding-based concept-weighting strategy using the sentence embedding technique SBERT [15], and (4) we conduct empirical studies on different datasets, demonstrating the effectiveness of the SqueezeBERT-enhanced SIFRank keyphrase extraction method as well as the efficiency of the SBERT-based concept-weighting strategy against several baselines.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
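The SBERT-based concept weighting in contribution (3) can be sketched with the sentence-transformers package as follows; the checkpoint name and the weight-by-similarity-to-a-course-description scheme are illustrative assumptions, not the authors' exact procedure:

```python
# Sketch: embedding-based concept weighting with SBERT (sentence-transformers).
# The checkpoint and the similarity-based weighting scheme are assumptions made for
# illustration, not the cited paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

course_description = (
    "An introductory course on machine learning: regression, classification, "
    "and neural networks."
)
concepts = ["gradient descent", "overfitting", "photosynthesis", "backpropagation"]

desc_emb = model.encode(course_description, convert_to_tensor=True)
concept_embs = model.encode(concepts, convert_to_tensor=True)

# Weight each concept by its cosine similarity to the course description.
weights = util.cos_sim(concept_embs, desc_emb).squeeze(-1)

for concept, weight in sorted(zip(concepts, weights.tolist()), key=lambda kv: -kv[1]):
    print(f"{weight:.3f}  {concept}")
```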
“…Pre-trained transformers can then be effectively fine-tuned for downstream supervised tasks, usually with few or no architectural changes. Among the considerable variety of pre-trained transformer models, BERT [2] and its derivatives [17][18][19][20][21] have become the de facto standard for deep language modeling. The strength of this model comes from the bidirectional pretraining strategy, which leverages a huge amount of unlabeled text in an unsupervised fashion.…”
Section: Transformer Based Language Modeling
Citation type: mentioning
confidence: 99%
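The bidirectional pretraining strategy referred to above is BERT's masked-language-modeling objective; a minimal sketch with HuggingFace utilities follows (the checkpoint and masking probability are default choices for illustration, not tied to the cited work):

```python
# Sketch: the masked-language-modeling objective behind BERT-style bidirectional
# pretraining, shown on a toy batch with HuggingFace utilities.
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Randomly mask 15% of tokens; the model learns to reconstruct them from both the
# left and right context, which is what makes the pretraining bidirectional.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

texts = [
    "pre-trained transformers can be fine-tuned for downstream tasks",
    "bidirectional pretraining leverages large amounts of unlabeled text",
]
encodings = [tokenizer(t) for t in texts]
batch = collator(encodings)  # pads, masks inputs, and builds MLM labels

outputs = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"])
print("MLM loss on the toy batch:", outputs.loss.item())
```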