In recent years, Knowledge Distillation for BERT-like models (Devlin et al., 2019; Liu et al., 2019) has been extensively studied, leveraging intermediate-layer matching (Ji et al., 2021), data augmentation (Fu et al., 2020; Jiao et al., 2020; Kamalloo et al., 2021), adversarial training (Zaharia et al., 2021; Rashid et al., 2020), and, lately, re-weighting of loss terms (Clark et al., 2019; Zhou et al., 2021; Jafari et al., 2021). In this work, we explore the latter direction with a meta-learning approach (Li et al., 2019; Fan et al., 2020).
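To make the "re-weighting of loss terms" direction concrete, here is a minimal PyTorch sketch of a standard re-weighted distillation objective: a weighted sum of a hard-label cross-entropy term and a temperature-scaled KL term against the teacher. The weights `w_ce` and `w_kd`, the temperature `T`, and the function name are illustrative placeholders; in a meta-learning setup such weights would be produced or tuned by an outer loop rather than fixed by hand. This is a generic sketch, not the specific method of any cited paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      w_ce: float, w_kd: float, T: float = 2.0):
    # Hard-label term: cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-scaled
    # student and teacher distributions (scaled by T^2, as is
    # conventional, to keep gradient magnitudes comparable).
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Re-weighted combination: w_ce and w_kd are the loss-term
    # weights that re-weighting approaches adjust -- fixed here,
    # but meta-learned in an outer loop in that line of work.
    return w_ce * ce + w_kd * kd

if __name__ == "__main__":
    # Hypothetical toy batch: 4 examples, 10 classes.
    student = torch.randn(4, 10)
    teacher = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    print(distillation_loss(student, teacher, labels, w_ce=0.5, w_kd=0.5))
```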