BERT Learns to Teach: Knowledge Distillation with Meta Learning

Zhou, Wangchunshu; Xu, Canwen; McAuley, Julian

doi:10.18653/v1/2022.acl-long.485

Cited by 33 publications

(26 citation statements)

References 0 publications

“…Teacher assistant-based distillation [14,15,17] is showcased to trade in teacher scale for student performance by inserting an intermediate-scale teacher assistant. This phenomenon is also supported in other work that better student performance should be attained with slightly worse teacher learning capacity [52]. However, setting the teacher assistant to a small scale with high performance for the student is nontrivial.…”

Section: Related Work (supporting)

confidence: 57%

AutoDisc: Automatic Distillation Schedule for Large Language Model Compression

Zhang, Yang, Wang et al. 2022

Preprint

Driven by the teacher-student paradigm, knowledge distillation is one of the de facto ways for language model compression. Recent studies have uncovered that conventional distillation is less effective when facing a large capacity gap between the teacher and the student, and introduced teacher assistant-based distillation to bridge the gap. As the connection between the two, the scale and the performance of the teacher assistant are crucial for transferring the knowledge from the teacher to the student. However, existing teacher assistant-based methods manually select the scale of the teacher assistant, which fails to identify the teacher assistant with the optimal scale-performance tradeoff. To this end, we propose an Automatic Distillation Schedule (AUTODISC) for large language model compression. In particular, AUTODISC first specifies a set of teacher assistant candidates at different scales with gridding and pruning, and then optimizes all candidates in a once-for-all optimization with two approximations. The best teacher assistant scale is automatically selected according to the scale-performance tradeoff. AUTODISC is evaluated with an extensive set of experiments on the GLUE language understanding benchmark. Experimental results demonstrate the improved performance and applicability of AUTODISC. We further apply AUTODISC to a language model with over one billion parameters and show the scalability of AUTODISC.

Recent advances [15] have shown that conventional distillation suffers from severe performance decline when facing a large capacity gap between the teacher and the student. To alleviate the shortcoming, teacher assistant-based distillation [16] has been proposed, where the teacher is first distilled into an intermediate-scale teacher assistant. This teacher assistant then serves as an alternative teacher to transfer the knowledge to the student. While teacher assistant-based distillation generally …

“…To further illustrate the superiority of our methods, we further compare the current typical distillation methods like previous work [41], as shown in Table 2. Results of baselines in Table 2 are reported by [51].…”

Section: Results on CIFAR-100 (mentioning)

confidence: 99%

“…In Table 4, we further explore the effect of the PESF-KD in NLP dataset, and other baselines are reported by [51]. Time in the Table 4 refer to training resources cost, which is the lowest consumption with our PESF-KD compared with other baselines except for vanilla KD.…”

Section: Results on GLUE (mentioning)

confidence: 99%

Parameter-Efficient and Student-Friendly Knowledge Distillation

Rao, Meng, Liu et al. 2022

Preprint

Knowledge distillation (KD) has been extensively employed to transfer the knowledge from a large teacher model to smaller students, where the parameters of the teacher are fixed (or partially fixed) during training. Recent studies show that this mode may cause difficulties in knowledge transfer due to the mismatched model capacities. To alleviate the mismatch problem, teacher-student joint training methods, e.g., online distillation, have been proposed, but they always incur an expensive computational cost. In this paper, we present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer by updating relatively few parameters. Technically, we first mathematically formulate the mismatch as the sharpness gap between the teacher's and the student's predictive distributions, where we show such a gap can be narrowed with the appropriate smoothness of the soft label. Then, we introduce an adapter module for the teacher, and only update the adapter to obtain soft labels with appropriate smoothness. Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods. Code will be released upon acceptance.
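The role of soft-label smoothness mentioned in this abstract can be illustrated with a minimal sketch. It uses plain temperature scaling, which is a generic way to smooth a teacher's distribution and is an assumption made here for illustration; it is not the adapter module that PESF-KD trains. The logits and the helper names `soften`, `entropy`, and `kl_divergence` are hypothetical.

```python
import math

def soften(logits, temperature):
    """Temperature-scaled softmax: a larger temperature gives a smoother distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a smoother (less sharp) distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """KL(p || q), the usual distillation loss between teacher p and student q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits: a confident (sharp) teacher and a less confident student.
teacher_logits = [6.0, 2.0, 1.0]
student_logits = [2.5, 1.5, 1.0]

sharp = soften(teacher_logits, 1.0)    # unsmoothed teacher soft labels
smooth = soften(teacher_logits, 4.0)   # smoothed teacher soft labels
student = soften(student_logits, 1.0)

print("teacher entropy, T=1:", entropy(sharp))
print("teacher entropy, T=4:", entropy(smooth))
print("KL to student, T=1:", kl_divergence(sharp, student))
print("KL to student, T=4:", kl_divergence(smooth, student))
```

For these particular logits, smoothing the teacher brings its distribution closer to what the smaller student can represent, which is the intuition behind narrowing a sharpness gap. The fixed temperature here stands in for whatever smoothness a learned component would produce.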

“…In this work, we aim to leverage meta-learning in a more flexible manner by refining the pseudo-labels instead of reweighting. Approach-wise, the most related works are (Pham et al, 2021;Zhou et al, 2022) from computer vision and model distillation respectively, which also refine the teacher's parameters from student feedback. However, they work with samples from clean distributions, while we anticipate the noise memorization effect and enhance our framework with teacher warm-up and confidence filtering to suppress the error propagation.…”

Section: Related Work (mentioning)

confidence: 99%

Meta Self-Refinement for Robust Learning with Weak Supervision

Zhu, Shen, Hedderich et al. 2022

Preprint

Training deep neural networks (DNNs) with weak supervision has been a hot topic as it can significantly reduce the annotation cost. However, labels from weak supervision can be rather noisy, and the high capacity of DNNs makes them prone to overfitting the noisy labels. Recent methods leverage self-training techniques to train noise-robust models, where a teacher trained on noisy labels is used to teach a student. However, the teacher from such models might fit a substantial amount of noise and produce wrong pseudo-labels with high confidence, leading to error propagation. In this work, we propose Meta Self-Refinement (MSR), a noise-resistant learning framework, to effectively combat noisy labels from weak supervision sources. Instead of purely relying on a fixed teacher trained on noisy labels, we keep updating the teacher to refine its pseudo-labels. At each training step, MSR performs a meta gradient descent on the current mini-batch to maximize the student performance on a clean validation set. Extensive experimentation on eight NLP benchmarks demonstrates that MSR is robust against noise in all settings and outperforms the state of the art by up to 11.4% in accuracy and 9.26% in F1 score.
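The idea of updating a teacher from student feedback, which this abstract and the cited paper share, can be sketched on a toy scalar problem. Everything below is an assumption made for illustration: the one-parameter linear teacher and student, the squared-error losses, the learning rates, and the function names are hypothetical, and the update order is one common arrangement rather than the procedure of MSR or of the cited paper.

```python
def student_step(s, t, x, lr):
    """One gradient step of the student on the teacher's pseudo-label t*x.

    Student loss is (s*x - t*x)**2, whose gradient w.r.t. s is 2*x*x*(s - t).
    """
    return s - lr * 2.0 * x * x * (s - t)

def meta_gradient(s_new, x, x_val, y_val, lr):
    """Gradient of the updated student's clean validation loss w.r.t. the teacher.

    Chain rule: dL_val/dt = dL_val/ds_new * ds_new/dt.
    """
    dval_ds = 2.0 * x_val * (s_new * x_val - y_val)  # from (s_new*x_val - y_val)**2
    ds_dt = 2.0 * lr * x * x                         # from the student_step formula
    return dval_ds * ds_dt

def train(steps=200, lr=0.1, meta_lr=0.5):
    true_slope = 2.0                 # the clean relationship y = 2*x (toy assumption)
    x, x_val = 1.0, 1.0              # one training input and one clean validation input
    y_val = true_slope * x_val
    s, t = 0.0, 0.5                  # student starts at 0; teacher starts out wrong
    val_losses = []
    for _ in range(steps):
        s_new = student_step(s, t, x, lr)                         # student learns from teacher
        t -= meta_lr * meta_gradient(s_new, x, x_val, y_val, lr)  # teacher learns from feedback
        s = s_new                                                 # keep the student's step
        val_losses.append((s * x_val - y_val) ** 2)
    return s, t, val_losses

student, teacher, val_losses = train()
print("student slope:", student)
print("teacher slope:", teacher)
print("first and last validation loss:", val_losses[0], val_losses[-1])
```

In this toy setting the teacher is corrected only through the student's validation loss and never sees a clean training label directly. In practice the derivative through the student update would come from automatic differentiation rather than a hand-derived formula.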
