In recent years, Knowledge Distillation for BERT-like models (Devlin et al., 2019; Liu et al., 2019) has been extensively studied, leveraging intermediate-layer matching (Ji et al., 2021), data augmentation (Fu et al., 2020; Jiao et al., 2020; Kamalloo et al., 2021), adversarial training (Zaharia et al., 2021; Rashid et al., 2020), and, lately, re-weighting of loss terms (Clark et al., 2019; Zhou et al., 2021; Jafari et al., 2021). In this work, we explore the latter direction with a meta-learning approach (Li et al., 2019; Fan et al., 2020).
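To make the "re-weighting of loss terms" direction concrete, here is a minimal PyTorch sketch of a standard re-weighted distillation objective: a weighted sum of a hard-label cross-entropy term and a temperature-scaled KL term against the teacher. The weights `w_ce` and `w_kd`, the temperature `T`, and the function name are illustrative placeholders; in a meta-learning setup such weights would be produced or tuned by an outer loop rather than fixed by hand. This is a generic sketch, not the specific method of any cited paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      w_ce: float, w_kd: float, T: float = 2.0):
    # Hard-label term: cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label term: KL divergence between temperature-scaled
    # student and teacher distributions (scaled by T^2, as is
    # conventional, to keep gradient magnitudes comparable).
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Re-weighted combination: w_ce and w_kd are the loss-term
    # weights that re-weighting approaches adjust -- fixed here,
    # but meta-learned in an outer loop in that line of work.
    return w_ce * ce + w_kd * kd

if __name__ == "__main__":
    # Hypothetical toy batch: 4 examples, 10 classes.
    student = torch.randn(4, 10)
    teacher = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    print(distillation_loss(student, teacher, labels, w_ce=0.5, w_kd=0.5))
```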