Knowledge distillation (KD) has been extensively studied in natural language processing (NLP) to obtain lightweight yet efficient and effective language models. A growing number of KD methods have been proposed for a wide range of NLP tasks (Liu et al., 2019b; Gordon and Duh, 2019; Haidar and Rezagholizadeh, 2019; Yang et al., 2020b; Tang et al., 2019; Hu et al., 2018; Nakashole and Flauger, 2017; Jiao et al., 2019; Wang et al., 2018c; Zhou et al., 2019a; Sanh et al., 2019; Turc et al., 2019; Arora et al., 2019; Clark et al., 2019; Kim and Rush, 2016; Mou et al., 2016; Liu et al., 2019e; Hahn and Choi, 2019; Kuncoro et al., 2016; Cui et al., 2017; Wei et al., 2019; Freitag et al., 2017; Shakeri et al., 2019; Aguilar et al., 2020). NLP tasks in which KD has been applied include neural machine translation (NMT) (Hahn and Choi, 2019; Zhou et al., 2019a; Kim and Rush, 2016; Wei et al., 2019; Freitag et al., 2017; Gordon and Duh, 2019), question answering (Wang et al., 2018c; Arora et al., 2019; Yang et al., 2020b; Hu et al., 2018), document retrieval (Shakeri et al., 2019), event detection (Liu et al., 2019b), text generation (Haidar and Rezagholizadeh, 2019), …
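Most of the methods surveyed above build on the classic soft-target objective: the student is trained to match the teacher's temperature-softened output distribution. The sketch below, assuming the standard temperature-scaled formulation with the usual T² scaling (function names here are illustrative, not from any cited work):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the teacher's soft targets to the student's
    softened predictions, scaled by T**2 so gradients keep a comparable
    magnitude as T varies (illustrative sketch, not a cited method)."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)   # student's softened predictions
    kl = float(np.sum(p * (np.log(p) - np.log(q))))
    return (T ** 2) * kl
```

In practice this term is combined with the ordinary cross-entropy on the ground-truth labels; the loss is zero when student and teacher logits agree and positive otherwise.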