“…In other words, we can distill the knowledge from one model (massive or teacher model) to another (small or student model). Previous work has shown that KD can significantly boost prediction accuracy in natural language processing and speech processing (Kim and Rush, 2016; Hu et al, 2018; Huang et al, 2018b; Hahn and Choi, 2019; Liu et al, 2021b,a; Cheng et al, 2016b; Cheng and You, 2016; Cheng et al, 2016a; You et al, 2020b, 2021e, 2022b, 2019a; Lyu et al, 2018, 2019; Guha et al, 2020; Yang et al, 2020; Ma et al, 2021a,b), while adopting KD-based methods for SQA tasks has been less explored. In this work, our goal is to handle the SCQA tasks.…”