“…However, these models often consume considerable storage, memory bandwidth, and computational resources. To reduce the model size and increase the inference throughput, compression techniques such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Tang et al., 2019; Jiao et al., 2019; Sun et al., 2020) … [Figure: comparison between knowledge distillation methods (DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019)) and iterative pruning methods (Iterative Pruning (Guo et al., 2019) and our proposed method) in terms of accuracy at various compression rates using the MNLI test set.] Knowledge distillation methods require re-distillation from the teacher to obtain each single data point, whereas iterative pruning methods can produce continuous curves at once.…”
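As a rough illustration of that last point, the sketch below runs a generic iterative magnitude-pruning loop with PyTorch's `torch.nn.utils.prune`. The toy classifier, the `evaluate` helper on synthetic data, and the 10%-per-step schedule are illustrative assumptions standing in for a real BERT model and the MNLI dev set, not the paper's actual method; the point is only that a single pruning run yields a whole sequence of (compression rate, accuracy) points, whereas a distillation baseline would need a separate student training run for each target size.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-ins (assumptions for illustration): a small classifier and synthetic
# evaluation data replace a pre-trained BERT model and the MNLI dev set.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 3))

def evaluate(m: nn.Module) -> float:
    """Stand-in for measuring accuracy on a held-out set (synthetic data here)."""
    with torch.no_grad():
        x = torch.randn(64, 768)
        y = torch.randint(0, 3, (64,))
        return (m(x).argmax(dim=-1) == y).float().mean().item()

prunable = [(mod, "weight") for mod in model.modules() if isinstance(mod, nn.Linear)]
curve = []  # (compression rate, accuracy) pairs collected in one pruning run

for step in range(10):
    # Prune a further 10% of the remaining weights by magnitude at each step.
    for mod, name in prunable:
        prune.l1_unstructured(mod, name=name, amount=0.1)
    # In a real setup, a few epochs of fine-tuning would follow each pruning step.
    total = sum(mod.weight.numel() for mod, _ in prunable)
    nonzero = sum(int(mod.weight.count_nonzero()) for mod, _ in prunable)
    curve.append((total / max(nonzero, 1), evaluate(model)))

# A single run traces out the whole accuracy-vs-compression curve; a distillation
# baseline would instead need one full student training run per target model size.
print(curve)
```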