Learning from Multiple Teacher Networks

You, Suya; Xu, Chang; Xu, Chao; Tao, Dacheng

doi:10.1145/3097983.3098135

Cited by 313 publications

(158 citation statements)

References 11 publications

Mentioning: 158

“…Motivated by ensemble learning methods, You et.al. [22] simultaneously utilized multiple teacher networks to learn a better student network. Moreover, several algorithms have been developed to investigate the restriction between teacher and student.…”

Section: Knowledge Distillation (mentioning)

confidence: 99%

Learning Student Networks via Feature Embedding

Chen, Wang, et al. 2021

IEEE Trans. Neural Netw. Learning Syst.

Self Cite

Deep convolutional neural networks have been widely used in numerous applications, but their demanding storage and computational resource requirements prevent their applications on mobile devices. Knowledge distillation aims to optimize a portable student network by taking the knowledge from a well-trained heavy teacher network. Traditional teacher-student based methods used to rely on additional fully-connected layers to bridge intermediate layers of teacher and student networks, which brings in a large number of auxiliary parameters. In contrast, this paper aims to propagate information from teacher to student without introducing new variables which need to be optimized. We regard the teacher-student paradigm from a new perspective of feature embedding. By introducing the locality preserving loss, the student network is encouraged to generate the low-dimensional features which could inherit intrinsic properties of their corresponding high-dimensional features from teacher network. The resulting portable network thus can naturally maintain the performance as that of the teacher network. Theoretical analysis is provided to justify the lower computation complexity of the proposed method. Experiments on benchmark datasets and well-trained networks suggest that the proposed algorithm is superior to state-of-the-art teacher-student learning methods in terms of computational and storage complexity.
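
The abstract above does not spell out the locality preserving loss, so the following is only a minimal sketch of the general idea, not the paper's formulation. It assumes a Laplacian-eigenmap-style objective: samples that are close in the teacher's high-dimensional feature space should stay close in the student's low-dimensional feature space. The function names, the `k` nearest-neighbour rule, and the Gaussian weight with bandwidth `sigma` are all illustrative assumptions.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def locality_preserving_loss(teacher_feats, student_feats, k=2, sigma=1.0):
    """Hypothetical locality-preserving penalty (assumed form, not the paper's).

    For each sample i, take its k nearest neighbours in teacher feature space,
    weight each neighbour j by exp(-||t_i - t_j||^2 / sigma), and accumulate
    weight * ||s_i - s_j||^2 measured in student feature space.
    """
    n = len(teacher_feats)
    total = 0.0
    for i in range(n):
        # Neighbours are ranked by distance in the teacher space only.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(teacher_feats[i], teacher_feats[j]),
        )[:k]
        for j in neighbours:
            weight = math.exp(-sq_dist(teacher_feats[i], teacher_feats[j]) / sigma)
            total += weight * sq_dist(student_feats[i], student_feats[j])
    return total / n
```

Because the neighbourhood structure is computed from fixed teacher features, a term of this kind adds no trainable parameters, which matches the abstract's stated goal of avoiding auxiliary layers. In practice it would be combined with a task loss, since on its own it is minimized by collapsing all student features to a single point.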

Section: Knowledge Distillation (mentioning)

confidence: 99%

“…You et.al. [22] simultaneously utilized multiple teacher networks for learning a more accurate student network. Zagoruyko et.al.…”

Section: A Teacher-student Interactions (mentioning)

confidence: 99%

Learning Student Networks via Feature Embedding

Chen, Wang, et al. 2021

IEEE Trans. Neural Netw. Learning Syst.

Self Cite

“…Lee et al [17] studied the performance of different ensemble methods under the framework of multi-task learning. You et al [29] presented a method to train a thin deep network by incorporating in the intermediate layers and imposing a constraint about the dissimilarity among examples. Wu et al [27] propose a multi-teacher knowledge distillation framework for compressed video action recognition to compress this model.…”

Section: Multi-task Learning (mentioning)

confidence: 99%

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Yang, Shou, Gong, et al. 2020

Proceedings of the 13th International Conference on Web Search and Data Mining

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) have demonstrated excellent results in question answering areas. However, due to the sheer amount of model parameters, the inference speed of these models is very slow. How to apply these complex models to real business scenarios becomes a challenging but practical problem. Previous model compression methods usually suffer from information loss during the model compression procedure, leading to inferior models compared with the original one. To tackle this challenge, we propose a Two-stage Multi-teacher Knowledge Distillation (TMKD for short) method for web Question Answering system. We first develop a general Q&A distillation task for student model pre-training, and further fine-tune this pretrained student model with multi-teacher knowledge distillation on downstream tasks (like Web Q&A task, MNLI, SNLI, RTE tasks from GLUE), which effectively reduces the overfitting bias in individual teacher models, and transfers more general knowledge to the student model. The experiment results show that our method can significantly outperform the baseline methods and even achieve comparable results with the original teacher models, along with substantial speedup of model inference.
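
As a rough illustration of what distilling from several teachers at once can look like, the sketch below averages the temperature-softened predictions of multiple teachers and scores a student against that averaged target with cross-entropy. This is a generic soft-label scheme written under stated assumptions, not the TMKD procedure itself: the uniform teacher weights, the temperature value, and every function name here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def multi_teacher_soft_target(teacher_logits, temperature=2.0, weights=None):
    """Weighted average of the teachers' softened class distributions.

    Uniform weights are an assumption; other weighting rules are possible.
    """
    n = len(teacher_logits)
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs))
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the averaged teacher target and the student."""
    target = multi_teacher_soft_target(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(q) for t, q in zip(target, student))


teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
print(distillation_loss([1.8, 0.7, -0.8], teachers))
```

Averaging over teachers is one simple way to dampen the idiosyncrasies of any single teacher, which is the motivation the abstract gives for using more than one.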

“…Transfer learning is proposed to transfer knowledge from source domain to target domain to save data on target domain [24]. It contains two main research directions: cross-domain transfer learning [22,12,10,4] and cross-task one [9,3,5,35]. In the case of cross-domain transfer learning, the dataset adopted by source domain and the counterpart of target domain are different in domain but the same in category.…”

Section: Related Work (mentioning)

confidence: 99%

Customizing Student Networks From Heterogeneous Teachers via Adaptive Knowledge Amalgamation

Shen, Xue, Wang, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

A massive number of well-trained deep networks have been released by developers online. These networks may focus on different tasks and in many cases are optimized for different datasets. In this paper, we study how to exploit such heterogeneous pre-trained networks, known as teachers, so as to train a customized student network that tackles a set of selective tasks defined by the user. We assume no human annotations are available, and each teacher may be either single- or multi-task. To this end, we introduce a dual-step strategy that first extracts the task-specific knowledge from the heterogeneous teachers sharing the same sub-task, and then amalgamates the extracted knowledge to build the student network. To facilitate the training, we employ a selective learning scheme where, for each unlabelled sample, the student learns adaptively from only the teacher with the least prediction ambiguity. We evaluate the proposed approach on several datasets and experimental results demonstrate that the student, learned by such adaptive knowledge amalgamation, achieves performances even better than those of the teachers.
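
The selective learning scheme described above, in which the student follows only the least ambiguous teacher for each unlabelled sample, might be sketched as follows. The abstract does not define "prediction ambiguity", so this example assumes Shannon entropy of the teacher's predicted class distribution as the measure; `entropy` and `select_teacher` are hypothetical helper names.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_teacher(teacher_probs):
    """Index of the teacher whose prediction has the lowest entropy.

    Entropy as the ambiguity measure is an assumption made for this sketch.
    """
    return min(range(len(teacher_probs)), key=lambda i: entropy(teacher_probs[i]))


# Predictions of three teachers for one unlabelled sample.
sample_predictions = [
    [0.40, 0.30, 0.30],
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
]
chosen = select_teacher(sample_predictions)
print(chosen, sample_predictions[chosen])
```

The student's target for that sample would then be the chosen teacher's output alone, rather than an average over all teachers.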
