Interspeech 2017
DOI: 10.21437/interspeech.2017-614

Efficient Knowledge Distillation from an Ensemble of Teachers

Abstract: This paper describes the effectiveness of knowledge distillation using teacher student training for building accurate and compact neural networks. We show that with knowledge distillation, information from multiple acoustic models like very deep VGG networks and Long Short-Term Memory (LSTM) models can be used to train standard convolutional neural network (CNN) acoustic models for a variety of systems requiring a quick turnaround. We examine two strategies to leverage multiple teacher labels for training stud…
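Since only the abstract is shown above, the following is a minimal sketch of the general idea it describes: a compact student acoustic model trained against soft labels aggregated from an ensemble of teacher models (e.g., VGG and LSTM). The PyTorch framing, the distillation_loss name, the temperature parameter, and the simple weighted average of teacher posteriors are illustrative assumptions, not the paper's exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, temperature=1.0, weights=None):
    """Loss of a student against averaged soft labels from an ensemble of
    teacher acoustic models (illustrative sketch, not the paper's method)."""
    if weights is None:
        weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)

    # Combine the teachers' softened posteriors into one soft-label distribution.
    soft_targets = sum(
        w * F.softmax(t / temperature, dim=-1)
        for w, t in zip(weights, teacher_logits_list)
    )

    # KL divergence between the student's softened output and the combined targets.
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (temperature ** 2)
```

Averaging the teacher posteriors is only one possible combination strategy; the abstract indicates the paper examines two strategies for leveraging multiple teacher labels.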

Cited by 212 publications (127 citation statements)
References 11 publications

“…The VGG model has the same context-dependent phones in the output layer as the parent CNN model. Then, the student CNN model was trained like the parent CNN model, but within a teacher-student training framework, on the same 3600 hours of audio data, after soft labels were generated from the VGG model [19].…”
Section: Data and Model Builds
confidence: 99%
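The statement above describes an offline teacher-student setup: soft labels are first generated by the VGG teacher and then used to train the student CNN. The sketch below illustrates that two-stage workflow, assuming a PyTorch-style pipeline with a hypothetical generate_soft_labels helper and a standard data loader; it is not the cited system's actual code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_soft_labels(teacher, loader, temperature=1.0):
    """Run the frozen teacher once over the training set and keep its
    softened posteriors as targets for the student (hypothetical helper)."""
    teacher.eval()
    targets = []
    for features, _ in loader:            # hard labels are not needed for soft targets
        logits = teacher(features)
        targets.append(F.softmax(logits / temperature, dim=-1).cpu())
    return torch.cat(targets)

def train_student_epoch(student, loader, soft_targets, optimizer, temperature=1.0):
    """One epoch of student training against precomputed soft labels.
    Assumes the loader iterates in a fixed order (no shuffling)."""
    student.train()
    offset = 0
    for features, _ in loader:
        batch_targets = soft_targets[offset:offset + features.size(0)]
        offset += features.size(0)
        optimizer.zero_grad()
        log_probs = F.log_softmax(student(features) / temperature, dim=-1)
        loss = F.kl_div(log_probs, batch_targets, reduction="batchmean")
        loss.backward()
        optimizer.step()
```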
“…It performed well in terms of speaker individuality and root mean square log-spectral distortion (RMS-LSD). Additionally, non-learning BWE approaches have also been reported in recent years [9], [17]–[20].…”
Section: Introduction
confidence: 99%
“…Instead of augmenting the data by applying various kinds of signal distortions to the input acoustic features, as is often done, our idea is to augment the training data by creating multiple copies with different labels that reflect the corresponding output targets of each task. A similar idea has recently been used in knowledge distillation [25]. They applied the AS to the multi-teacher student approach, where an update is performed sequentially for each teacher's loss instead of using a linearly interpolated loss, achieving better performance than models trained from scratch or with interpolated loss functions.…”
Section: Introduction
confidence: 99%
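The last statement contrasts two ways of using several teachers: a single update on a linearly interpolated loss versus sequential updates, one per teacher. The sketch below is an illustration of that distinction, assuming PyTorch and a generic pairwise kd_loss (for example, a single-teacher variant of the distillation loss sketched earlier); none of the function names come from the cited papers.

```python
import torch

def train_step_interpolated(student, optimizer, batch, teachers, kd_loss, weights):
    """Single update on a linear interpolation of the per-teacher distillation losses."""
    with torch.no_grad():                              # teachers are frozen
        teacher_logits = [teacher(batch) for teacher in teachers]
    optimizer.zero_grad()
    student_logits = student(batch)
    loss = sum(w * kd_loss(student_logits, tl) for w, tl in zip(weights, teacher_logits))
    loss.backward()
    optimizer.step()

def train_step_sequential(student, optimizer, batch, teachers, kd_loss):
    """One update per teacher, applied in sequence on the same batch."""
    for teacher in teachers:
        with torch.no_grad():
            teacher_logits = teacher(batch)
        optimizer.zero_grad()
        loss = kd_loss(student(batch), teacher_logits)
        loss.backward()
        optimizer.step()
```

According to the quoted statement, the sequential variant outperformed both training from scratch and training on the interpolated loss.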