Interspeech 2017
DOI: 10.21437/interspeech.2017-1079

Distilling Knowledge from an Ensemble of Models for Punctuation Prediction

Abstract: This paper proposes an approach to distill knowledge from an ensemble of models into a single deep neural network (DNN) student model for punctuation prediction. This approach makes the DNN student model mimic the behavior of the ensemble. The ensemble consists of three single models. Kullback-Leibler (KL) divergence is used to minimize the difference between the output distribution of the DNN student model and the behavior of the ensemble. Experimental results on the English IWSLT2011 dataset show that the ensemble…
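
For intuition, the KL-divergence distillation objective described in the abstract can be sketched as follows. This is a minimal PyTorch illustration, not the paper's implementation: the function name, the temperature and interpolation hyper-parameters, the cross-entropy mixing term, and the simple averaging of the three teachers' outputs are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, ensemble_logits_list, hard_labels,
                      temperature=1.0, alpha=0.5):
    """Sketch of ensemble-to-student distillation for punctuation prediction.

    student_logits:       (batch, num_punct_classes) raw scores from the student DNN
    ensemble_logits_list: list of (batch, num_punct_classes) logits, one per teacher
    hard_labels:          (batch,) gold punctuation labels
    temperature, alpha:   illustrative hyper-parameters, not taken from the paper
    """
    # Average the teachers' (optionally softened) distributions to form the ensemble target.
    teacher_probs = torch.stack(
        [F.softmax(l / temperature, dim=-1) for l in ensemble_logits_list]
    ).mean(dim=0)

    # KL divergence between the student's output distribution and the ensemble's.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # Assumed extra term: standard cross-entropy on the gold labels, mixed in with alpha.
    ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * kl * (temperature ** 2) + (1.0 - alpha) * ce
```

Minimizing the KL term pushes the student's softmax output toward the averaged ensemble distribution, which is the "mimic the behavior of the ensemble" step the abstract describes; the hard-label term is a common but here assumed addition.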

Cited by 28 publications (34 citation statements) | References 22 publications
“…In another study, Gale and Parthasarathy (2017) used a character-level LSTM architecture to achieve results that are competitive with the word-level CRF-based approach. Yi et al. (2017) combined a bidirectional LSTM with a CRF layer and an ensemble of three networks. They further used knowledge distillation to transfer knowledge from the ensemble of networks to a single DNN network.…”
Section: Related Work
confidence: 99%
“…Full-Transformer denotes the model that replaces the CT-Transformer block in Figure 1 with a full-sequence Transformer. The results of these models on the IWSLT2011 test set are reported in the last group in [19], respectively. The previous state-of-the-art model is Self-attention-word-speech [2], which used a full-sequence Transformer encoder-decoder model with pre-trained word2vec and speech2vec embedding features.…”
Section: Results
confidence: 99%
“…Though non-sequential, several previous approaches use simpler network architectures (e.g., DNNs (Yi et al., 2017; Che et al., 2016) or CNNs (B. Garg and Anika, 2018; Che et al., 2016; Żelasko et al., 2018)), which have less predictive power.…”
Section: Related Work
confidence: 99%