Interspeech 2019
DOI: 10.21437/interspeech.2019-2599
Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition

Abstract: End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately trained components for acoustic modeling, the pronunciation lexicon, and language modeling, the end-to-end paradigm is both conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of differ…

Cited by 56 publications (65 citation statements) · References 25 publications
“…Finally, we hope to gain some understanding into why pretraining on ASR helps with AST, and specifically how the neural network representations change during pretraining and fine-tuning. We follow [35] and [10], who built diagnostic classifiers [36] to examine the representation of phonetic information in end-to-end ASR and AST systems, respectively. Unlike [10,35], who used non-linear classifiers, we use a linear classifier to predict phone labels from the internal representations of the trained ASR or AST model.…”
Section: Analyzing the Models' Representations
confidence: 99%
“…We follow [35] and [10], who built diagnostic classifiers [36] to examine the representation of phonetic information in end-to-end ASR and AST systems, respectively. Unlike [10,35], who used non-linear classifiers, we use a linear classifier to predict phone labels from the internal representations of the trained ASR or AST model. Using a linear classifier allows us to make more precise claims: if the classifier performs better using the representation from a particular layer, we can say that layer represents the phonetic information in a more linearly separable way.…”
Section: Analyzing the Models' Representations
confidence: 99%
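The probing setup these excerpts describe is straightforward to sketch. Below is a minimal, hypothetical illustration of a linear diagnostic classifier, assuming frame-level encoder activations that have already been extracted and aligned to phone labels; the random data, dimensions, and names (hidden_dim, n_phones) are illustrative placeholders, not the cited papers' actual configuration:

# Hypothetical sketch of a linear diagnostic (probing) classifier, in the
# spirit of the quoted passages: hidden states from a trained ASR/AST
# encoder are frozen, and a linear classifier is trained to predict phone
# labels from them. The data here is random stand-in material.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for encoder activations: one hidden vector per speech frame,
# each frame paired with a phone label (e.g. obtained by forced alignment).
n_frames, hidden_dim, n_phones = 5000, 256, 40
features = rng.normal(size=(n_frames, hidden_dim)).astype(np.float32)
phone_labels = rng.integers(0, n_phones, size=n_frames)

X_train, X_test, y_train, y_test = train_test_split(
    features, phone_labels, test_size=0.2, random_state=0)

# A purely linear probe: no hidden layers, so higher accuracy from a given
# layer's activations means that layer encodes phone identity in a more
# linearly separable way.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"phone probing accuracy: {probe.score(X_test, y_test):.3f}")

Because the probe is purely linear, differences in its accuracy across layers can be read directly as differences in how linearly separable the phonetic information is, which is exactly the claim the quoted passage makes for preferring a linear classifier over a non-linear one.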
“…Although a CTC-based model is another choice for the A2C model, as in [12], we adopt the attention-based model because character-level CTC models are more likely to misspell than attention-based models [10,11]. The character-level decoder can be connected to an arbitrary intermediate layer [23,24]. The overall loss function is the linear interpolation of the negative log-likelihood between the A2W and A2C models by a tunable parameter λ (0 ≤ λ ≤ 1):…”
Section: Multi-Task Learning with Attention-Based A2C Model
confidence: 99%
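The loss equation itself is truncated in the snippet above. Read literally from the description, a linear interpolation of the two negative log-likelihoods would take a form like the following (which term receives λ versus 1 − λ is an assumption, not confirmed by the excerpt):

\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{A2W}} + (1 - \lambda)\,\mathcal{L}_{\mathrm{A2C}}, \qquad 0 \le \lambda \le 1

where \mathcal{L}_{\mathrm{A2W}} and \mathcal{L}_{\mathrm{A2C}} denote the negative log-likelihoods under the acoustic-to-word and acoustic-to-character decoders, respectively, so that λ trades off the two objectives during multi-task training.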
“…better comprehending the complex, highly nonlinear transformations inside the network. Previous research analyzing end-to-end ASR involves investigating the underlying phonetic representations learned in the course of training [2,3,4]. Interpretable filters with SincNet [5] have been proposed and shown to be capable of removing noise after training.…”
Section: Introduction
confidence: 99%