ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
DOI: 10.1109/icassp40776.2020.9054675

What Does a Network Layer Hear? Analyzing Hidden Representations of End-to-End ASR Through Speech Synthesis

Abstract: End-to-end speech recognition systems have achieved competitive results compared to traditional systems. However, the complex transformations involved between layers given highly variable acoustic signals are hard to analyze. In this paper, we present our ASR probing model, which synthesizes speech from hidden representations of end-to-end ASR to examine the information maintained after each layer calculation. Listening to the synthesized speech, we observe gradual removal of speaker variability and noise as the…

Citations: cited by 27 publications (17 citation statements)
References: 16 publications
“…One notable exception is [17], which finds that the hidden layers of these networks outperform traditional features for a range of tasks, including speaker identification, emotion classification, and speech-to-text. [18] found that the learned features become progressively more abstract at higher layers, normalizing for dimensions such as speaker, channel, and environmental conditions. In response to this sparse landscape, we asked the following questions:…”
Section: Introduction (mentioning)
confidence: 98%
“…In the audio domain such an approach using transfer learning was applied to study the capability of the layers of a pre-trained network to extract meaningful information for music classification and regression [51]. The role of each layer in an end-to-end speech recognition systems has been studied in [52]. The main idea is to synthesize speech signals from the hidden representations of each layer.…”
Section: Relation With Previous Work (mentioning)
confidence: 99%
“…Elloumi et al (2018) use auxiliary classifiers to predict the underlying style of speech as being spontaneous or non-spontaneous and as having a native or non-native accent; their main task was to predict the performance of an ASR system on unseen broadcast programs. Analogous to saliency maps used to analyze images, Li et al (2020) propose reconstructing speech from the hidden representations at each layer using highway networks. Apart from ASR, analysis techniques have also been used with speaker embeddings for the task of speaker recognition (Wang et al, 2017).…”
Section: Analysis of ASR Models (mentioning)
confidence: 99%