2021
DOI: 10.48550/arxiv.2107.04734
Preprint

Layer-wise Analysis of a Self-supervised Speech Representation Model

Abstract: Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. …

Cited by 9 publications (12 citation statements) | References 31 publications
“…We conducted a layer-difference analysis study to determine which W2V2 layer contributes the most to the SER task. [20] examined the layer-specific information in W2V2 intermediate speech representations and found that various acoustic and linguistic properties are encoded in different layers. As such, we compared the hidden-state outputs from the first layer (initial embeddings), the middle layer, and the final layer, and found a surprising difference in SER performance.…”
Section: Experiments and Results
confidence: 99%
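For concreteness, the kind of layer comparison described in the statement above can be sketched with the HuggingFace transformers implementation of wav2vec 2.0. This is an illustrative sketch, not the citing paper's code: the mean-pooling choice and the layer indices for "first", "middle", and "final" are assumptions, and the waveform is a random placeholder.

```python
# Sketch: extract per-layer hidden states from wav2vec 2.0 (HuggingFace
# transformers) and mean-pool each layer into one utterance-level vector,
# e.g. as candidate input features for an SER classifier.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

waveform = torch.randn(1, 16000)  # placeholder: 1 s of 16 kHz audio

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

# hidden_states is a tuple: the feature-encoder output plus one entry per
# transformer layer (13 entries for the 12-layer base model).
hidden_states = outputs.hidden_states
first = hidden_states[1].mean(dim=1)    # first transformer layer
middle = hidden_states[6].mean(dim=1)   # a middle layer
final = hidden_states[-1].mean(dim=1)   # final layer
print(first.shape, middle.shape, final.shape)  # torch.Size([1, 768]) each
```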
“…To get a finer-grained understanding of how linguistic representations evolve across these models, we turned away from the fMRI data and instead compared each layer directly to known linguistic features. Inspired by prior work in computer vision (Alain & Bengio, 2017), natural language processing (Ettinger et al., 2016; Shi et al., 2016) and speech (Pasad et al., 2021; Yang et al., 2021), we did this by linearly probing each layer's representations for spectral features (FBANK), spectrotemporal features, phoneme identity, and word identity.…”
Section: Probing SSL Models for Linguistic Structure
confidence: 99%
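A linear probe of the sort this statement describes can be sketched as below. The feature and label arrays are random placeholders standing in for frame-level layer representations and phoneme labels, so this shows only the shape of the analysis, not any cited result.

```python
# Sketch of a linear probe: fit logistic regression on frozen layer
# representations and report held-out accuracy. layer_feats and
# phone_labels are hypothetical placeholders for features/labels that
# would be extracted from a pre-trained model beforehand.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
layer_feats = rng.normal(size=(5000, 768))       # stand-in layer features
phone_labels = rng.integers(0, 40, size=5000)    # stand-in phoneme IDs

X_tr, X_te, y_tr, y_te = train_test_split(
    layer_feats, phone_labels, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```

Probe accuracy per layer then serves as a proxy for how linearly accessible a property (here, phoneme identity) is at that depth.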
“…Our experiment comprised human participants passively listening to English-language narrative stories while their whole-brain fMRI BOLD activity was being recorded. Prior work on LM-based language encoding models has found performance differences across LM layers (Jain & Huth, 2018; Toneva & Wehbe, 2019), and layer-wise analyses of SSL models have shown that they capture different types of acoustic information (Pasad et al., 2021). Motivated by these findings, we built separate encoding models for each layer in four SSL models: APC, wav2vec, wav2vec 2.0, and HuBERT (Table 1).…”
Section: Introduction
confidence: 99%
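A minimal sketch of a layer-wise encoding model in this spirit (ridge regression from a layer's features to one voxel's BOLD response, scored per layer) follows. All data here are simulated placeholders; a real pipeline would additionally align features to the fMRI TR, apply hemodynamic delays, and cross-validate the ridge penalty.

```python
# Sketch of layer-wise encoding models: for each layer, regress simulated
# voxel responses on that layer's features with ridge regression and
# compare cross-validated R^2 across layers. Random placeholders only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trs, dim, n_layers = 300, 768, 13
bold = rng.normal(size=n_trs)  # one voxel's simulated BOLD time course

for layer in range(n_layers):
    feats = rng.normal(size=(n_trs, dim))  # stand-in per-layer features
    r2 = cross_val_score(Ridge(alpha=10.0), feats, bold,
                         cv=5, scoring="r2").mean()
    print(f"layer {layer:2d}: mean CV R^2 = {r2:.3f}")
```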
“…Secondly, a very recent study reveals that, similar to supervised learning, SSL too becomes biased toward the domain from which the unlabelled data originates [11]. Thirdly, since SSL implicitly learns a language model and other semantic information through the tasks it is trained to solve [12], these models generalize only to the extent that data from a similar language or phonetic structure is introduced at fine-tuning. Thus, as correctly pointed out by [13], SSL for speech suffers from problems of scale, and SSL generalizability can be improved with more efficient training procedures.…”
Section: Introduction
confidence: 99%