2022
DOI: 10.48550/arxiv.2206.02211
Preprint

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Abstract: The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a seque…

Cited by 2 publications (2 citation statements)
References: 19 publications
“…It was later extended by (Bhati et al, 2021) to segment the sequence of speech frames dynamically. Recently, (Cuervo et al, 2022) introduced a hierarchical sequence processing model in which units in the upper layer operate on a dynamically shortened sequence, with the shortening guided by a boundary prediction model. (Rocki et al, 2016) control the activity of LSTM gates with the model's output cross-entropy.…”
Section: Boundary Detection (citation type: mentioning)
confidence: 99%
“…Further, a language model is utilized with beam search to decode the outputs of the acoustic model. Interestingly, the discrete representations enable the unsupervised discovery of acoustic units where phonemes are automatically mapped to a small set of discrete representations, enabling phoneme discovery and segmentation [54, 55, 56, 57]. This resulting property of automatic discovery of ground truth phonemes is of particular interest, as we hypothesize that it allows us to derive the atomic units of human movements from wearable sensor data by learning a mapping of discrete representations to spans of sensor data.…”
Section: Introduction (citation type: mentioning)
confidence: 99%