Interspeech 2020
DOI: 10.21437/Interspeech.2020-3084
A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Abstract: Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation in which the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose Convolutional …


Cited by 25 publications (13 citation statements)
References 10 publications
“…Reconstruction is often implemented in the form of auto-encoding (van den Oord et al., 2017), where speech is first encoded into a low-dimensional space and then decoded back to speech. Various constraints can be imposed on the encoded space, such as temporal smoothness (Ebbers et al., 2017; Glarner et al., 2018; Khurana et al., 2020), discreteness (Ondel et al., 2016; van den Oord et al., 2017), presence of hierarchy (Hsu et al., 2017), and information bottlenecks for speech representation decomposition (Qian et al., 2020, 2021). Prediction-based approaches task a model with predicting information of unseen speech based on its context.…”
Section: Related Work (Speech Emotion Conversion), mentioning
confidence: 99%
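The auto-encoding recipe in the statement above is easy to make concrete. Below is a minimal sketch (PyTorch; all module names, layer sizes, and the smoothness weight are illustrative assumptions, not taken from any cited paper) of encoding speech features into a low-dimensional latent sequence, decoding back, and penalizing frame-to-frame latent jumps as a temporal-smoothness constraint:

```python
import torch
import torch.nn as nn

class SpeechAutoEncoder(nn.Module):
    """Hypothetical auto-encoder over log-mel frames (illustrative only)."""
    def __init__(self, n_mels=80, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, n_mels))

    def forward(self, x):              # x: (batch, time, n_mels)
        z = self.encoder(x)            # low-dimensional latent sequence
        return self.decoder(z), z

def loss_fn(x, x_hat, z, smooth_weight=0.1):
    recon = ((x - x_hat) ** 2).mean()                # reconstruction term
    smooth = ((z[:, 1:] - z[:, :-1]) ** 2).mean()    # temporal-smoothness penalty
    return recon + smooth_weight * smooth
```

A discreteness constraint (e.g., vector quantization as in van den Oord et al., 2017) would instead replace the smoothness penalty with a codebook lookup on z.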
“…This makes ACPC somewhat similar to an online variant of pseudo-labeling [21]: the prediction network produces pseudo-labels which, after a force-alignment, serve as targets for the encoder. [22] can be seen as another top-down approach, one that models Markov-based latent transitions and emissions with neural networks; in contrast, ACPC does not explicitly model the representation dynamics.…”
Section: Related Work, mentioning
confidence: 99%
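As a rough illustration of the online pseudo-labeling idea the statement mentions, the sketch below has a prediction head assign discrete pseudo-labels to frames, which then serve as classification targets for the encoder. This is a loose, assumed reading: the force-alignment step is elided, and all names and sizes are hypothetical rather than ACPC's actual procedure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(80, 64)      # stand-in for a real speech encoder
predictor = nn.Linear(64, 50)    # prediction head over 50 pseudo-classes

def pseudo_label_step(x):        # x: (batch, time, 80) speech features
    h = encoder(x)                               # frame-level representations
    logits = predictor(h)                        # (batch, time, n_classes)
    with torch.no_grad():
        targets = logits.argmax(dim=-1)          # pseudo-labels, no gradient
    # the pseudo-labels supervise the encoder, as in online self-training
    return F.cross_entropy(logits.transpose(1, 2), targets)
```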
“…Related stochastic sequential neural models were reported by Fraccaro et al. (2016) and Chung et al. (2015). Published applications of DMMs include natural language processing tasks (Khurana et al., 2020), inference of time series data (Zhi-Xuan et al., 2020), and human pose forecasting (Toyer et al., 2017). Our application of the DMM and the modifications made to the standard model will be described in Section 3.3.…”
Section: Deep Markov Models, mentioning
confidence: 99%
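The Deep Markov Model structure that both this statement and the one above refer to, Gaussian latent transitions z_t | z_{t-1} and emissions x_t | z_t parameterized by neural networks, can be sketched as follows (sizes and names are illustrative assumptions; this is the generic DMM generative pass, not the exact ConvDMM of the indexed paper):

```python
import torch
import torch.nn as nn

class DeepMarkovModel(nn.Module):
    """Generic DMM generative model (illustrative sketch)."""
    def __init__(self, latent_dim=16, obs_dim=80, hidden=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.trans = nn.Sequential(   # z_t | z_{t-1}: mean and log-variance
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        self.emit = nn.Sequential(    # x_t | z_t: observation mean
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim))

    def sample_sequence(self, T, batch=1):
        z = torch.zeros(batch, self.latent_dim)     # initial latent state
        xs = []
        for _ in range(T):
            mu, logvar = self.trans(z).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # z_t ~ N(mu, sigma^2)
            xs.append(self.emit(z))                 # observation mean for x_t
        return torch.stack(xs, dim=1)               # (batch, T, obs_dim)
```

Training such a model typically pairs this generative pass with an amortized inference network and a variational ELBO objective, in line with the VAE framing mentioned in the abstract.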