2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru.2017.8269008

An embedded segmental K-means model for unsupervised segmentation and clustering of speech

Abstract: Unsupervised segmentation and clustering of unlabelled speech are core problems in zero-resource speech processing. Most approaches lie at methodological extremes: some use probabilistic Bayesian models with convergence guarantees, while others opt for more efficient heuristic techniques. Despite competitive performance in previous work, the full Bayesian approach is difficult to scale to large speech corpora. We introduce an approximation to a recent Bayesian model that still has a clear objective function bu…

Cited by 76 publications (83 citation statements)
References 39 publications
“…Speech audio is parametrised as D = 13 dimensional static Mel-frequency cepstral coefficients (MFCCs). We use an embedding dimensionality of M = 130 throughout, since downstream systems such as the segmentation and clustering system of [8] are constrained to embedding sizes of this order. All encoder-decoder models have 3 encoder and 3 decoder unidirectional RNN layers, each with 400 units.…”
Section: Methods
confidence: 99%
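The quoted encoder configuration (3 unidirectional RNN layers of 400 units over 13-dimensional MFCC frames, projected to an M = 130 embedding) can be sketched in pure NumPy. This is a minimal, hypothetical illustration of the stated dimensions only, not the authors' implementation; the weight scales, the tanh nonlinearity, and the 57-frame utterance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_layer(x, W_in, W_rec, b):
    """Run a simple tanh RNN over x of shape (T, d_in); return (T, d_hid)."""
    T = x.shape[0]
    h = np.zeros(W_rec.shape[0])
    out = np.empty((T, W_rec.shape[0]))
    for t in range(T):
        h = np.tanh(x[t] @ W_in + h @ W_rec + b)
        out[t] = h
    return out

D, H, M = 13, 400, 130   # MFCC dim, RNN units, embedding dim (from the quote)
x = rng.standard_normal((57, D))  # an arbitrary 57-frame utterance

# Three stacked unidirectional RNN layers, matching the quoted encoder depth.
for d_in in (D, H, H):
    W_in = rng.standard_normal((d_in, H)) * 0.01
    W_rec = rng.standard_normal((H, H)) * 0.01
    b = np.zeros(H)
    x = rnn_layer(x, W_in, W_rec, b)

# Final hidden state projected to the M = 130-dimensional embedding.
W_emb = rng.standard_normal((H, M)) * 0.01
embedding = x[-1] @ W_emb
print(embedding.shape)  # (130,)
```

A real encoder-decoder would be trained (e.g. to reconstruct the input frames), with the decoder mirroring this structure; the sketch only shows how a variable-length MFCC sequence is reduced to a fixed 130-dimensional vector.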
“…For all models we use an embedding dimensionality of M = 130, to be directly comparable to the downsampling baseline. More importantly, although other studies consider higherdimensional settings, downstream systems such as [14] are constrained to embedding sizes of this order. Neural network architectures were optimised on the English validation data.…”
Section: Experimental Setup and Evaluation
confidence: 99%
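The downsampling baseline referenced above can be sketched as follows — a minimal, hypothetical illustration, not the cited system: a variable-length sequence of D = 13 MFCC frames is uniformly sampled at 10 time points and flattened into a fixed M = 130-dimensional embedding, which is the embedding size both quotes say downstream systems are constrained to.

```python
import numpy as np

def downsample_embed(mfccs: np.ndarray, n_samples: int = 10) -> np.ndarray:
    """Embed a variable-length (T, D) MFCC sequence as a fixed-size vector
    by uniformly sampling n_samples frames and flattening them.

    With D = 13 MFCCs and n_samples = 10 this gives M = 130 dimensions,
    matching the embedding size used in the quoted experiments.
    """
    T, D = mfccs.shape
    # Indices of n_samples frames spread uniformly over the sequence.
    idx = np.linspace(0, T - 1, n_samples).round().astype(int)
    return mfccs[idx].reshape(-1)  # shape: (n_samples * D,)

# Example: an arbitrary 57-frame utterance of 13-dimensional MFCCs.
emb = downsample_embed(np.random.default_rng(0).standard_normal((57, 13)))
print(emb.shape)  # (130,)
```

Because the sampled indices are spread over the whole utterance, sequences of any length (even shorter than 10 frames, via repeated indices) map to the same fixed dimensionality.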
“…The present paper has a strong connection to recent work on unsupervised speech processing, especially the Zerospeech 2015 (Versteegh et al., 2015) and 2017 (Dunbar et al., 2017) shared tasks. Participating systems (Badino et al., 2015; Renshaw et al., 2015; Agenbag and Niesler, 2015; Baljekar et al., 2015; Räsänen et al., 2015; Lyzinski et al., 2015; Zeghidour et al., 2016; Heck et al., 2016; Srivastava and Shrivastava, 2016; Kamper et al., 2017b; Yuan et al., 2017; Heck et al., 2017; Shibata et al., 2017; Ansari et al., 2017a,b) perform unsupervised ABX discrimination and/or spoken term discovery on the basis of unlabeled speech alone. The design and evaluation of these and related systems (Kamper et al., 2017a; Elsner and Shain, 2017; Räsänen et al., 2018) are oriented toward word-level modeling.…”
Section: Unsupervised Speech Processing
confidence: 99%