2018
DOI: 10.1587/transinf.2017edp7175
Learning Supervised Feature Transformations on Zero Resources for Improved Acoustic Unit Discovery

Abstract: In this work we utilize feature transformations that are common in supervised learning without having prior supervision, with the goal of improving Dirichlet process Gaussian mixture model (DPGMM) based acoustic unit discovery. The motivation for using such transformations is to create feature vectors that are more suitable for clustering. The need for labels makes these methods difficult to use in a zero-resource setting. To overcome this issue we utilize a first iteration of DPGMM clustering t…
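The abstract suggests a simple two-pass recipe: cluster once without labels, treat the cluster IDs as pseudo-labels, fit a supervised transform on them, and re-cluster in the transformed space. Below is a minimal sketch of that idea, not the authors' implementation: scikit-learn's BayesianGaussianMixture (a truncated Dirichlet-process approximation) stands in for the paper's DPGMM sampler, LDA stands in for the supervised transformation, and the feature-loading step is assumed to exist elsewhere.

```python
# Hypothetical two-pass sketch of the pipeline described in the abstract.
# Assumptions not taken from the paper: BayesianGaussianMixture replaces the
# authors' DPGMM sampler, and LDA is the "supervised" transformation trained
# on pseudo-labels from the first clustering pass.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def make_dpgmm(max_components: int = 50) -> BayesianGaussianMixture:
    """Truncated DP mixture: unused components get near-zero weight."""
    return BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",
        max_iter=200,
        random_state=0,
    )


def discover_units(frames: np.ndarray, lda_dim: int = 13) -> np.ndarray:
    """frames: (n_frames, n_dims) acoustic features, e.g. MFCCs.

    Returns frame-level posteriorgrams over the discovered units.
    """
    # Pass 1: unsupervised clustering; cluster IDs act as pseudo-labels.
    pseudo_labels = make_dpgmm().fit_predict(frames)

    # Fit the supervised transform on the pseudo-labels. LDA supports at
    # most min(n_classes - 1, n_features) output dimensions, so clamp.
    n_classes = len(np.unique(pseudo_labels))
    lda = LinearDiscriminantAnalysis(
        n_components=min(lda_dim, n_classes - 1, frames.shape[1]))
    transformed = lda.fit(frames, pseudo_labels).transform(frames)

    # Pass 2: re-cluster in the transformed space, where frames belonging
    # to the same unit should be more compact and better separated.
    dpgmm2 = make_dpgmm().fit(transformed)
    return dpgmm2.predict_proba(transformed)
```

The pseudo-label trick generalizes to other supervised transforms; LDA is used here purely for illustration.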

Cited by 10 publications (4 citation statements)
References 20 publications
“…While similar to voice conversion [7,8], an explicit goal of ZeroSpeech 2019 is to learn low-bitrate representations that perform well on phone discrimination tests. In contrast to work on continuous representation learning [9][10][11][12][13], this encourages participants to find discrete units that correspond to distinct phones. Early approaches to acoustic unit discovery typically combined clustering methods with hidden Markov models [15][16][17][18][19].…”
Section: Introduction (mentioning)
confidence: 99%
“…To address this and allow the learning of improved frame-level acoustic features, we build on recent work in "zero-resource" speech processing, where the goal is to learn robust feature representations without access to any labelled speech data (Versteegh et al., 2016; Dunbar et al., 2017, 2019). Various features and learning approaches have been considered, ranging from conventional speech features (Carlin et al., 2011; Vavrek et al., 2012; Lopez-Otero et al., 2016), to posteriorgrams from probabilistic mixture models (Zhang and Glass, 2009; Heck et al., 2017; Heck et al., 2018), to latent representations computed by neural networks (Badino et al., 2015; Renshaw et al., 2015; Zeghidour et al., 2016; Riad et al., 2018; Eloff et al., 2019). Among these, multilingual bottleneck feature (BNF) extractors, trained on well-resourced but out-of-domain languages, have been found by several authors to improve on the performance of MFCCs and other representations (Veselý et al., 2012; Vu et al., 2012; Thomas et al., 2012; Cui et al., 2015; Alumäe et al., 2016; Chen et al., 2017; Yuan et al., 2017; Hermann and Goldwater, 2018; Hermann et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…The goal in unsupervised representation learning of phone units is to learn features which capture phonetic contrasts while being invariant to properties like the speaker or channel. Early approaches focussed on learning continuous features [6][7][8][9][10]. In an attempt to better match the categorical nature of true phonetic units, more recent work has considered discrete representations [11][12][13][14][15][16][17].…”
Section: Introduction (mentioning)
confidence: 99%