2020
DOI: 10.1109/lsp.2020.2973798

Unsupervised Feature Learning for Speech Using Correspondence and Siamese Networks

Abstract: In zero-resource settings where transcribed speech audio is unavailable, unsupervised feature learning is essential for downstream speech processing tasks. Here we compare two recent methods for frame-level acoustic feature learning. For both methods, unsupervised term discovery is used to find pairs of word examples of the same unknown type. Dynamic programming is then used to align the feature frames between each word pair, serving as weak top-down supervision for the two models. For the correspondence autoencoder (CAE)…
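
The pipeline the abstract describes (discover word pairs, align their frames with dynamic programming, train on the aligned frame pairs) can be sketched in a few lines. Below is a minimal illustration, not the authors' implementation: a toy dynamic-time-warping alignment pairs up frames from two discovered word examples, and the resulting frame pairs serve as input-output pairs for a small correspondence autoencoder. The feature dimensionality, network sizes, and random stand-in features are all assumptions.

```python
# Minimal sketch (not the paper's code): DTW-aligned frame pairs as weak
# top-down supervision for a correspondence autoencoder (CAE).
import numpy as np
import torch
import torch.nn as nn

def dtw_align(x, y):
    """Dynamic programming alignment of two frame sequences; returns (i, j) pairs."""
    nx, ny = len(x), len(y)
    cost = np.full((nx + 1, ny + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], nx, ny
    while i > 0 and j > 0:  # backtrack along the optimal path
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

dim = 13  # e.g. MFCC dimensionality (assumed)
cae = nn.Sequential(nn.Linear(dim, 100), nn.ReLU(), nn.Linear(100, dim))
opt = torch.optim.Adam(cae.parameters(), lr=1e-3)

# Stand-ins for two discovered word examples of the same unknown type.
word_a = np.random.randn(50, dim).astype(np.float32)
word_b = np.random.randn(60, dim).astype(np.float32)

pairs = dtw_align(word_a, word_b)
src = torch.tensor(np.stack([word_a[i] for i, _ in pairs]))
tgt = torch.tensor(np.stack([word_b[j] for _, j in pairs]))

# One training step: encode a frame from one word, reconstruct its aligned frame.
loss = nn.functional.mse_loss(cae(src), tgt)
opt.zero_grad()
loss.backward()
opt.step()
```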

Cited by 19 publications (16 citation statements)
References 35 publications
“…While similar to voice conversion [7,8], an explicit goal of ZeroSpeech 2019 is to learn low-bitrate representations that perform well on phone discrimination tests. In contrast to work on continuous representation learning [9][10][11][12][13], this encourages participants to find discrete units that correspond to distinct phones. Early approaches to acoustic unit discovery typically combined clustering methods with hidden Markov models [15][16][17][18][19].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Moreover, learning disentangled representations that reflect users' preferences can bring enhanced robustness, interpretability, and controllability. We will, in future, seek to combine different techniques like adversarial training [32] and Siamese networks [45] with disentanglement, or add further constraints grounded in information theory, to improve learning such disentangled representations from users' signals.…”
Section: Discussion and Future Work
Citation type: mentioning
confidence: 99%
“…Distance between output representations for the anchor and its nearby point is trained to be minimized whilst the distance between representations for the anchor and the distant point is maximized. Both CAE and Triamese networks have also been proposed to complement one another as a joint model, referred to as Correspondence Triamese Auto-Encoder (CTAE) (Last, Engelbrecht & Kamper, 2020). CTAE learns to minimize triplet loss in the representation layer whilst using the two similar points from the triplet as input-output pair for reconstruction.…”
Section: Geometric Distance in Speech Representation Learning
Citation type: mentioning
confidence: 99%
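
The joint objective described in that statement is easy to make concrete. The sketch below is an illustration under stated assumptions, not the cited CTAE implementation: a shared encoder is trained with a triplet margin loss that pulls the anchor's representation toward its nearby point and pushes it away from the distant point, while a decoder reconstructs the nearby point from the anchor's representation. The batch of random frames, layer sizes, margin, and equal loss weighting are all illustrative choices.

```python
# Minimal sketch (assumptions throughout): triplet loss on the representation
# layer plus correspondence-style reconstruction between the two similar points.
import torch
import torch.nn as nn

dim, hid = 13, 100  # assumed feature and representation sizes
encoder = nn.Sequential(nn.Linear(dim, hid), nn.ReLU())
decoder = nn.Linear(hid, dim)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
triplet = nn.TripletMarginLoss(margin=0.5)

# anchor/same: aligned frames from the same discovered word type;
# diff: a frame from a different type (random stand-ins here).
anchor, same, diff = torch.randn(32, dim), torch.randn(32, dim), torch.randn(32, dim)

z_a, z_s, z_d = encoder(anchor), encoder(same), encoder(diff)
loss_triplet = triplet(z_a, z_s, z_d)                    # pull same, push diff
loss_recon = nn.functional.mse_loss(decoder(z_a), same)  # anchor reconstructs its pair
loss = loss_triplet + loss_recon                         # equal weighting (assumed)

opt.zero_grad()
loss.backward()
opt.step()
```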