Interspeech 2021
DOI: 10.21437/interspeech.2021-678

Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Abstract: Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we emp…
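The evaluation question in the abstract can be made concrete with a small sketch: compute pairwise distances between AWEs and correlate them with a phonological dissimilarity measure. The cosine distance, the normalized Levenshtein distance over phoneme sequences, and the toy data below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: correlate AWE distances with phonological dissimilarity (assumed metrics).
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import cosine

def edit_distance(a, b):
    """Plain Levenshtein distance between two phoneme sequences."""
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1,
                           dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

def awe_phonology_correlation(embeddings, phone_seqs):
    """Spearman correlation between pairwise AWE distances and phoneme edit distances."""
    emb_dists, phon_dists = [], []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            emb_dists.append(cosine(embeddings[i], embeddings[j]))
            norm = max(len(phone_seqs[i]), len(phone_seqs[j]))
            phon_dists.append(edit_distance(phone_seqs[i], phone_seqs[j]) / norm)
    rho, _ = spearmanr(emb_dists, phon_dists)
    return rho

# Toy usage: random embeddings and phoneme strings (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 16))
phones = [["k", "ae", "t"], ["k", "ae", "b"], ["d", "ao", "g"], ["b", "er", "d"], ["f", "ih", "sh"]]
print(awe_phonology_correlation(emb, phones))
```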

Cited by 5 publications (4 citation statements)
References 28 publications

“…We also do not observe the West Slavic group as in the other two models, since Polish was grouped first with Russian, and not Czech. We believe that this unexpected behavior of the CSE models can be related to the previously reported poor performance of the contrastive AWEs in capturing word-form similarity (Abdullah et al., 2021). Moreover, it is interesting to observe that German seems to be the most distant language from the other languages in our study.…”
Section: Cross-lingual Comparison (supporting)
confidence: 52%
“…The phonologically guided encoder (PGE) is a sequence-to-sequence model in which the network is trained as a word-level acoustic model (Abdullah et al., 2021). Given an acoustic sequence A and its corresponding phonological sequence ϕ = (ϕ_1, …
Section: Phonologically Guided Encoder (mentioning)
confidence: 99%
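The excerpt describes a word-level acoustic model whose encoder output serves as the AWE while a decoder predicts the phonological sequence ϕ. The sketch below is one plausible PyTorch realization under assumed architectural choices (GRU layers, mean pooling, dimensionalities); it is not the authors' exact configuration.

```python
# Minimal sketch of a phonologically guided encoder (PGE): an acoustic encoder whose
# pooled output (the AWE) conditions a decoder that predicts the word's phoneme sequence.
# All sizes and the pooling scheme are assumptions for illustration.
import torch
import torch.nn as nn

class PhonologicallyGuidedEncoder(nn.Module):
    def __init__(self, n_feats=39, n_phones=50, emb_dim=128):
        super().__init__()
        self.encoder = nn.GRU(n_feats, emb_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)        # AWE = projected pooled encoder state
        self.phone_emb = nn.Embedding(n_phones, emb_dim)
        self.decoder = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.out = nn.Linear(emb_dim, n_phones)

    def embed(self, acoustics):                             # (B, T, n_feats) -> (B, emb_dim)
        enc_out, _ = self.encoder(acoustics)
        return self.proj(enc_out.mean(dim=1))               # mean-pool over time

    def forward(self, acoustics, phones_in):
        awe = self.embed(acoustics)                         # acoustic word embedding
        dec_in = self.phone_emb(phones_in)                  # teacher-forced phoneme inputs
        dec_out, _ = self.decoder(dec_in, awe.unsqueeze(0)) # condition decoder on the AWE
        return self.out(dec_out), awe                       # phoneme logits + embedding

# Training objective: cross-entropy over the phonological sequence (toy data shown).
model = PhonologicallyGuidedEncoder()
acoustics = torch.randn(4, 80, 39)                          # batch of spoken-word segments
phones = torch.randint(0, 50, (4, 7))                       # corresponding phoneme label ids
logits, awe = model(acoustics, phones[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 50), phones[:, 1:].reshape(-1))
```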
“…We conduct an intrinsic evaluation for the AWEs to assess the performance of our models using the same-different acoustic word discrimination task with the mean average precision (mAP) metric [31,38,39]. Prior work has shown that performance on this task positively correlates with improvement on downstream QbE speech search [32].…”
Section: Results (mentioning)
confidence: 99%
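For readers unfamiliar with the same-different task, the following sketch computes average precision over all embedding pairs ranked by cosine distance, with pairs of the same word type as positives. The function names and toy data are hypothetical, and some papers report a per-query averaged variant rather than this pooled AP.

```python
# Sketch of the same-different acoustic word discrimination evaluation (assumed setup).
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings, word_labels):
    """embeddings: (N, D) AWEs; word_labels: (N,) word-type ids."""
    distances = pdist(embeddings, metric="cosine")            # condensed pairwise distances
    label_diff = pdist(word_labels[:, None], metric="hamming")# 0 if the pair shares a word type
    positives = (label_diff == 0).astype(int)                 # same-word pairs are positives
    return average_precision_score(positives, -distances)     # smaller distance = higher score

# Toy usage with random embeddings (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))
labels = rng.integers(0, 20, size=100).astype(float)
print(same_different_ap(emb, labels))
```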
“…We compare the performance of our proposed model to a strong baseline that explicitly minimizes the distance between exemplars of the same lexical category. The baseline model employs a contrastive triplet loss that has been extensively explored in the AWEs literature with different underlying architectures and has shown strong discriminative performance [9, 30-32]. Given a matching pair of AWEs (x_a, x_+), i.e., embeddings of two exemplars of the same word type, the objective is then to minimize a triplet margin loss…”
Section: Baseline: Contrastive Acoustic Model (mentioning)
confidence: 99%
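The triplet margin objective mentioned in the excerpt can be sketched as follows: given an anchor AWE x_a, a positive x_+ of the same word type, and a negative x_- of a different type, penalize cases where the positive is not closer than the negative by a margin m. The cosine-based distance and the margin value are assumptions for illustration, not the exact formulation in the cited work.

```python
# Sketch of a contrastive triplet margin loss over acoustic word embeddings (assumed distance and margin).
import torch
import torch.nn.functional as F

def triplet_margin_loss(x_a, x_plus, x_minus, margin=0.4):
    d_pos = 1.0 - F.cosine_similarity(x_a, x_plus)    # distance to a same-word exemplar
    d_neg = 1.0 - F.cosine_similarity(x_a, x_minus)   # distance to a different-word exemplar
    return torch.clamp(margin + d_pos - d_neg, min=0.0).mean()

# Toy usage with random tensors standing in for AWEs of spoken word segments.
x_a, x_plus, x_minus = (torch.randn(8, 128) for _ in range(3))
print(triplet_margin_loss(x_a, x_plus, x_minus))
```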