Interspeech 2021
DOI: 10.21437/interspeech.2021-461

Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language

Abstract: Acoustic word embedding models map variable-duration speech segments to fixed-dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tun…
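As a rough illustration of the embedding idea summarised in the abstract (not the authors' specific architecture), the sketch below shows one common way to map a variable-duration sequence of acoustic features to a fixed-dimensional vector with a recurrent encoder; the feature, hidden, and embedding sizes are illustrative assumptions.

import torch
import torch.nn as nn

class AcousticWordEncoder(nn.Module):
    # Minimal sketch: encode a variable-length sequence of acoustic features
    # (e.g. MFCC frames) into one fixed-dimensional acoustic word embedding.
    def __init__(self, feat_dim=13, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, feats):
        # feats: (batch, n_frames, feat_dim); n_frames varies per segment
        _, h = self.rnn(feats)                # final hidden states, (2, batch, hidden_dim)
        h = torch.cat([h[0], h[1]], dim=-1)   # join the two directions
        return self.proj(h)                   # (batch, embed_dim), fixed size

# A 52-frame and a 78-frame segment both map to 128-dimensional vectors.
encoder = AcousticWordEncoder()
e1 = encoder(torch.randn(1, 52, 13))
e2 = encoder(torch.randn(1, 78, 13))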

Cited by 8 publications (6 citation statements)
References 52 publications

Citation statements (ordered by relevance):
“…A recent study on zero-resource AWEs has shown that cross-lingual transfer is more successful when the source (L1) and target (L2) languages are more related (Jacobs and Kamper, 2021). We conducted a preliminary experiment on the cross-lingual word discrimination performance of the models in our study and observed a similar effect.…”
Section: Implications On Cross-lingual Transfer (supporting)
confidence: 58%
“…We conduct an intrinsic evaluation for the AWEs to assess the performance of our models using the same-different acoustic word discrimination task with the mean average precision (mAP) metric [31,38,39]. Prior work has shown that performance on this task positively correlates with improvement on downstream QbE speech search [32]. This task evaluates the ability of the model to determine whether two given speech segments correspond to the same word type, that is, whether or not two acoustic segments are exemplars of the same category.…”
Section: Results (mentioning)
confidence: 99%
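For illustration, the sketch below computes a simplified version of the same-different evaluation described in the statement above: all segment pairs are ranked by embedding distance and average precision is computed with same-word pairs as positives. The full evaluation in the literature additionally distinguishes same-speaker and cross-speaker pairs; the function and variable names here are assumptions.

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings, word_labels):
    # embeddings: (n_segments, embed_dim) array; word_labels: word type per segment
    distances = pdist(embeddings, metric="cosine")  # one distance per segment pair
    same_word = np.array([int(a == b) for a, b in combinations(word_labels, 2)])
    # Smaller distance should mean "same word", so negate distances as scores.
    return average_precision_score(same_word, -distances)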
“…We compare the performance of our proposed model to a strong baseline that explicitly minimizes the distance between exemplars of the same lexical category. The baseline model employs a contrastive triplet loss that has been extensively explored in the AWEs literature with different underlying architectures and has shown strong discriminative performance [9, 30, 31, 32]. Given a matching pair of AWEs (x_a, x_+), i.e., embeddings of two exemplars of the same word type, the objective is then to minimize a triplet margin loss…”
Section: Baseline: Contrastive Acoustic Model (mentioning)
confidence: 99%
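The quoted statement truncates the loss definition. One common form of such a contrastive triplet margin objective over acoustic word embeddings is sketched below; it is not necessarily the exact formulation used in the cited work, and the cosine distance and margin value are assumptions.

import torch
import torch.nn.functional as F

def triplet_margin_loss(x_a, x_pos, x_neg, margin=0.4):
    # Pull the anchor towards an embedding of the same word type (x_pos) and
    # push it away from a different word type (x_neg) by at least the margin.
    d_pos = 1.0 - F.cosine_similarity(x_a, x_pos)  # distance to matching word
    d_neg = 1.0 - F.cosine_similarity(x_a, x_neg)  # distance to mismatched word
    return torch.clamp(margin + d_pos - d_neg, min=0.0).mean()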
“…In addition, we also explore pooling functions with trainable parameters, such as in [6,7,10]. We follow [9,10,19] and train the pooling function g with a contrastive loss. Specifically, we use NT-Xent [20], which is defined as…”
Section: Task Overview (mentioning)
confidence: 99%
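The NT-Xent definition is also cut off in the extracted statement. A standard formulation (as used in SimCLR-style contrastive learning) is sketched below, where each embedding's positive is its paired view and every other embedding in the batch serves as a negative; the temperature value is an illustrative assumption.

import torch
import torch.nn.functional as F

def ntxent_loss(z1, z2, temperature=0.1):
    # z1[i] and z2[i] are embeddings of a positive pair (e.g. two segments of
    # the same word); all other embeddings in the batch act as negatives.
    z = F.normalize(torch.cat([z1, z2]), dim=-1)   # (2N, d), unit length
    sim = z @ z.t() / temperature                  # scaled pairwise similarities
    sim.fill_diagonal_(float("-inf"))              # a segment is not its own pair
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])  # index of positive
    return F.cross_entropy(sim, targets)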
“…Most previous work on constructing unsupervised AWEs has approached the problem using learned pooling, where positive training pairs of similar speech segments (assumed to be the same word or n-gram) are used to learn a pooling function, based on a reconstruction [6,7,8] or contrastive [9,10] objective. Despite good AWE quality, these methods rely on identifying positive training pairs from a corpus using k-nearest-neighbors methods [10,11].…”
Section: Introduction (mentioning)
confidence: 99%
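As a rough sketch of the k-nearest-neighbour pair mining mentioned in the statement above (details differ across the cited methods; the function name and the use of scikit-learn here are assumptions), positive pairs can be proposed by pairing each segment with its nearest neighbours in an initial embedding space.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mine_positive_pairs(segment_embeddings, k=5):
    # Pair each speech segment with its k nearest neighbours and treat the
    # pairs as putative same-word (or same n-gram) positives for training.
    knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(segment_embeddings)
    _, idx = knn.kneighbors(segment_embeddings)  # idx[:, 0] is the segment itself
    return [(i, j) for i, row in enumerate(idx) for j in row[1:]]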