2022
DOI: 10.48550/arxiv.2205.08180
Preprint

SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation

Abstract: We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20 ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10 s), such that the embedding vector space is semantically aligned across different languages. We combine…

Cited by 3 publications (7 citation statements) | References 25 publications
“…Other works explored multilingual and multimodal (speech/text) pre-training methods, including mSLAM (Bapna et al., 2022). Finally, Duquenne et al. (2021), followed by Khurana et al. (2022), introduced multilingual and multimodal sentence embeddings, extending a pre-existing multilingual text sentence embedding space to the speech modality with a distillation approach. Duquenne et al. (2022b, 2023) also showed that it is possible to efficiently decode multilingual speech sentence embeddings into different languages with decoders trained on text sentence embeddings, to perform zero-shot speech translation.…”
Section: Joint Speech/Text Sentence Representations
confidence: 99%
“…Bitext training data is used for this kind of training: the sentence in the new language is encoded with the encoder being trained, while its translation in another supported language is encoded with the pre-existing encoder to produce the target. The same teacher-student approach can be used to extend a text-only multilingual sentence embedding space to the speech modality by training speech encoders (Duquenne et al., 2021; Khurana et al., 2022). These speech encoders can be used to perform speech-to-text or speech-to-speech translation mining (Duquenne et al., 2022a).…”
Section: Introduction
confidence: 99%
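The teacher-student distillation described in the statement above can be sketched as follows. This is a minimal toy illustration, not the actual SAMU-XLSR implementation: real systems pair a frozen multilingual text encoder (e.g. LaBSE) as teacher with a pretrained speech encoder (e.g. XLS-R) as student, whereas both encoders here are stand-in modules and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

EMB_DIM = 16  # toy embedding dimension; real models use 768+

class ToyEncoder(nn.Module):
    """Stand-in encoder: project features, then mean-pool over time
    to obtain a single utterance-level embedding vector."""
    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, EMB_DIM)

    def forward(self, x):
        return self.proj(x).mean(dim=1)

teacher = ToyEncoder(in_dim=8)   # stands in for the frozen text encoder
student = ToyEncoder(in_dim=8)   # stands in for the speech encoder
for p in teacher.parameters():   # teacher stays frozen during distillation
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=0.1)
text = torch.randn(4, 10, 8)     # (batch, tokens, features): "transcripts"
speech = torch.randn(4, 50, 8)   # (batch, frames, features): "audio"

losses = []
for _ in range(50):
    target = teacher(text)       # fixed target sentence embedding
    pred = student(speech)       # student utterance embedding
    # Cosine distance pulls the speech embedding toward the text
    # embedding, aligning the two modalities in one vector space.
    loss = (1 - nn.functional.cosine_similarity(pred, target)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because only the student receives gradients, the pre-existing text embedding space is left intact and the speech modality is pulled into it, which is what makes the learned space usable for cross-modal retrieval and mining.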
“…The first approach, S-HuBERT, is to transfer the knowledge of a well-trained text embedding model to a speech embedding model (Khurana et al., 2022), in which pretrained supervised textual embeddings serve as the teacher models and speech models are trained to align with the text embeddings in the same latent space.…”
Section: S-HuBERT
confidence: 99%
“…In recent works, it has been shown that spoken-sentence semantic similarities can be learned via visually grounded speech models (Merkx et al., 2021). Multilingual spoken sentence embeddings can also be learned by using supervised multilingual text models as teacher models (Khurana et al., 2022). These methods rely to varying degrees on labeled data such as speech-image pairs or multilingual sentence pairs.…”
Section: Introduction
confidence: 99%
“…(Fan et al., 2020). Finally, existing sentence embedding spaces can be extended to new languages (Reimers and Gurevych, 2020; Heffernan et al., 2022) or to the speech modality (Duquenne et al., 2021; Khurana et al., 2022) with knowledge distillation, also called the teacher-student approach. These multilingual and multimodal sentence embeddings enabled large-scale speech-text mining, as well as speech-speech mining for a small set of languages.…”
Section: Introduction
confidence: 99%
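The mining step mentioned above can be sketched in a few lines. This is a hypothetical simplification: given utterance embeddings that share one semantic space, candidate translation pairs are taken as mutual nearest neighbours under cosine similarity. Production pipelines use margin-based scoring and approximate nearest-neighbour search over very large corpora; the function and variable names here are illustrative only.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mine_pairs(speech_emb, text_emb):
    """Return (speech_idx, text_idx) pairs that are mutual nearest
    neighbours in the shared embedding space."""
    sim = normalize(speech_emb) @ normalize(text_emb).T
    s2t = sim.argmax(axis=1)   # best text match for each speech utterance
    t2s = sim.argmax(axis=0)   # best speech match for each text sentence
    # Keep a pair only if each side is the other's nearest neighbour.
    return [(i, j) for i, j in enumerate(s2t) if t2s[j] == i]

# Toy demo: text embeddings are noisy copies of the speech embeddings,
# so mining should recover the identity alignment.
rng = np.random.default_rng(0)
speech = rng.normal(size=(5, 16))
text = speech + 0.01 * rng.normal(size=(5, 16))
pairs = mine_pairs(speech, text)
```

The mutual-nearest-neighbour filter is what keeps precision high: a one-directional argmax would also emit pairs for utterances that have no true counterpart in the other corpus.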