Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models

Khorrami, Khazar; Räsänen, Okko

doi:10.21437/interspeech.2021-496

Cited by 7 publications

(2 citation statements)

References 0 publications

“…VGS usually leverages image-speech [25,26] or video-speech [27,28] paired data. In practice, besides speech-image retrieval and alignment [29,30,31,32,33,34], VGS models has also be shown to achieves competitive performance keyword spotting [35], query-by-example research [36], and varies tasks in the SUPERB benchmark [37,38]. The study of linguistic information learned in VGS models has been attracting increasing attention.…”

Section: Related Work (mentioning)

confidence: 99%

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Puyuan¹,

Harwath²

2022

Interspeech 2022

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.

“…In order to do that, we propose the modality matching approach. Its design is inspired by the cross-modal grounding methods [14,15] and the Barlow twins loss [16]. First, we construct a speech-text correlation matrix:…”

Section: - (mentioning)

confidence: 99%

Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning

Denisov

2020

Interspeech 2020

A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models, however their evaluation often lacks multilingual setup and tasks that require prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7000 hours of multilingual data allows us to outperform the state-of-the-art ultimately on two SLU datasets and partly on two more SLU datasets. Finally, we examine the crosslingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.

Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

Denisov,

2023

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
