Interspeech 2021
DOI: 10.21437/interspeech.2021-496
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models

Cited by 7 publications (2 citation statements)
References 0 publications
“…VGS usually leverages image-speech [25,26] or video-speech [27,28] paired data. In practice, besides speech-image retrieval and alignment [29,30,31,32,33,34], VGS models have also been shown to achieve competitive performance in keyword spotting [35], query-by-example search [36], and various tasks in the SUPERB benchmark [37,38]. The study of linguistic information learned in VGS models has been attracting increasing attention.…”
Section: Related Work (mentioning)
Confidence: 99%
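The retrieval-and-alignment setting mentioned in this statement is typically implemented by embedding both modalities into a shared space and ranking by similarity. The sketch below is a minimal, hypothetical illustration of that recipe in PyTorch; the function names, embedding sizes, and recall metric are assumptions for illustration, not taken from the cited papers.

```python
# A minimal sketch of speech-image retrieval in a VGS-style setup, assuming
# precomputed utterance embeddings and image embeddings of shape (N, d);
# encoders and training are omitted. All names here are illustrative.
import torch
import torch.nn.functional as F

def retrieval_scores(speech_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matrix; entry (i, j) scores utterance i against image j."""
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    return s @ v.t()

def recall_at_k(scores: torch.Tensor, k: int = 10) -> float:
    """Speech-to-image recall@k, assuming pair i shares row/column index i."""
    topk = scores.topk(k, dim=1).indices                  # (N, k) best image indices
    targets = torch.arange(scores.size(0)).unsqueeze(1)   # ground-truth index per row
    return (topk == targets).any(dim=1).float().mean().item()

# Toy usage with random embeddings
S, V = torch.randn(32, 512), torch.randn(32, 512)
print(recall_at_k(retrieval_scores(S, V), k=5))
```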
“…In order to do that, we propose the modality matching approach. Its design is inspired by the cross-modal grounding methods [14,15] and the Barlow Twins loss [16]. First, we construct a speech-text correlation matrix:…”
Section: - (mentioning)
Confidence: 99%
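The quoted construction is truncated at the correlation matrix, so the citing paper's exact formulation is not recoverable here. As a hedged illustration only, the sketch below follows the generic Barlow Twins recipe [16] applied cross-modally: standardize each embedding dimension over the batch, form the D x D speech-text correlation matrix, then pull its diagonal toward 1 while suppressing off-diagonal terms. The function name and the lam weight are assumptions.

```python
# A Barlow-Twins-style cross-modal objective, assuming batch-aligned speech
# and text embeddings of equal dimension. This illustrates the general
# recipe only; it is not the quoted paper's "modality matching" as written.
import torch

def barlow_twins_cross_modal(speech_z: torch.Tensor,
                             text_z: torch.Tensor,
                             lam: float = 5e-3) -> torch.Tensor:
    """speech_z, text_z: (N, D) paired embeddings from the two modalities."""
    n, _ = speech_z.shape
    # Standardize each embedding dimension across the batch.
    s = (speech_z - speech_z.mean(0)) / (speech_z.std(0) + 1e-6)
    t = (text_z - text_z.mean(0)) / (text_z.std(0) + 1e-6)
    c = (s.t() @ t) / n                                   # (D, D) speech-text correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()        # pull matched dimensions toward 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + lam * off_diag

# Toy usage with random embeddings
print(barlow_twins_cross_modal(torch.randn(64, 256), torch.randn(64, 256)))
```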