Interspeech 2020
DOI: 10.21437/interspeech.2020-3024

Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Abstract: Semantically-aligned (speech, image) datasets can be used to explore "visually-grounded speech". In a majority of existing investigations, features of an image signal are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pretrained feature extraction, previous results have tended to show low rates o…

Citations: cited by 6 publications (4 citation statements)
References: 26 publications
“…Starting from the work of Synnaeve, Versteegh, and Dupoux (2014); Harwath and Glass (2015), researchers have studied the ability of models to learn to recognize the structure of spoken language, such as words and sub-word units, by training the models to associate speech waveforms with contextually relevant visual inputs. These works have looked at a variety of tasks, such as speech-image retrieval (Harwath, Torralba, and Glass 2016; Chrupała 2019; Ilharco, Zhang, and Baldridge 2019; Mortazavi 2020; Sanabria, Waters, and Baldridge 2021), automatic speech recognition (Sun, Harwath, and Glass 2016; Palaskar, Sanabria, and Metze 2018; Hsu, Harwath, and Glass 2019), word detection and localization (Kamper et al 2017; Harwath and Glass 2017; Merkx, Frank, and Ernestus 2019; Wang and Hasegawa-Johnson 2020; Olaleye and Kamper 2021), hierarchical linguistic unit analysis (Chrupała, Gelderloos, and Alishahi 2017; Harwath, Hsu, and Glass 2020), cross-modality alignment (Wang et al 2021; Khorrami and Räsänen 2021), speech segmentation, speech generation (Hsu et al 2021b), and learning multilingual speech representations (Harwath, Chuang, and Glass 2018; Kamper and Roth 2018; Havard, Chevrot, and Besacier 2020; Ohishi et al 2020). In this paper, we study the recently proposed FaST-VGS (Peng and Harwath 2021) speech-image retrieval model, and propose a novel extension of the model that incorporates a wav2vec2.0-style masked language modeling objective in a multi-task learning framework.…”
Section: Related Work (mentioning)
confidence: 99%
“…Several works [2,3,15] demonstrate the ability to learn semantic relationships between objects in images and the spoken words describing them using only the pairing between images and spoken captions as supervision. Using this framework, researchers have proposed improved image encoders, audio encoders, and loss functions [4][5][6][7][8][16][17][18][19][20]. Harwath et al [3,4,21] collected 400k spoken audio captions of images in the Places205 [22] dataset in English, which is one of the largest spoken caption datasets.…”
Section: Related Work (mentioning)
confidence: 99%
“…Although retrieval accuracy was often used as an evaluation benchmark to assess how well a model can predict visual semantics directly from a raw speech signal, in many cases these papers put a greater emphasis on analyzing how linguistic structure emerged within the representations learned by the model. In general, the accuracy of speech-image retrieval systems has lagged behind their text-image counterparts, but recently several works have made enormous progress towards closing this gap, demonstrating that speech-enabled image retrieval is a compelling application in its own right [20,21,22,23].…”
Section: Introduction and Related Work (mentioning)
confidence: 99%