“…Starting from the work of Synnaeve, Versteegh, and Dupoux (2014) and Harwath and Glass (2015), researchers have studied the ability of models to learn to recognize the structure of spoken language, such as words and sub-word units, by training the models to associate speech waveforms with contextually relevant visual inputs. These works have looked at a variety of tasks, such as speech-image retrieval (Harwath, Torralba, and Glass 2016; Chrupała 2019; Ilharco, Zhang, and Baldridge 2019; Mortazavi 2020; Sanabria, Waters, and Baldridge 2021), automatic speech recognition (Sun, Harwath, and Glass 2016; Palaskar, Sanabria, and Metze 2018; Hsu, Harwath, and Glass 2019), word detection and localization (Kamper et al. 2017; Harwath and Glass 2017; Merkx, Frank, and Ernestus 2019; Wang and Hasegawa-Johnson 2020; Olaleye and Kamper 2021), hierarchical linguistic unit analysis (Chrupała, Gelderloos, and Alishahi 2017; Harwath, Hsu, and Glass 2020), cross-modality alignment (Wang et al. 2021; Khorrami and Räsänen 2021), speech segmentation, speech generation (Hsu et al. 2021b), and learning multilingual speech representations (Harwath, Chuang, and Glass 2018; Kamper and Roth 2018; Havard, Chevrot, and Besacier 2020; Ohishi et al. 2020). In this paper, we study the recently proposed FaST-VGS (Peng and Harwath 2021) speech-image retrieval model, and propose a novel extension of the model that incorporates a wav2vec 2.0-style masked language modeling objective in a multi-task learning framework.…”
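To make the multi-task setup described in the last sentence concrete, the sketch below combines a cross-modal contrastive speech-image retrieval loss with a masked-prediction term in a single weighted objective. This is only an illustration under simplifying assumptions, not the FaST-VGS implementation: the real wav2vec 2.0 objective is contrastive over quantized latents with sampled distractors, whereas here it is reduced to masked-frame classification, and all names, tensor shapes, and the weighting factor `alpha` are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(speech_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired speech/image embeddings."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def masked_prediction_loss(frame_logits, target_ids, mask):
    """Simplified stand-in for a wav2vec 2.0-style objective: predict the
    (quantized) target id of each masked frame via cross-entropy."""
    return F.cross_entropy(frame_logits[mask], target_ids[mask])

def multitask_loss(speech_emb, image_emb, frame_logits, target_ids, mask, alpha=1.0):
    """Weighted sum of the retrieval and masked-prediction objectives (alpha is hypothetical)."""
    return (contrastive_retrieval_loss(speech_emb, image_emb)
            + alpha * masked_prediction_loss(frame_logits, target_ids, mask))

# Toy usage with random tensors standing in for model outputs.
B, T, D, V = 8, 50, 256, 320   # batch, frames, embedding dim, codebook size
loss = multitask_loss(
    speech_emb=torch.randn(B, D),
    image_emb=torch.randn(B, D),
    frame_logits=torch.randn(B, T, V),
    target_ids=torch.randint(0, V, (B, T)),
    mask=torch.rand(B, T) < 0.15,   # roughly 15% of frames treated as masked
)
print(float(loss))
```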