2021
DOI: 10.48550/arxiv.2104.01894
Preprint

Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval

Abstract: Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice, both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors…
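The abstract describes direct speech-based retrieval: a spoken query and candidate images are mapped into a shared embedding space and ranked by similarity. Below is a minimal sketch of that retrieval step, assuming hypothetical pretrained speech/image encoders that produce fixed-size embeddings; it is not the paper's exact architecture.

```python
# Minimal sketch of direct speech-based image retrieval.
# Assumes hypothetical encoders have already produced embeddings in a
# shared space; only the similarity-based ranking step is shown.
import torch
import torch.nn.functional as F


def retrieve_images(speech_emb: torch.Tensor,
                    image_embs: torch.Tensor,
                    k: int = 5) -> torch.Tensor:
    """Return indices of the k images most similar to a spoken query.

    speech_emb: (d,) embedding of one spoken caption.
    image_embs: (n, d) embeddings of the candidate image collection.
    """
    # Cosine similarity = dot product of L2-normalized embeddings.
    q = F.normalize(speech_emb, dim=-1)
    g = F.normalize(image_embs, dim=-1)
    scores = g @ q                 # (n,) similarity of each image to the query
    return scores.topk(k).indices  # indices of the top-k images


# Toy usage with random tensors standing in for real encoder outputs.
speech_emb = torch.randn(512)
image_embs = torch.randn(1000, 512)
print(retrieve_images(speech_emb, image_embs, k=5))
```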

Cited by 3 publications (3 citation statements)
References 26 publications
“…The ResDAVEnet-VQ [27] architecture adds configurable vector quantization layers to the audio model. Several other models that learn audio-visual correspondences from both images and videos have been presented in recent work [28][29][30][31][32].…”
Section: Audio-visual Models
confidence: 99%
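The statement above notes that ResDAVEnet-VQ adds configurable vector quantization layers to the audio model. The following is an illustrative VQ-VAE-style quantization layer (nearest-codebook lookup with a straight-through gradient), given as a sketch of the general technique rather than the authors' exact implementation; the class name and dimensions are assumptions.

```python
# Illustrative sketch of a vector quantization layer of the kind that
# ResDAVEnet-VQ inserts into the audio branch. This is a generic
# VQ-VAE-style quantizer, not the authors' exact implementation.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        # Learnable codebook of discrete audio units.
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level audio features.
        flat = x.reshape(-1, x.shape[-1])                  # (B*T, dim)
        # Squared L2 distance from each frame to every codebook entry.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))         # (B*T, K)
        codes = d.argmin(dim=1)                            # nearest code ids
        q = self.codebook(codes).view_as(x)                # quantized features
        # Straight-through estimator: gradients bypass the argmin.
        return x + (q - x).detach()


# Quantize a batch of 100 audio frames with 256-dim features.
vq = VectorQuantizer(num_codes=512, dim=256)
out = vq(torch.randn(4, 100, 256))
print(out.shape)  # torch.Size([4, 100, 256])
```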
“…Several works [2,3,15] demonstrate the ability to learn semantic relationships between objects in images and the spoken words describing them, using only the pairing between images and spoken captions as supervision. Using this framework, researchers have proposed improved image encoders, audio encoders, and loss functions [4][5][6][7][8][16][17][18][19][20]. Harwath et al. [3,4,21] collected 400k spoken audio captions of images in the Places205 [22] dataset in English, which is one of the largest spoken caption datasets.…”
Section: Related Work
confidence: 99%