Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1006

Large-Scale Representation Learning from Visually Grounded Untranscribed Speech

Abstract: Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine…
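The abstract's key technical ingredients are a dual encoder over audio and image embeddings and a masked margin softmax loss. As a rough sketch (not the authors' code), the snippet below shows one common form of such a batch-softmax loss in PyTorch, where each matching pair sits on the diagonal of the in-batch similarity matrix and a margin is subtracted from the positive scores; the function name, default margin value, and the omission of the paper's masking of sampled candidates are illustrative assumptions.

```python
# Illustrative sketch of a dual-encoder masked margin softmax loss.
# Assumptions (not from the paper's code): function name, default
# margin, L2-normalized inputs, and no masking of sampled candidates.
import torch
import torch.nn.functional as F

def masked_margin_softmax_loss(audio_emb, image_emb, margin=0.001):
    """audio_emb, image_emb: (B, D) batch embeddings where row i of
    each tensor comes from the same audio-caption/image pair."""
    # Similarity of every audio clip against every image in the batch.
    sim = audio_emb @ image_emb.t()                  # (B, B)
    b = sim.size(0)
    # Subtract the margin from the positive (diagonal) scores so a
    # positive must beat each in-batch negative by at least `margin`.
    sim = sim - margin * torch.eye(b, device=sim.device)
    # Softmax cross-entropy in both retrieval directions
    # (audio->image and image->audio), with the diagonal index
    # as the target class for each row/column.
    targets = torch.arange(b, device=sim.device)
    return F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)
```

Whereas a triplet loss contrasts each positive against a single sampled negative, this batch-softmax form contrasts it against all in-batch negatives at once, which is one plausible reading of why the abstract reports it outperforming the standard triplet loss.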

Cited by 58 publications (62 citation statements) | References 56 publications

“…As text pretraining schemes seem to be reaching the point of diminishing returns, even for some syntactic phenomena (van Schijndel et al, 2019), we posit that other forms of supervision, such as multimodal perception (Ilharco et al, 2019), are necessary to learn the remaining aspects of meaning in context. Learning by observation should not be a purely linguistic process, since leveraging and combining the patterns of multimodal perception can combinatorially boost the amount of signal in data through cross-referencing and synthesis.…”
Section: WS2: The Written World
confidence: 99%
“…Possible directions for future work include: (1) OCR postprocessing improvement: spelling correction, noise removal, etc. ; (2) enhancing text-based prediction and/or object detection using thesauri, knowledge graphs, e.g., replacing specific entities with hypernyms similar to Ilharco et al (2019) or word associations datasets, e.g., Wordgame (Louwe, 2020); (3) developing a joint architecture for images and texts: obviously, simple blending might not be the best choice for the task.…”
Section: Discussion
confidence: 99%
“…Instead, the supervisory information consists of the corresponding images of the speech descriptions. Inspired by human infants' ability to learn spoken language by listening and paying attention to the concurrent speech and visual scenes, several recent methods [15], [36], [37], [38], [39], [40] have been proposed to learn speech models grounded by visual information.…”
Section: B Visually-grounded Speech Embedding Learning
confidence: 99%