2018
DOI: 10.48550/arxiv.1811.00491
Preprint

A Corpus for Reasoning About Natural Language Grounded in Photographs

Cited by 45 publications (68 citation statements)
References 0 publications
“…GQA [18] and NLVR2 [44], as considered in the LXMERT paper. With UNITER-backbone for DCVLP, the pre-training data is the same as for UNITER.…”
Section: Methods (mentioning)
confidence: 99%
“…During inference, we constrain the decoder to only generate from the 3,192 candidate answers to make a fair comparison with existing methods. Natural Language for Visual Reasoning (NLVR2 (Suhr et al., 2018)): since the task asks the model to distinguish whether a text describes a pair of images, we follow ALBEF to extend the cross-modal encoder to enable reasoning over two images. We also perform an additional pre-training step for 1 epoch using the 4M images: given a pair of images and a text, the model needs to assign the text to either the first image, the second image, or none of them.…”
Section: B Implementation Details of Downstream Tasks (mentioning)
confidence: 99%
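
The image-assignment pre-training step quoted above can be illustrated with a short sketch. This is a hedged, minimal PyTorch example under assumed names (the cross-modal encoder interface and the classification head are hypothetical, not the cited authors' code): given a text paired with two images, a 3-way head predicts whether the text belongs to the first image, the second image, or neither.

import torch
import torch.nn as nn

class ImageAssignmentHead(nn.Module):
    # 3-way classifier: does the text describe image 1, image 2, or neither?
    def __init__(self, hidden_dim=768):
        super().__init__()
        self.classifier = nn.Linear(2 * hidden_dim, 3)

    def forward(self, fused_img1, fused_img2):
        # fused_imgK: [batch, hidden_dim] pooled cross-modal features for (text, image K)
        return self.classifier(torch.cat([fused_img1, fused_img2], dim=-1))

# Usage with an assumed cross-modal encoder producing pooled [batch, hidden_dim] features:
head = ImageAssignmentHead()
f1 = torch.randn(8, 768)            # stands in for cross_encoder(text, image1)
f2 = torch.randn(8, 768)            # stands in for cross_encoder(text, image2)
labels = torch.randint(0, 3, (8,))  # 0 = first image, 1 = second image, 2 = neither
loss = nn.functional.cross_entropy(head(f1, f2), labels)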
“…Image patch and text token embeddings are fed into a transformer or self-attention model to learn fused cross-modal attention. The great progress of these recently developed models can be witnessed on the leaderboards of various tasks without using ensembling, such as VQA, GQA [37], and NLVR2 [38], which can mainly be attributed to the availability of large-scale, weakly correlated multimodal data (typically captioned images or video clips and accompanying subtitles [39]) that can be utilised to learn cross-modal representations by contrastive learning [40]. However, existing pre-trained models mostly use scene-limited image-text pairs with short and relatively simple descriptive captions for images, while ignoring richer uni-modal text data and domain-specific information.…”
Section: Related Work (mentioning)
confidence: 99%
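
As a rough illustration of the contrastive cross-modal objective mentioned in that excerpt, the following minimal sketch assumes already-computed image and text embeddings and applies a symmetric CLIP-style InfoNCE loss; the function name and temperature value are illustrative assumptions, not details taken from the cited works.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of matching (image, text) pairs;
    # matching pairs sit on the diagonal of the similarity matrix.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # [B, B] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2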