2020
DOI: 10.1007/978-3-030-58545-7_41

Spatially Aware Multimodal Transformers for TextVQA

Cited by 80 publications (72 citation statements)
References 29 publications
“…Attention models, introduced in Xu, Ba, Kiros, Cho, et al (2015), further improved performance in image captioning and were refined in bottom-up and top-down attention models (Anderson et al, 2018a). Transformer models (Vaswani et al, 2017) have been adapted to multimodal scenarios such as image captioning and visual question answering (VQA) in works like Kant et al (2020) and Luo et al (2019), which won the Conceptual Captions challenge on the GCC dataset in 2019 (Sharma et al, 2018). Generic image captioning systems were trained on the MS-COCO or GCC benchmark using cross-entropy training.…”
Section: Related Work (mentioning)
confidence: 99%
“…Inspired by [14,20,43], we calculate the similarity between each pair of regions by their Intersection over Union (IoU) score. Region pairs with IoU scores larger than zero are considered to have edges in E, and their IoU scores are regarded as their similarities in S. For the text graph, we use an off-the-shelf scene graph parser provided by [1] to obtain a text scene graph from a text.…”
Section: Knowledge Extraction (mentioning)
confidence: 99%
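A minimal sketch of the IoU-based region-graph construction described in the excerpt above, not the cited authors' code: region pairs whose bounding boxes have IoU greater than zero receive an edge in E, and the IoU value is used as their similarity in S. The [x1, y1, x2, y2] box format and the function names are assumptions for illustration.

```python
import numpy as np

def pairwise_iou(boxes: np.ndarray) -> np.ndarray:
    """boxes: (N, 4) array of [x1, y1, x2, y2] region coordinates (assumed format)."""
    # Intersection rectangle for every pair of boxes.
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = areas[:, None] + areas[None, :] - inter
    return inter / np.clip(union, 1e-8, None)

def build_region_graph(boxes: np.ndarray):
    """Return edge list E and similarity matrix S for the visual region graph."""
    S = pairwise_iou(boxes)
    np.fill_diagonal(S, 0.0)  # no self-loops
    # Edge exists only when the two regions overlap (IoU > 0).
    E = [(i, j) for i in range(len(boxes))
         for j in range(i + 1, len(boxes)) if S[i, j] > 0]
    return E, S
```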
“…Some recent works construct graphs for visual or linguistic elements in V+L tasks, such as VQA [16,27,43], VideoQA [28,30,78], Image Captioning [23,69,75], and Visual Grounding [31,47,68], to reveal relationships between these elements and obtain fine-grained semantic representations. These constructed graphs can be broadly grouped into three types: visual graphs between image objects/regions (e.g., [69]), linguistic graphs between sentence elements/tokens (e.g., [33]), and crossmodal graphs among visual and linguistic elements (e.g., [47]). In this work, we construct the visual graph for X-GGM.…”
Section: Graph Construction in V+L Tasks (mentioning)
confidence: 99%