Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.278
Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Abstract: Image text carries essential information for understanding a scene and performing reasoning. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of the text is underused, and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP…
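The abstract describes jointly producing an answer (chosen from a fixed vocabulary or from the OCR tokens) together with a localized piece of evidence. Below is a minimal PyTorch-style sketch of such a head for orientation only; it is not the authors' LaAP implementation, and the feature dimensions, the copy-score formulation, and the box parameterization are assumptions made here.

```python
# Minimal sketch of a localization-aware answer head; NOT the authors' LaAP code.
# Assumptions: a fused multimodal decoder state of width d_model, a fixed answer
# vocabulary of size vocab_size, and N OCR tokens embedded to the same width.
import torch
import torch.nn as nn

class LocalizationAwareHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.vocab_scores = nn.Linear(d_model, vocab_size)  # score fixed-vocabulary words
        self.ocr_proj = nn.Linear(d_model, d_model)          # project OCR token features
        self.box_head = nn.Sequential(                       # regress an evidence bounding box
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4),                            # (x1, y1, x2, y2), normalized
        )

    def forward(self, dec_state, ocr_feats):
        # dec_state: (B, d_model) decoder state; ocr_feats: (B, N, d_model)
        vocab_logits = self.vocab_scores(dec_state)                     # (B, V)
        ocr_logits = torch.einsum("bd,bnd->bn", dec_state,
                                  self.ocr_proj(ocr_feats))             # (B, N) copy scores
        answer_logits = torch.cat([vocab_logits, ocr_logits], dim=-1)   # pick a vocab word or an OCR token
        evidence_box = self.box_head(dec_state).sigmoid()               # localization evidence
        return answer_logits, evidence_box
```

The point of the sketch is the joint output: the same fused state drives both the answer scores and an evidence box, so the predicted answer can be grounded in a region of the image.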

Cited by 40 publications (15 citation statements) · References: 16 publications
“…TextVQA attracts more and more attention from communities since it focuses on texts in natural daily scenes, such as road signs and displays. To promote progress in this field, several datasets [4,11,12] and methods [8,4,13,5] have been proposed. LoRRA [4] is the pioneering model which utilizes an attention mechanism to handle image features, OCR features, and question words, but it can only select one word as the answer.…”
Section: Text-based Visual Question Answering (mentioning)
Confidence: 99%
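The LoRRA description quoted above (question-guided attention over image and OCR features, with a single-word answer) can be illustrated with a short sketch. This is an assumption-laden approximation, not the released LoRRA code; the additive scoring function and the shared feature width d are choices made here for brevity.

```python
# Sketch of question-guided attention over OCR token features (LoRRA-style idea).
# Assumes the question vector and OCR features are already embedded to width d.
import torch
import torch.nn as nn

class QuestionGuidedOCRAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, q, ocr):
        # q: (B, d) question vector; ocr: (B, N, d) OCR token features
        attn = torch.softmax(
            self.score(torch.tanh(ocr + q.unsqueeze(1))).squeeze(-1), dim=-1)  # (B, N)
        pooled = (attn.unsqueeze(-1) * ocr).sum(dim=1)  # (B, d) attended OCR summary
        return pooled, attn  # the attention weights indicate which single OCR word to copy
```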
“…The Multimodal Multi-Copy Mesh (M4C) model [5] boosts TextVQA performance by employing a multimodal transformer [6] to fuse various modality entities. In the following works, modifications with respect to feature embedding [9], feature interaction [14,7,15] and answer decoding [8] have been shown to improve performance. In addition, a pre-training method TAP [10] is proposed on the TextVQA and TextCaption tasks.…”
Section: Text-based Visual Question Answering (mentioning)
Confidence: 99%
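The M4C idea referenced above, fusing entities from all modalities with a multimodal transformer, can be sketched compactly: project question, object, and OCR features into a common space and let a standard transformer encoder attend over the concatenated sequence. The layer sizes, the linear projections, and the omission of M4C's iterative pointer decoder are simplifications assumed here, not details taken from the cited papers.

```python
# Sketch of M4C-style multimodal fusion; an illustrative approximation, not the published code.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_q: int, d_obj: int, d_ocr: int, d_model: int = 768, layers: int = 4):
        super().__init__()
        self.proj_q = nn.Linear(d_q, d_model)      # project question-token embeddings
        self.proj_obj = nn.Linear(d_obj, d_model)  # project detected-object features
        self.proj_ocr = nn.Linear(d_ocr, d_model)  # project OCR-token features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, q_tokens, obj_feats, ocr_feats):
        # q_tokens: (B, Lq, d_q), obj_feats: (B, Lo, d_obj), ocr_feats: (B, Ln, d_ocr)
        seq = torch.cat([self.proj_q(q_tokens),
                         self.proj_obj(obj_feats),
                         self.proj_ocr(ocr_feats)], dim=1)
        fused = self.encoder(seq)  # joint self-attention over all modality entities
        n_ocr = ocr_feats.size(1)
        return fused[:, -n_ocr:, :], fused  # fused OCR slots (for answer copying) and full sequence
```

In the actual M4C family of models, answer generation then proceeds step by step, with each step choosing either a vocabulary word or one of the fused OCR slots; the sketch stops at the fusion stage.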
“…This line of research has been pursued by several studies, particularly thanks to the introduction of the Fact‐based VQA dataset (Wang et al., 2017b). The VQA task is now taking new directions, such as embodied approaches where an agent has to navigate an environment and answer questions about it (H. Chen et al., 2019; Das et al., 2018); video VQA, where the answer has to be found in videos rather than in static images (Lei et al., 2018, 2020); answering questions about diagrams and charts (Ebrahimi Kahou et al., 2017; Kafle et al., 2018); text VQA, which involves recognizing and interpreting textual content in images (Biten et al., 2019; Han et al., 2020); answering questions about medical images (see, Abacha et al., 2020); and many others.…”
Section: The Recent Revival of VQA (mentioning)
Confidence: 99%
“…On TextCaps, M4C can be adapted to generate a sentence by taking previously generated words as text inputs at each time step. Multiple models have been introduced recently which ablate various components of M4C for better accuracy [24,13,16,22]. Contrary to M4C and derivative works which treat OCR as a black box, in PixelM4C, we train an end-to-end model and use this capability to apply new design choices in a more informed way.…”
Section: Downstream Application Models (mentioning)
Confidence: 99%