2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00851

Towards VQA Models That Can Read

Abstract: Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 q…
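To illustrate the core idea the abstract points at, namely letting a VQA model answer with text it reads in the image, the following is a minimal sketch of a copy-style answer head that scores a fixed answer vocabulary alongside detected OCR tokens. The module names, dimensions, and fusion step are illustrative assumptions, not the paper's exact LoRRA architecture.

```python
# Minimal sketch of a VQA model with a "reading" branch, assuming a generic
# copy-style formulation: the answer space is the fixed vocabulary plus the
# OCR tokens detected in the image. Names and shapes are illustrative only.
import torch
import torch.nn as nn


class ReadingVQAModel(nn.Module):
    def __init__(self, q_dim=768, img_dim=2048, ocr_dim=300, hidden=512, vocab_size=3000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.ocr_proj = nn.Linear(ocr_dim, hidden)
        # Scores over the fixed answer vocabulary.
        self.vocab_head = nn.Linear(hidden, vocab_size)

    def forward(self, q_feat, img_feats, ocr_feats):
        # q_feat:    (B, q_dim)       pooled question embedding
        # img_feats: (B, R, img_dim)  detector region features
        # ocr_feats: (B, T, ocr_dim)  embeddings of OCR tokens found in the image
        q = self.q_proj(q_feat)                     # (B, H)
        img = self.img_proj(img_feats).mean(dim=1)  # (B, H) mean-pooled regions
        fused = torch.relu(q * img)                 # (B, H) simple multiplicative fusion
        vocab_scores = self.vocab_head(fused)       # (B, V)
        # "Copy" scores: similarity between the fused query and each OCR token,
        # so the model can answer with text it reads in the image.
        ocr = self.ocr_proj(ocr_feats)                                # (B, T, H)
        copy_scores = torch.bmm(ocr, fused.unsqueeze(2)).squeeze(2)   # (B, T)
        # Final answer distribution is over vocabulary entries plus OCR tokens.
        return torch.cat([vocab_scores, copy_scores], dim=1)


if __name__ == "__main__":
    model = ReadingVQAModel()
    logits = model(torch.randn(2, 768), torch.randn(2, 36, 2048), torch.randn(2, 10, 300))
    print(logits.shape)  # torch.Size([2, 3010]): 3000 vocab answers + 10 OCR tokens
```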

Cited by 398 publications (388 citation statements)
References 37 publications

Citation statements:
“…The ST-VQA Challenge ran between February and April 2019. Participants were provided with a training set at the beginning of March, while the test set images and questions were only made available for a two-week period between 15 and 30 April. The participants were requested to submit results over the test set images and not executables of their systems.…”
Section: Competition Protocol (mentioning)
confidence: 99%
“…Interestingly, concurrently with the ST-VQA challenge, a work similar to ours introduced a new dataset [24] called Text-VQA. This work and the corresponding dataset were published while the ST-VQA challenge was ongoing.…”
Section: Introduction (mentioning)
confidence: 99%
“…Since intra-modality features can be seen as the result of sampling from the distribution along each channel, similarity scores computed over a fixed distribution depict feature interactions more profoundly. Moreover, we consider that detector-based features [2,29] may fail to cover all object details, which restricts captioning performance. Consequently, we further recommend fusing detector-based and grid-based [29] features in the image encoder, which helps to enrich object representations.…”
Section: Introduction (mentioning)
confidence: 99%
“…Moreover, we consider that detector-based features [2,29] may fail to cover all object details, which restricts captioning performance. Consequently, we further recommend fusing detector-based and grid-based [29] features in the image encoder, which helps to enrich object representations. By combining both CW Norm and multi-level features, we construct our Relation Enhanced Transformer Block (RETB) for image feature learning.…”
Section: Introduction (mentioning)
confidence: 99%
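As a concrete illustration of the detector-plus-grid fusion these statements recommend, here is a minimal sketch that projects both feature types into a shared space and feeds them jointly to a transformer image encoder. The dimensions, pooling choices, and class names are assumptions for illustration, not the citing paper's actual RETB implementation.

```python
# Minimal sketch of fusing detector-based region features with grid-based
# backbone features before a transformer image encoder. All shapes and names
# are illustrative assumptions, not the citing paper's exact design.
import torch
import torch.nn as nn


class FusedImageEncoder(nn.Module):
    def __init__(self, region_dim=2048, grid_dim=1024, d_model=512, n_layers=3):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)
        self.grid_proj = nn.Linear(grid_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, region_feats, grid_feats):
        # region_feats: (B, R, region_dim) object-proposal features from a detector
        # grid_feats:   (B, G, grid_dim)   flattened CNN feature-map cells
        tokens = torch.cat(
            [self.region_proj(region_feats), self.grid_proj(grid_feats)], dim=1
        )  # (B, R + G, d_model): detector tokens and grid tokens side by side
        return self.encoder(tokens)


if __name__ == "__main__":
    enc = FusedImageEncoder()
    out = enc(torch.randn(2, 36, 2048), torch.randn(2, 49, 1024))
    print(out.shape)  # torch.Size([2, 85, 512])
```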
“…To the best of our knowledge, this is the first framework that unifies the topic and sentiment understanding of ads. In particular, we first extract different types of information, such as objects and contained texts, from ads using existing techniques, such as pre-trained object or image representation models and OCR [29,30]. To recognize and understand the visual rhetoric, an autoencoder module is introduced to decode the object representation in an unsupervised manner.…”
Section: Introduction (mentioning)
confidence: 99%
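The unsupervised autoencoder step described in this statement, reconstructing pre-extracted object representations so the latent code can capture an ad's visual rhetoric, could look roughly like the following. The architecture, dimensions, and training loop are illustrative assumptions, not the citing paper's exact design.

```python
# Minimal sketch of an autoencoder over pre-extracted object features,
# trained in an unsupervised manner. Dimensions and the loop are assumptions.
import torch
import torch.nn as nn


class ObjectAutoencoder(nn.Module):
    def __init__(self, feat_dim=2048, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))

    def forward(self, obj_feats):
        z = self.encoder(obj_feats)   # latent code for each object representation
        return self.decoder(z), z


if __name__ == "__main__":
    model = ObjectAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    feats = torch.randn(64, 2048)     # stand-in for detector features of ad objects
    for _ in range(5):                # tiny unsupervised reconstruction loop
        recon, _ = model(feats)
        loss = nn.functional.mse_loss(recon, feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(loss.item())
```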