Towards VQA Models That Can Read

Singh, Amanpreet; Natarajan, Vivek; Shah, Meet; Jiang, Yu; Chen, Xinlei; Batra, Dhruv; Parikh, Devi; Rohrbach, Marcus

doi:10.1109/cvpr.2019.00851

Cited by 398 publications

(388 citation statements)

References 37 publications

Supporting

Mentioning

384

Contrasting

“…The ST-VQA Challenge ran between February and April 2019. Participants were provided with a training set at the beginning of March, while the test set images and questions were only made available for a two week period between 15 and 30 April. The participants were requested to submit results over the test set images and not executables of their systems.…”

Section: Competition Protocol (mentioning)

confidence: 99%

ICDAR 2019 Competition on Scene Text Visual Question Answering

Biten

Tito

Mafla

et al. 2019

2019 International Conference on Document Analysis and Recognition (ICDAR)

This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.

“…Interestingly, concurrently with the ST-VQA challenge, a work similar to ours introduced a new dataset [24] called Text-VQA. This work and the corresponding dataset were published while ST-VQA challenge was on-going.…”

Section: Introduction (mentioning)

confidence: 99%

ICDAR 2019 Competition on Scene Text Visual Question Answering

Biten

Tito

Mafla

et al. 2019

2019 International Conference on Document Analysis and Recognition (ICDAR)

“…Since intra-modality features can be seen as the result of sampling from the distribution along each channel, similarity scores computed over fixed distribution depict feature interactions more profoundly. Moreover, we consider that detector-based features [2,29] may fail to cover all object details, which restricts performance of captioning. Consequently, we further recommend fusing detector-based and grid-based [29] features in image encoder, which helps to enrich object representations.…”

Section: Introduction (mentioning)

confidence: 99%

“…Moreover, we consider that detector-based features [2,29] may fail to cover all object details, which restricts performance of captioning. Consequently, we further recommend fusing detector-based and grid-based [29] features in image encoder, which helps to enrich object representations. By combining both CW Norm and multi-level features, we construct our Relation Enhanced Transformer Block (RETB) for image feature learning.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Wang

Zhang

Liu

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

It is widely shared that capturing relationships among multi-modality features would be helpful for representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer to improve connections among visual features, termed 2. Firstly, we propose Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for the comprehensive analysis of the proposed method. CCS CONCEPTS • Computing methodologies → Image representations; Natural language generation.

“…To our best knowledge, this is the first framework that unifies the topic and sentiment understanding of ads. In particular, we first extract different types of information, such as objects and contained texts from ads using some existing techniques, such as the pre-trained object or image representation models and OCR [29,30]. To recognize and understand the visual rhetoric, an autoencoder module is introduced to decode the object representation in an unsupervised manner.…”

Section: Introduction (mentioning)

confidence: 99%

Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning

Zhang

Luo

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

Given the massive market of advertising and the sharply increasing online multimedia content (such as videos), it is now fashionable to promote advertisements (ads) together with the multimedia content. However, manually finding relevant ads to match the provided content is labor-intensive, and hence some automatic advertising techniques are developed. Since ads are usually hard to understand only according to their visual appearance due to the contained visual metaphor, some other modalities, such as the contained texts, should be exploited for understanding. To further improve user experience, it is necessary to understand both the ads' topic and sentiment. This motivates us to develop a novel deep multimodal multitask framework that integrates multiple modalities to achieve effective topic and sentiment prediction simultaneously for ads understanding. In particular, in our framework termed Deep 2 Ad, we first extract multimodal information from ads and learn high-level and comparable representations. The visual metaphor of the ad is decoded in an unsupervised manner. The obtained representations are then fed into the proposed hierarchical multimodal attention modules to learn task-specific representations for final prediction. A multitask loss function is also designed to jointly train both the topic and sentiment prediction models in an end-to-end manner, where bottom-layer parameters are shared to alleviate over-fitting. We conduct extensive experiments on a large-scale advertisement dataset and achieve state-of-the-art performance for both prediction tasks. The obtained results could be utilized as a benchmark for ads understanding.
