Towards VQA Models That Can Read

Singh, Amanpreet; Natarajan, Vivek; Shah, Meet; Jiang, Yu; Chen, Xinlei; Batra, Dhruv; Parikh, Devi; Rohrbach, Marcus

doi:10.48550/arxiv.1904.08920

Cited by 6 publications

(7 citation statements)

References 0 publications

“…and ask about objects, instead of asking "where" and "why" questions. While answering questions about text in images is currently an open research problem known as TextVQA [38,41,68,77], inspired by this statistic, we augment our descriptions with text extracted from video frames.…”

Section: Toward Automated Question Answering (mentioning)

confidence: 99%

NarrationBot and InfoBot: A Hybrid System for Automated Video Description

Ihorn, Siu, Bodi et al. 2021

Preprint

Video accessibility is crucial for blind and low vision users for equitable engagements in education, employment, and entertainment. Despite the availability of professional and amateur services and tools, most human-generated descriptions are expensive and time-consuming. Moreover, the rate of human-generated descriptions cannot match the speed of video production. To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video. Results from a mixed-methods study with 26 blind and low vision individuals show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem. In addition, participants reported no significant difference in their ability to understand videos when presented with autogenerated descriptions versus human-revised autogenerated descriptions. Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos. We discuss the limitations of the current work and provide recommendations for the future development of automated video description tools. CCS Concepts: Human-centered computing → Accessibility technologies; Accessibility systems and tools.

“…Bottom-up-attention features are proposed by [1] who won the first place in the 2017 VQA Challenge. Pythia features are provided by [12], who is the VQA 2018 challenge winner. As we see in Tab 1, Pythia features perform better than bottom-up-attention features, and they have a significant gain than object features for about 3%.…”

Section: Multi-source Image Features 211 Incorporating Better Detecti... (mentioning)

confidence: 99%

“…Followed by the early works like [1,12], we use the common practice of ensembling several models to obtain better performance. We choose the best ones of all settings above and try different weights when summing the prediction scores.…”

Section: Weighted Ensemble (mentioning)

confidence: 99%

Deep Reason: A Strong Baseline for Real-World Visual Reasoning

Wu, Zhou, Li et al. 2019

Preprint

This paper presents a strong baseline for real-world visual reasoning (GQA), which achieved 60.93% in the GQA 2019 challenge and placed sixth. GQA is a large dataset with 22M questions involving spatial understanding and multi-step inference. To help further research in this area, we identified three crucial parts that improve the performance, namely: multi-source features, fine-grained encoder, and score-weighted ensemble. We provide a series of analyses of their impact on performance.

“…Interestingly, concurrently with the ST-VQA challenge, a work similar to ours introduced a new dataset [24] called Text-VQA. This work and the corresponding dataset were published while ST-VQA challenge was on-going.…”

Section: Introduction (mentioning)

confidence: 99%

ICDAR 2019 Competition on Scene Text Visual Question Answering

Biten, Tito, Mafla et al. 2019

2019 International Conference on Document Analysis and Recognition (ICDAR)

This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.
