Meet Shah scite author profile

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models cannot read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer that is either a deduction based on the text and the image or composed of strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.

VQA Component. Similar to many VQA models [7,17], we first embed the question words w_1, w_2, ..., w_L of the question q with a pre-trained embedding function (e.g., GloVe [36]) and then encode the resultant word embeddings iteratively with a re-

Cycle-Consistency for Robust Visual Question Answering

Shah et al. 2019

Comparison of Mortality Rates and Progression of Left Ventricular Dysfunction in Patients With Idiopathic Dilated Cardiomyopathy and Dilated Versus Nondilated Right Ventricular Cavities

Sun, James, Yang et al. 1997

The American Journal of Cardiology

MS-Net: Mixed-Supervision Fully-Convolutional Networks for Full-Resolution Segmentation

Shah, Merchant, Awate 2018

Conditional Entropy Coding for Efficient Video Compression

Liu, Wang, Chiu et al. 2020

We propose a very simple and efficient video compression framework that focuses only on modeling the conditional entropy between frames. Unlike prior learning-based approaches, we reduce complexity by not performing any form of explicit transformations between frames, and we assume each frame is encoded with an independent state-of-the-art deep image compressor. We first show that a simple architecture modeling the entropy between the image latent codes is competitive with other neural video compression works and video codecs while being much faster and easier to implement. We then propose a novel internal learning extension on top of this architecture that brings an additional ∼10% bitrate savings without trading off decoding speed. Importantly, we show that our approach outperforms H.265 and other deep learning baselines in MS-SSIM on higher bitrate UVG video, and against all video codecs on lower framerates, while being thousands of times faster in decoding than deep models utilizing an autoregressive entropy model.
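The idea of coding a frame's latent codes conditioned on the previous frame can be illustrated with a toy calculation. The sketch below is not the paper's learned entropy model; it only estimates empirical entropies on made-up discrete symbols to show why H(current | previous) can be much lower than H(current) when consecutive frames are similar. The function names and the example code sequences are assumptions introduced for illustration.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def conditional_entropy(current, previous):
    """Empirical H(current | previous) = H(current, previous) - H(previous)."""
    joint = list(zip(current, previous))
    return entropy(joint) - entropy(previous)


# Hypothetical quantized latent codes for two consecutive frames.
# Only the last position differs between the two frames.
prev_codes = [0, 1, 2, 3, 0, 1, 2, 3]
curr_codes = [0, 1, 2, 3, 0, 1, 2, 0]

# Cost of coding the current frame on its own vs. given the previous frame.
independent_bits = entropy(curr_codes)
conditional_bits = conditional_entropy(curr_codes, prev_codes)

print(f"H(current)            = {independent_bits:.3f} bits/symbol")
print(f"H(current | previous) = {conditional_bits:.3f} bits/symbol")
```

In this toy case the conditional estimate is far smaller than the independent one because the previous frame almost fully determines the current one. A real codec would replace the empirical counts with a learned probability model driving an arithmetic coder.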

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA


Towards VQA Models That Can Read
