Deep learning-based machine reasoning and visual question answering (VQA) models achieve near-human performance on their respective datasets; however, their performance drops dramatically under domain shift, suggesting that these models fail to generalize to the level of human-like reasoning. In this paper we present a new CLEVR-like dataset consisting of image-question pairs for evaluating the visual reasoning capability of deep models. The objects in each image are arranged so that the first half of the question is ambiguous and multiple answers appear correct up to that point; the second half of the question then resolves the ambiguity, making the VQA task unambiguous with a unique answer. During their reasoning process, deep models therefore need to handle this ambiguity. They can do so either by traversing the search space as a graph (or tree) with a back-tracking technique, or by maintaining a candidate set of possibly correct answers and iteratively eliminating incorrect ones as the reasoning proceeds. We call this dataset CLEVR with Back-Tracking Database (CLEVR-BT-DB). It consists of 2,500 images and 10,000 questions in the same format as the standard CLEVR dataset, and it is available at https://huggingface.co/datasets/Aborevsky01/CLEVR-BT-DB. The code to generate additional data is available at https://github.com/AFigaro/CLEVR_BT_DB. We tested MDETR, a recent deep VQA model from Meta Research: it achieves an accuracy of 99.7% on the standard CLEVR dataset but only 28.01% on our CLEVR-BT-DB dataset.
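As a rough illustration of the two strategies mentioned above, the following Python sketch resolves a two-clause question first by refining a candidate set and then by a simple back-tracking search over the objects. The scene, clause predicates, and all names are hypothetical toy assumptions; this is not the dataset's generation code or any evaluated model's implementation.

```python
# Minimal toy sketch (hypothetical names; not the dataset generator or any
# model's actual implementation) of the two ambiguity-handling strategies.

scene = [
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "cube",     "color": "blue", "size": "large"},
    {"shape": "cylinder", "color": "blue", "size": "small"},
]

# "What is the shape of the large object ..." -> ambiguous after the first half
first_half = lambda obj: obj["size"] == "large"
# "... that is blue?" -> the second half makes the referent unique
second_half = lambda obj: obj["color"] == "blue"


def refine(scene, clauses):
    """Candidate-set strategy: keep every object consistent with the clauses
    seen so far, shrinking the set as more of the question is processed."""
    candidates = list(scene)
    for clause in clauses:
        candidates = [obj for obj in candidates if clause(obj)]
    return candidates


def backtrack(scene, clauses):
    """Back-tracking strategy: tentatively commit to one object and abandon
    it (back-track) as soon as any clause rules it out."""
    for obj in scene:
        if all(clause(obj) for clause in clauses):
            return obj
    return None


print(len(refine(scene, [first_half])))             # 2 -> still ambiguous
print(refine(scene, [first_half, second_half]))     # unique large blue cube
print(backtrack(scene, [first_half, second_half]))  # same object via search
```

In this toy, both strategies agree once the full question is available; the dataset is designed so that a model committing to an answer after only the first half of the question would be wrong.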