A survey of methods, datasets and evaluation metrics for visual question answering

Sharma, Himanshu; Jalal, Anand Singh

doi:10.1016/j.imavis.2021.104327

Cited by 30 publications

(15 citation statements)

References 36 publications

“…While impressive in their ability to solve traditional computer vision tasks such as detection and recognition, these models still exhibit limitations toward reasoning and inference over how people think and talk about the world. For example, VQA models are biased by how questions are asked (Sharma & Jalal, 2021) and the reasoning behind their output is often opaque (Khan et al, 2022). Thus, it is difficult to interpret the errors they make or whether their reasoning incorporates any structural elements of either individual or shared agency.…”

Section: Discussion (mentioning)

confidence: 99%

A Bayesian theory of mind approach to modeling cooperation and communication

Stacy, Gong, Parab et al. 2023

WIREs Computational Stats

Language has been widely acknowledged as the benchmark of intelligence. However, evidence from cognitive science shows that intelligent behaviors in robust social interactions preexist the mastery of language. This review approaches human‐unique intelligence, specifically cooperation and communication, from an agency‐based theory of mind (ToM) account, emphasizing the ability to understand others' behaviors in terms of their underlying mental states. This review demonstrates this viewpoint by first reviewing a series of empirical works on the socio‐cognitive development of young children and non‐human primates in terms of their capacities in communication and cooperation, strongly suggesting that these capacities constitute the origin of human‐unique intelligence. Following, it reviews how ToM can be formalized as a Bayesian inference of the mental states given observed actions. Then, it reviews how Bayesian ToM can be extended to model the interaction of minds in cooperation and communication. The advantage of this approach is that non‐linguistic knowledge such as the visual environment can serve as the contextual constraint for multiple agents to coordinate with sparse and limited signals, thus demonstrating certain cognitive architectures underlying human communication. This article is categorized under: Applications of Computational Statistics > Psychometrics; Statistical Models > Bayesian Models; Statistical Models > Agent‐Based Models

“…7 Donahue et al 9 presented a recurrent convolutional architecture that offered simultaneous learning of temporal dynamics and convolutional perceptual representations. In this sequence, Yang and Xu 12 proposed a visual question answering (VQA) [17][18][19] -based caption generation model to understand the image content in a deeper way using the knowledge learned from the VQA algorithm by asking questions about a given image.…”

Section: CNN-based Methods (mentioning)

confidence: 99%

Graph neural network-based visual relationship and multilevel attention for image captioning

Sharma, Srivastava 2022

J. Electron. Imag.

Self Cite

With the remarkable success of the image captioning tasks, visual attention methods have become a vital part of captioning models. However, most attention-based image captioning methods do not consider any relationship among regions, which play a significant role in better image understanding. We proposed an image captioning method based on local relation network using a multilevel attention approach with graph neural network. It not only fully explores the relationship between the object and the image regions but also generates significant and context-based features corresponding to every region in the image. The attention employed in our work enhances the image representation capability of our method by focusing on a given image region and its related image regions. Thus addressing the relevant contextual information, spatial locations, and deep visual features leads to improve caption generation. We verified the effectiveness of the proposed model by conducting extensive experiments on three benchmark datasets: Flickr30k, MSCOCO, and nocaps. The results show the superiority of the proposed method over the existing methods both in quantitative and qualitative manners. Detailed ablation studies are conducted to communicate how each part would contribute to the final performance.

“…Other popular VQA datasets are Flickr30k-Entities [26], COCO-QA [27], Visual7W [28] and others. For a detailed analysis on VQA and relevant topics we refer readers to recent specialized survey papers [1,2,3,29,30,31].…”

Section: Related Work (mentioning)

confidence: 99%

Knowledge-Based Counterfactual Queries for Visual Question Answering

Stoikou, Lymperaiou, Stamou 2023

Preprint

Visual Question Answering (VQA) has been a popular task that combines vision and language, with numerous relevant implementations in literature. Even though there are some attempts that approach explainability and robustness issues in VQA models, very few of them employ counterfactuals as a means of probing such challenges in a model-agnostic way. In this work, we propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations. For this reason, we exploit structured knowledge bases to perform deterministic, optimal and controllable word-level replacements targeting the linguistic modality, and we then evaluate the model's response against such counterfactual inputs. Finally, we qualitatively extract local and global explanations based on counterfactual responses, which are ultimately proven insightful towards interpreting VQA model behaviors. By performing a variety of perturbation types, targeting different parts of speech of the input question, we gain insights to the reasoning of the model, through the comparison of its responses in different adversarial circumstances. Overall, we reveal possible biases in the decision-making process of the model, as well as expected and unexpected patterns, which impact its performance quantitatively and qualitatively, as indicated by our analysis.
