A Multimodal Interpretable Visual Question Answering Model Introducing Image Caption Processor

Zhu, He; Togo, Ren; Ogawa, Takahiro; Haseyama, Miki

doi:10.1109/gcce56475.2022.10014385

Cited by 2 publications

(3 citation statements)

References 5 publications

“…Although the previous natural language explanation generation model can provide an explanation, it lags behind humans in terms of details and rationality. We used our previous work [21] for the original explanation generation model in the figure.…”

Section: Figure (mentioning)

confidence: 99%

“…In our previous works [21,22], we introduced image captions and caption-based outside knowledge as novel modalities to improve model performance. However, the generated caption in Figure 1 highlights the limitations of our previous works.…”

Section: Figure (mentioning)

confidence: 99%

“…These methods only refer to the input image, and the model performance is limited by the lack of reference information. Therefore, our previous works [21,22] introduced image captions and outside knowledge as additional modalities to improve model performance. However, there are cases where the generated caption is invalid, which also invalidates the outside knowledge by association.…”

Section: Natural Language Explanation Generation For the Visual Quest... (mentioning)

confidence: 99%

Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data

et al. 2023

Self Cite

As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representations as the mainstream focus in the field. Humans typically answer questions by first identifying the primary objects in an image and then referring to various information sources, both within and beyond the image, including prior knowledge. However, previous studies have only considered input images, resulting in insufficient information that can lead to incorrect answers and implausible explanations. To address this issue, we introduce multiple references in addition to the input image. Specifically, we propose a multimodal model that generates natural language explanations for VQA. We introduce outside knowledge using the input image and question and incorporate object information into the model through an object detection module. By increasing the information available during the model generation process, we significantly improve VQA accuracy and the reliability of the generated explanations. Moreover, we employ a simple and effective feature fusion joint vector to combine information from multiple modalities while maximizing information preservation. Qualitative and quantitative evaluation experiments demonstrate that the proposed method can generate more reliable explanations than state-of-the-art methods while maintaining answering accuracy.
