2023
DOI: 10.3390/electronics12102183
Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data

Abstract: As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations …

Cited by 4 publications (2 citation statements)
References 59 publications
“…Visual question answering (VQA) has emerged as a key task within multimodal research, marking a foundational step toward the realization of true artificial intelligence entities. This study explores modal fusion methods in VQA contexts [28,29,30] and suggests that similar approaches could be beneficial for other multimodal tasks, such as image captioning, especially in identifying biases. Long-tail distributions in answer datasets and biases due to missing modal information in images represent unavoidable challenges in VQA development.…”
Section: Discussion
Confidence: 99%
“…Phrase comprehension (PC) is a fundamental task in the multi-modal learning community and serves as the basis for many downstream tasks, including image captioning [1,2], visual question answering [3,4], etc. The purpose of PC is to locate a specific entity in an image according to a given linguistic query.…”
Section: Introduction
Confidence: 99%