2021 | Preprint
DOI: 10.48550/arxiv.2111.10056

Medical Visual Question Answering: A Survey

Abstract: Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges. Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer. Although general-domain VQA has been extensively studied, medical VQA still requires specific investigation and exploration due to its task features. In the first part of this survey, we cover and discuss the publicly available medical VQA datasets.
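The task definition above maps onto a standard two-encoder pipeline: encode the image, encode the question, fuse the two representations, and classify over a fixed answer vocabulary. Below is a minimal sketch of that generic pipeline, assuming a PyTorch setup; the architecture, layer sizes, and answer-vocabulary size are illustrative and not taken from the survey.

```python
# Minimal sketch of the generic medical VQA pipeline described in the
# abstract. All module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MinimalMedVQA(nn.Module):
    def __init__(self, vocab_size=5000, num_answers=500, dim=512):
        super().__init__()
        # Image encoder: a tiny CNN standing in for a pretrained backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        # Question encoder: word embeddings + LSTM over token ids.
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)
        # Fuse by element-wise product, then predict an answer class.
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image, question_ids):
        v = self.image_encoder(image)                 # (B, dim)
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                     # (B, dim)
        return self.classifier(v * q)                 # (B, num_answers)

# Toy usage: a batch of 2 images and 12-token questions.
logits = MinimalMedVQA()(torch.randn(2, 3, 224, 224),
                         torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # torch.Size([2, 500])
```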

Cited by 7 publications (14 citation statements) | References 39 publications
“…Different modalities of healthcare data, each providing information about a patient's treatment from a specific perspective, overlap and complement each other to further improve the accuracy of diagnosis and treatment. For example, the visual question answering (VQA) task [6] combines computer vision (CV) and natural language processing (NLP), and the model can answer relevant questions based on medical images and clinical notes [60]. However, multi-modal models face more serious bias and fairness issues than unimodal models, despite improvements in performance [12].…”
Section: Fairness of Multi-Modality Model for Healthcare
mentioning confidence: 99%
“…A number of works have also attempted to decompose multi-hop questions into single-hop questions or to generate follow-up questions based on the retrieved information [14,95,102,138,181]. Surveys exist for single-hop QA [2,12,33,60,103,132], open-domain QA [184], medical QA [68,88], etc. The surveys most relevant to MHQA are those focused on QA over knowledge bases [32,43,82] and visual QA [88,135,164]. However, these can be considered sub-domains of the more general formulation of the MHQA field that this manuscript aims to survey.…”
Section: Available Context - B's Father Is C and Her Mother Is A
mentioning confidence: 99%
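The decomposition strategy mentioned in this snippet can be illustrated with a toy sketch: split a multi-hop question into single-hop sub-questions, answer each in turn, and substitute earlier answers into later sub-questions. Both functions below are stubs with hypothetical contents; real systems use trained models for the decomposer and the single-hop answerer.

```python
def decompose(question):
    # Stub: a trained decomposer would produce these single-hop questions,
    # with "#1" marking where the previous hop's answer is substituted.
    return ["Who directed Film X?", "In which country was #1 born?"]

def answer_single_hop(question):
    # Stub: a single-hop QA model over retrieved documents.
    knowledge = {
        "Who directed Film X?": "Jane Doe",
        "In which country was Jane Doe born?": "France",
    }
    return knowledge.get(question, "unknown")

def answer_multi_hop(question):
    answer = None
    for sub_q in decompose(question):
        if answer is not None:
            sub_q = sub_q.replace("#1", answer)  # plug in the previous hop
        answer = answer_single_hop(sub_q)
    return answer

print(answer_multi_hop("In which country was the director of Film X born?"))
# -> France
```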
“…• We propose a novel pipeline that reinforces the text bias after fusing multimodal features, combined with dynamic attention, named the reinforCe unimOdal dynamiC Attention model (COCA), which can be applied universally to visual question answering tasks. • To the best of our knowledge, we are the first to question the widely accepted 'myth' that unimodal biases in medical VQA should be avoided [7] and, moreover, to show that adding unimodal bias after feature fusion can improve prediction accuracy under certain conditions. • Experimental results on a real-world dataset show the superior performance of the proposed COCA model compared with the state of the art.…”
Section: Introduction
mentioning confidence: 96%
“…However, there are cases where reducing unimodal bias does not benefit model performance on other textually biased VQA datasets [15]. Due to the small scale of VQA datasets in the medical domain [7], this is less well covered, and some questions indeed do not require viewing the related image before answering. In such cases, reinforcing unimodal biases, particularly after fusing multimodal features, can improve prediction performance, which we study in this paper.…”
Section: Introduction
mentioning confidence: 99%
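The idea described in the two snippets above, re-injecting a unimodal (text) signal after multimodal fusion, can be sketched as follows. This is a conceptual illustration under assumed names and sizes, not the citing authors' actual COCA implementation.

```python
# Conceptual sketch: reinforce the text bias *after* multimodal fusion by
# re-adding the text feature, scaled by a learned gate. All names and
# dimensions are hypothetical, not the authors' implementation.
import torch
import torch.nn as nn

class TextBiasReinforcedFusion(nn.Module):
    def __init__(self, dim=512, num_answers=500):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        # Gate controlling how strongly the text feature is re-injected.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, image_feat, text_feat):
        fused = torch.relu(self.fuse(torch.cat([image_feat, text_feat], dim=-1)))
        g = self.gate(fused)                      # (B, 1), per-example weight
        return self.classifier(fused + g * text_feat)

# Toy usage with pre-extracted 512-d image and text features.
logits = TextBiasReinforcedFusion()(torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 500])
```

The learned gate lets the model decide, per example, how much of the unimodal text signal to restore, which matches the snippets' point that reinforcing text bias helps only for questions answerable without the image.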