2023 · DOI: 10.3390/bioengineering10030380
Vision–Language Model for Visual Question Answering in Medical Imagery

Abstract: In the clinical and healthcare domains, medical images play a critical role. A mature medical visual question answering (VQA) system can improve diagnosis by answering clinical questions posed about a medical image. Despite its enormous potential for healthcare services, this technology is still in its infancy and far from practical use. This paper introduces an approach based on a transformer encoder–decoder architecture. Specifically, we extract image features using the vision transformer…
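
The abstract describes a ViT image encoder feeding a transformer decoder that generates free-text answers. Below is a minimal PyTorch sketch of that general encoder–decoder VQA pattern, not the authors' exact model: class names, dimensions, the naive early fusion of question tokens into the memory, and the use of `nn.TransformerEncoder`/`nn.TransformerDecoder` are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MedicalVQASketch(nn.Module):
    """Hypothetical encoder-decoder VQA model: a ViT-style image encoder
    plus a transformer decoder that generates the answer token by token.
    Layer counts and dimensions are illustrative, not the paper's."""

    def __init__(self, vocab_size=30_000, d_model=768, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in image encoder: in the paper this role is played by a
        # pretrained ViT producing one embedding per image patch.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.image_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_embeds, question_ids, answer_ids):
        # Encode image patches, then append question embeddings as extra
        # memory tokens (one of several possible fusion strategies).
        memory = self.image_encoder(patch_embeds)
        q = self.token_emb(question_ids)
        memory = torch.cat([memory, q], dim=1)
        # Causal mask so each answer position only sees earlier tokens.
        tgt = self.token_emb(answer_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)  # next-token logits

# Smoke test: batch of 2, 196 patches (14x14 grid), 768-dim embeddings.
model = MedicalVQASketch()
logits = model(torch.randn(2, 196, 768),
               torch.randint(0, 30_000, (2, 12)),
               torch.randint(0, 30_000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 30000])
```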

Cited by 21 publications (3 citation statements) · References 74 publications
“…Visual question answering (VQA) has emerged as a key task within multimodal research, marking a foundational step toward the realization of true artificial intelligence entities. This study explores modal fusion methods in VQA contexts [28, 29, 30] and suggests that similar approaches could be beneficial for other multimodal tasks, such as image captioning, especially in identifying biases. Long-tail distributions in answer datasets and biases due to missing modal information in images represent unavoidable challenges in VQA development.…”
Section: Discussion (mentioning)
Confidence: 99%
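
The fusion methods this statement refers to combine question and image representations before answer prediction. As a generic illustration of one common pattern, here is a cross-attention fusion module with a learned gate; it is a sketch, not any specific method from [28, 29, 30], and all names and dimensions are assumed.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative modal fusion: question tokens attend over image
    regions, and a learned gate mixes the attended visual evidence
    back into the question stream."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, question_feats, image_feats):
        # question_feats: (B, Lq, dim); image_feats: (B, Lv, dim)
        attended, _ = self.cross_attn(query=question_feats,
                                      key=image_feats,
                                      value=image_feats)
        # The gate controls how much visual evidence is used; answering
        # from the question stream alone is one source of the language
        # bias the citing survey discusses.
        g = torch.sigmoid(self.gate(torch.cat([question_feats, attended], dim=-1)))
        return g * attended + (1 - g) * question_feats

fusion = CrossAttentionFusion()
fused = fusion(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
print(fused.shape)  # torch.Size([2, 12, 512])
```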
“…Figure 9 shows an illustration of utilizing a Vision Transformer for VQA. The work of (46) discusses a transformer-based VQA system for medical images, built on a ViT image encoder and a transformer text encoder. The model exhibits good results on two VQA datasets comprising radiology images.…”
Section: Vision Transformers for Visual Question Answering (mentioning)
Confidence: 99%
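
For context, the first stage of such a system, extracting patch-level features with a pretrained ViT, can be reproduced with the Hugging Face transformers library. This is a generic illustration rather than the pipeline of (46); the checkpoint is a standard public one, and the image filename is hypothetical.

```python
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

# Generic pretrained ViT; a medical VQA system would typically
# fine-tune this (or a domain-specific checkpoint) on radiology data.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("chest_xray.png").convert("RGB")  # hypothetical file
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dim embedding per 16x16 patch plus the [CLS] token:
# (1, 197, 768) for a 224x224 input.
print(outputs.last_hidden_state.shape)
```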
“…Seeing is Knowing (106), MULAN (107)
Faster R-CNN with ResNet-101: GAT (108), ATH (109), DMMGR (24), MCLN (110), MCAN (111), F-SWAP (112), SRRN (35), TVQA (113)
Faster R-CNN with ResNet-152: RA-MAP (114), MASN (115), Anomaly based (114), Vocab based (116), DA-Net (117)
ResNet CNN within Faster R-CNN: MuVAM (118)
Faster R-CNN with ResNeXt-152: CBM (119)
RCNN (120): Multi-image (89)
VGGNet (121): VQA-AID (122)
EfficientNetV2 (123): RealFormer (124)
YOLO (125): Scene Text VQA (126)
CLIP ViT-B: CCVQA (14)
ResNet NFNet (127): Flamingo (128)
ViT (129): VLMmed (46), ConvS2S+ViT (130), BMT (10), M2I2 (52)
XCLIP with ViT-L/14: CMQR (32)
ResNet18, Swin, ViT: LV-GPT (43)
GLIP (131): REVIVE (132)
CLIP (133): KVQAE (30)

2.6.4 VGGNet (121)
VGGNet (Visual Geometry Group Network) is a CNN built from stacks of small 3×3 convolutional filters, achieving good performance in image classification tasks. It is known chiefly for its simplicity and generalizability to new datasets.…”
Section: Faster R-CNN (mentioning)
Confidence: 99%
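
As a concrete illustration of VGGNet in the feature-extractor role it plays for models like VQA-AID (122) in the table above, the following torchvision sketch keeps only the convolutional trunk so the network yields spatial feature maps rather than class logits. The layer choice and the random stand-in image are assumptions.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

weights = VGG16_Weights.DEFAULT
preprocess = weights.transforms()  # resize, crop, and normalize for VGG-16

# Drop the fully connected classifier head; `features` is the conv trunk.
backbone = vgg16(weights=weights).features.eval()

x = preprocess(torch.rand(3, 480, 640))  # stand-in for a real image tensor
with torch.no_grad():
    feats = backbone(x.unsqueeze(0))

# Spatial feature map a downstream VQA model could attend over:
print(feats.shape)  # torch.Size([1, 512, 7, 7]) after 224x224 preprocessing
```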