2021
DOI: 10.48550/arxiv.2105.11333
Preprint

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Abstract: Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a Transformer-based architecture combi…
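To make the kind of architecture the abstract describes concrete, below is a minimal sketch of a single-stream multi-modal Transformer encoder: visual features and report tokens are projected into a shared embedding space and encoded jointly. The class name, dimensions, feature extractor, and layer choices are illustrative assumptions, not MedViLL's actual configuration.

```python
import torch
import torch.nn as nn

class JointVisionLanguageEncoder(nn.Module):
    """Illustrative single-stream vision-language encoder: projected image
    features and text token embeddings are concatenated and passed through a
    shared Transformer. Sizes and layers are assumptions, not MedViLL's."""

    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12,
                 image_feat_dim=2048, max_image_len=49, max_text_len=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.image_proj = nn.Linear(image_feat_dim, hidden)   # map visual features to hidden size
        self.pos_embed = nn.Embedding(max_image_len + max_text_len, hidden)
        self.type_embed = nn.Embedding(2, hidden)              # 0 = image token, 1 = text token
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, image_feats, text_ids):
        # image_feats: (B, N_img, image_feat_dim), e.g. CNN region features of the radiograph
        # text_ids:    (B, N_txt) token ids of the radiology report
        img = self.image_proj(image_feats)
        txt = self.text_embed(text_ids)
        x = torch.cat([img, txt], dim=1)
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        seg = torch.cat([torch.zeros(img.shape[:2], dtype=torch.long, device=x.device),
                         torch.ones_like(text_ids)], dim=1)
        x = x + self.pos_embed(pos) + self.type_embed(seg)
        return self.encoder(x)   # (B, N_img + N_txt, hidden) joint representation
```

As a usage sketch, `JointVisionLanguageEncoder()(torch.randn(2, 49, 2048), torch.randint(0, 30522, (2, 128)))` returns a (2, 177, 768) joint sequence that task-specific heads (captioning, VQA, classification) could consume.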

Cited by 4 publications (4 citation statements) | References 35 publications

“…Therefore, [8] proposed to perform the pre-training on medical image-text pairs to capture medical knowledge, but their evaluation was conducted only on Med-VQA despite the promising improvement observed. The most related work to ours is [18], which pre-trained a Med-VLP model and verified its effectiveness on various downstream tasks. Yet it is limited to the chest X-ray, and more importantly, the pre-training was not performed in a self-supervised manner (i.e., using the diagnosis labels).…”
Section: Introduction (mentioning)
confidence: 90%
“…Another example would be language translation where GLLMMs can convert text from one language to another with greater accuracy because they have also been primed by gaining earlier access to cultural nuances and context [12]. An example where GLLMMs are applied in the field of medicine would be to analyze the results of medical imaging more accurately (such as X-ray, MRI scans, or ultrasound) by pretraining with access to additional databases that relate to the relevant histology and pathology implicated in the imaging results [13].…”
Section: A Brief Primer on the Concepts and Nomenclature of AI (mentioning)
confidence: 99%
“…Recent research utilizing large data via pre-training Transformer for medical applications are mainly related to vision-and-language (VL) learning. Moon et al [14] pre-trains a Transformer-based model on aligned X-ray images and associated reports for learning joint VL representations in the medical domain. The downstream tasks include both comprehension and generation tasks.…”
Section: Related Work (mentioning)
confidence: 99%
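As a concrete reading of the self-supervised pre-training the citation statements above contrast with label-supervised pre-training, here is a minimal sketch of two objectives commonly paired in vision-language pre-training on image-report data: masked language modeling over report tokens and binary image-report matching. The head names and label conventions are assumptions for illustration, not the exact objectives used in [14] or [18].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PretrainingHeads(nn.Module):
    """Illustrative self-supervised heads on top of a joint encoder output."""

    def __init__(self, hidden=768, vocab_size=30522):
        super().__init__()
        self.mlm_head = nn.Linear(hidden, vocab_size)   # predicts masked report tokens
        self.itm_head = nn.Linear(hidden, 2)             # image-report matching (true vs. swapped pair)

    def forward(self, seq_out, mlm_labels, match_labels):
        # seq_out:      (B, N, hidden) output of the joint vision-language encoder
        # mlm_labels:   (B, N) original token ids at masked positions, -100 elsewhere
        # match_labels: (B,) 1 if the report belongs to the image, 0 if randomly swapped
        mlm_loss = F.cross_entropy(self.mlm_head(seq_out).flatten(0, 1),
                                   mlm_labels.flatten(), ignore_index=-100)
        itm_loss = F.cross_entropy(self.itm_head(seq_out[:, 0]), match_labels)
        return mlm_loss + itm_loss
```

Both losses derive their targets from the image-report pairs themselves rather than from diagnosis labels, which is the distinction the first citation statement draws.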