2023
DOI: 10.1101/2023.01.10.23284412
Preprint

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training

Abstract: In this paper, we consider the problem of enhancing self-supervised visual-language pre-training (VLP) with medical-specific knowledge, by exploiting the paired image-text reports from daily radiological practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel report filter to extract the medical entities, avoiding unnecessary complexity from language grammar and enhancing the supervision signals; Second, we propose a…
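
To make the report-filtering idea concrete, here is a minimal, purely illustrative Python sketch: a toy keyword-based filter that reduces a free-text radiology report to a list of medical entities. The entity vocabulary, the `filter_report` function, and the regex matching are hypothetical stand-ins, not the paper's actual extraction module.

```python
import re

# Hypothetical vocabulary of findings a chest X-ray report might mention.
ENTITY_VOCAB = ["pneumonia", "effusion", "atelectasis", "cardiomegaly", "edema"]

def filter_report(report: str) -> list[str]:
    """Return the medical entities mentioned in a free-text report."""
    text = report.lower()
    # Keep only vocabulary terms that appear as whole words in the report.
    return [entity for entity in ENTITY_VOCAB if re.search(rf"\b{entity}\b", text)]

report = ("The cardiac silhouette is enlarged, consistent with cardiomegaly. "
          "Small left pleural effusion. No focal pneumonia.")
print(filter_report(report))  # ['pneumonia', 'effusion', 'cardiomegaly']
```

Note that this toy filter ignores negation ("No focal pneumonia" still matches); it is only meant to show how stripping away grammar leaves entity-level supervision signals.
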

Cited by 30 publications (20 citation statements) | References 43 publications

“…Moreover, REFERS introduced a multi-view fusion attention to better align the representations of each radiograph and its associated report. In addition, MGCA and Med-KLIP (Wu et al., 2023) were included as two recent baselines⁴. Apart from the above baselines, we also include M3AE (Geng et al., 2022), a recent masked multi-modal pre-training method originally developed outside the medical domain, for comparison.…”
Section: Report-supervised Methodologies (mentioning)
confidence: 99%
“…MGCA and MedKLIP (Wu et al., 2023) were released after the submission deadline of ICLR 2023. We added results in the camera-ready version for better comparison.…”
mentioning
confidence: 99%
“…In general, most existing Med-VLP models can be classified into two types: the dual-encoder type and the fusion-encoder type, where the former encodes images and texts separately to learn uni-modal/cross-modal representations, followed by a shallow interaction layer (i.e., an image-text contrastive layer), and the latter performs an early fusion of the two modalities through self-attention/co-attention mechanisms to learn multi-modal representations.² For dual-encoders, the purpose of existing studies [66, 19, 44, 60, 57, 61, 3] is to develop label-efficient algorithms that learn effective uni-modal/cross-modal representations, since large-scale manually labeled datasets are difficult and expensive to obtain for medical images. The learned representations can significantly improve the effectiveness of uni-modal (i.e., vision-only or language-only) tasks³ and the efficiency of cross-modal (i.e., image-to-text or text-to-image) retrieval tasks.…”
Section: Introduction (mentioning)
confidence: 99%
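
The dual-encoder versus fusion-encoder distinction quoted above can be illustrated with a minimal PyTorch sketch: one model encodes the modalities separately and interacts only through a contrastive loss, the other fuses them early with cross-attention. All module names, feature dimensions, and the linear stand-in encoders are assumptions made for this sketch, not taken from any of the cited models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoder(nn.Module):
    """Encodes each modality separately; interaction happens only in the loss."""

    def __init__(self, dim=256):
        super().__init__()
        self.image_encoder = nn.Linear(2048, dim)   # stand-in for a vision backbone
        self.text_encoder = nn.Linear(768, dim)     # stand-in for a language model

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_encoder(image_feats), dim=-1)
        txt = F.normalize(self.text_encoder(text_feats), dim=-1)
        logits = img @ txt.t()                      # image-text similarity matrix
        targets = torch.arange(len(img))
        # Symmetric InfoNCE-style contrastive objective (the "shallow interaction layer").
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2


class FusionEncoder(nn.Module):
    """Fuses the modalities early with cross-attention to get multi-modal features."""

    def __init__(self, dim=256):
        super().__init__()
        self.image_proj = nn.Linear(2048, dim)
        self.text_proj = nn.Linear(768, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, image_tokens, text_tokens):
        img = self.image_proj(image_tokens)         # (B, N_img, dim)
        txt = self.text_proj(text_tokens)           # (B, N_txt, dim)
        fused, _ = self.cross_attn(query=txt, key=img, value=img)
        return fused                                # multi-modal representation


# Toy usage with random features.
dual_loss = DualEncoder()(torch.randn(8, 2048), torch.randn(8, 768))
fused = FusionEncoder()(torch.randn(8, 49, 2048), torch.randn(8, 32, 768))
print(dual_loss.item(), fused.shape)
```
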
“…However, existing methods are still limited in the following aspects: 1) Model redundancy caused by multiple branches or multiple downstream heads. They tend to treat the unified model as a feature extractor for common features and add different network branches for the specific tasks [7, 13, 17, 21-23]. For example, there are works using a large language model (i.e.…
Section: Introduction (mentioning)
confidence: 99%
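
As a rough illustration of the "shared backbone plus task-specific heads" pattern that the quote above criticises as model redundancy, here is a hedged sketch; the module names, feature sizes, and task set are invented for the example and do not come from any cited work.

```python
import torch
import torch.nn as nn


class MultiHeadMedModel(nn.Module):
    """One shared feature extractor, with a separate branch per downstream task."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Linear(2048, feat_dim)            # shared feature extractor
        self.classification_head = nn.Linear(feat_dim, 14)   # e.g. 14 finding labels
        self.segmentation_head = nn.Linear(feat_dim, 1)      # stand-in for a mask decoder
        self.report_head = nn.Linear(feat_dim, 30000)        # stand-in for a text decoder

    def forward(self, x, task: str):
        feats = self.backbone(x)                              # common features
        head = {"cls": self.classification_head,
                "seg": self.segmentation_head,
                "report": self.report_head}[task]             # pick a task-specific branch
        return head(feats)


model = MultiHeadMedModel()
print(model(torch.randn(4, 2048), task="cls").shape)          # torch.Size([4, 14])
```

Every new task adds another head, which is the redundancy the citing authors point out.
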