2020
DOI: 10.48550/arxiv.2010.00747
Preprint

Contrastive Learning of Medical Visual Representations from Paired Images and Text

Abstract: Learning visual representations of medical images is core to medical image understanding but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medi…

Cited by 84 publications (136 citation statements)
References 31 publications
“…Vision-Language Models have recently demonstrated great potential in learning generic visual representations and allowing zero-shot transfer to a variety of downstream classification tasks (Radford et al, 2021; Jia et al, 2021; Zhang et al, 2020). To our knowledge, the recent developments in vision-language learning, particularly CLIP (Radford et al, 2021) and ALIGN (Jia et al, 2021), are largely driven by advances in the following three areas: 1) text representation learning with Transformers (Vaswani et al, 2017), 2) large-minibatch contrastive representation learning (He et al, 2020; Hénaff et al, 2020), and 3) web-scale training datasets: CLIP benefits from 400 million curated image-text pairs while ALIGN exploits 1.8 billion noisy image-text pairs.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recent work Radford et al [2021], Cho et al [2021], Su et al [2019] has shown improvements in visual and textual encoders when learning from the contrast of image-text pairs and using natural language as supervision in addition to just visual images. This trend of improvements has also been observed in various classification use cases in the medical domain (Zhang et al [2020]). Among these approaches, the contrastive pre-training of language-image data in CLIP (Radford et al [2021]) has been particularly successful.…”
mentioning
confidence: 57%
“…The contrastive loss has been widely adopted in representation learning [7,14] and more recently in image synthesis [13,33,42,75]. Given a batch of paired vectors $(u, v) = \{(u_i, v_i),\ i = 1, 2, \ldots, N\}$, the symmetric cross-entropy loss [46,79] maximizes the similarity of the vectors in a pair while keeping non-paired vectors apart…”
Section: Losses and Training Procedures (mentioning)
confidence: 99%
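
The symmetric cross-entropy loss quoted above is the batch-level contrastive objective also used in CLIP-style and ConVIRT-style training: each pair $(u_i, v_i)$ is a positive, and every other combination in the batch serves as a negative. Below is a minimal PyTorch sketch of that objective, assuming the two inputs are embedding batches of shape (N, d); the function name, the L2 normalization step, and the default temperature are illustrative assumptions, not details taken from the cited papers.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(u: torch.Tensor, v: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy contrastive loss over a batch of paired vectors.

    u, v: (N, d) embedding batches where (u[i], v[i]) is a positive pair and all
    other in-batch combinations are treated as negatives.
    tau: temperature scaling the similarity logits (illustrative default).
    """
    # Cosine similarities between every u_i and every v_j, scaled by temperature.
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = u @ v.t() / tau                      # shape (N, N)

    # The matching index is the "class" label: positives sit on the diagonal.
    targets = torch.arange(u.size(0), device=u.device)

    # Cross-entropy in both directions (u -> v and v -> u), then averaged,
    # which is what makes the loss symmetric.
    loss_uv = F.cross_entropy(logits, targets)
    loss_vu = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_uv + loss_vu)
```

Pulling the diagonal entries up while pushing the off-diagonal entries down is what "keeps non-paired vectors apart"; since every other example in the batch acts as a negative, larger mini-batches supply more negatives, which is why the earlier statement highlights large-minibatch contrastive training as one of the key drivers.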