Objective: Contrastive Language-Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image-text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. This work evaluates the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). To this end, we present PubMedCLIP, a fine-tuned version of CLIP for the medical domain based on PubMed articles.
Materials and Methods: Our experiments are conducted on two MedVQA benchmark datasets and investigate two MedVQA methods, MEVF (Mixture of Enhanced Visual Features) and QCR (Question answering via Conditional Reasoning). For each of these, we assess the merits of visual representation learning using PubMedCLIP, the original CLIP, and state-of-the-art MAML (Model-Agnostic Meta-Learning) networks pre-trained only on visual data. We open-source the code for our MedVQA pipeline and for pre-training PubMedCLIP.
Results: CLIP and PubMedCLIP achieve improvements in comparison to MAML's visual encoder. PubMedCLIP achieves the best results, with gains in overall accuracy of up to 3%. Individual examples illustrate the strengths of PubMedCLIP in comparison to the previously widely used MAML networks.
Discussion and Conclusion: Visual representation learning with language supervision in PubMedCLIP leads to noticeable improvements for MedVQA. Our experiments reveal distributional differences between the two MedVQA benchmark datasets that have not been reported in previous work and that cause different back-end visual encoders in PubMedCLIP to exhibit different behavior on these datasets. Moreover, we observe fundamental performance differences between VQA in the general domain and in the medical domain.
Keywords: Medical visual question answering • Deep representation learning • CLIP • PubMedCLIP
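To make the abstract's description of PubMedCLIP more concrete, the sketch below illustrates what contrastive fine-tuning of CLIP on medical image-caption pairs might look like. It is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes the Hugging Face transformers CLIPModel/CLIPProcessor API, uses the "openai/clip-vit-base-patch32" checkpoint purely as an example, and relies on a hypothetical load_pubmed_pairs() helper standing in for a dataset of image-caption pairs derived from PubMed articles.

```python
# Minimal sketch of contrastive fine-tuning of CLIP on medical image-caption
# pairs (the general idea behind PubMedCLIP). Not the authors' released code;
# it assumes the Hugging Face `transformers` CLIP API and a hypothetical
# `load_pubmed_pairs()` helper yielding (PIL image, caption string) pairs.
import torch
from torch.utils.data import DataLoader
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(batch):
    images, captions = zip(*batch)
    # The processor tokenizes captions and resizes/normalizes images for CLIP.
    return processor(text=list(captions), images=list(images),
                     return_tensors="pt", padding=True, truncation=True)

# `load_pubmed_pairs()` is a placeholder for any dataset of image-caption
# pairs extracted from PubMed articles.
loader = DataLoader(load_pubmed_pairs(), batch_size=32, shuffle=True,
                    collate_fn=collate)

model.train()
for batch in loader:
    # With return_loss=True, CLIPModel computes the symmetric contrastive
    # loss (image-to-text and text-to-image) internally.
    outputs = model(**batch, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

After fine-tuning in this manner, the resulting image encoder can be reused as the visual back-end of a MedVQA pipeline such as MEVF or QCR, which is the role PubMedCLIP plays in the experiments described above.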
BACKGROUND AND SIGNIFICANCE
Medical visual question answering (MedVQA) is the task of answering natural language questions about a given medical image. To solve such multimodal tasks, a system must interpret both visual and textual data as well as infer the associations between a given image and a pertinent question sufficiently well to elicit an answer Antol et al. [2015]. The development of MedVQA has considerable potential to benefit healthcare systems, as it may aid clinicians in interpreting medical images, help them reach more accurate diagnoses by serving as a second opinion, and ultimately expedite and improve patient care. Achieving this in the medical domain in particular is non-trivial, as we suffer from a