2023
DOI: 10.21203/rs.3.rs-3497100/v1
Preprint

Few-shot Adaptation of Multi-modal Foundation Models: A Survey

Fan Liu,
Tianshu Zhang,
Wenwen Dai
et al.

Abstract: Multi-modal models such as CLIP are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These multi-modal models learn robust, aligned semantic representations from billions of internet image-text pairs and can be applied to a variety of downstream tasks in a zero-shot manner. However, in some fine-grained domains, such as medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. …
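For context, the zero-shot transfer the abstract refers to works by embedding an image and a set of candidate text prompts into a shared space and scoring their similarity. Below is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name, image path, and label prompts are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of CLIP-style zero-shot classification, the setting the
# survey builds on. Assumes the Hugging Face `transformers` library; the
# image path and label prompts are hypothetical.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical image path
labels = ["a photo of a cat", "a photo of a dog", "a satellite image of a river"]

# Tokenize the prompts and preprocess the image in one call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into per-label probabilities, so no task-specific training is needed
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: f"{p:.3f}" for label, p in zip(labels, probs[0].tolist())})
```

In fine-grained domains like the medical imaging and remote sensing cases the abstract mentions, the hand-written prompts above are often where zero-shot performance breaks down, which is what motivates the few-shot adaptation methods the survey covers.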

Cited by 1 publication
References: 56 publications (75 reference statements)