Few-shot Adaptation of Multi-modal Foundation Models: A Survey
Fan Liu, Tianshu Zhang, Wenwen Dai, et al.
Abstract: Multi-modal models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of foundational visual models. These multi-modal models learn robust and aligned semantic representations from billions of internet image-text pairs and can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains, such as medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired.
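To illustrate the zero-shot usage pattern the abstract refers to, the following is a minimal sketch of CLIP-based zero-shot classification using the Hugging Face transformers API; the checkpoint name, image path, and prompt templates are illustrative assumptions and are not taken from the survey itself.

```python
# Minimal zero-shot classification sketch with a CLIP checkpoint.
# Assumptions: checkpoint "openai/clip-vit-base-patch32", a local image
# "example.jpg", and hand-written prompts standing in for class names.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
prompts = ["a photo of a cat", "a photo of a dog", "a photo of an airplane"]

# Encode image and text prompts jointly; CLIP scores image-text similarity.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds similarity scores between the image and each prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```

Few-shot adaptation methods surveyed in the paper aim to improve exactly this kind of prediction in fine-grained domains where the hand-written prompts and frozen representations above fall short.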