2023
DOI: 10.48550/arxiv.2301.07094
Preprint

Learning Customized Visual Models with Retrieval-Augmented Knowledge

Abstract: Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models are achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framework to acquire the relevant web knowledge to build customized visual models for target domains. We retrieve the mos…

Cited by 2 publications (6 citation statements)
References 60 publications
“…Several works have focused on ways of improving upon different aspects of the contrastive vision-text models, such as their training objectives [12,15,66] or through scaling [9,47]. Yet, only little exploration has been done on their combination with memory or knowledge-based techniques [2,14,37,54]. REACT [37] retrieves image-text pairs from an external memory in order to build a training dataset specialized for a specific downstream task.…”
Section: Related Work
Confidence: 99%
“…Yet, only little exploration has been done on their combination with memory or knowledge-based techniques [2,14,37,54]. REACT [37] retrieves image-text pairs from an external memory in order to build a training dataset specialized for a specific downstream task. Unlike REACT [37], our work does not require any pre-knowledge about the nature of the downstream task, and is hence applicable in a full zero-shot transfer.…”
Section: Related Work
Confidence: 99%