2021
DOI: 10.48550/arxiv.2111.11432
Preprint

Florence: A New Foundation Model for Computer Vision

Abstract: Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and Wu D…

Cited by 113 publications (182 citation statements)
References 56 publications
“…Our results with just K400 (86.7%) is already similar to recent 86.5% Florence [95] and 86.8% SwinV2-G [58]. Florence uses 900M curated text-image pairs.…”
Section: Main Results On Kinetics
Citation type: supporting
confidence: 86%
“…Note that these models are strong baselines and are state-of-the-art for training-from-scratch on their own. Still, 300 epochs of MaskFeat pre-training improve the scratch MViT-S, 16×4 [56] Sup., JFT-300M 84.9 95.8 3981×3×4 TokenLearner [75] Sup., JFT-300M 85.4 N/A 4076×3×4 Florence↑384 [95] Text, FLD-900M 86.5 3.…”
Section: Main Results On Kinetics
Citation type: mentioning
confidence: 99%
“…2. In the second part of the tables, we compare to methods that are pretrained on web-scale datasets such as Instagram 65M [25], JFT-300M [62], JFT-3B [81], WTS [61], Florence [80] or HowTo100M [47]. Observe that we achieve state-of-the-art results both with and without web-scale pretraining.…”
Section: Comparison To the State-of-the-art
Citation type: mentioning
confidence: 98%
“…In contrast, our benchmark focuses on task-level transfer across domains, i.e., it aims to evaluate the transferability of models, by pre-training from their own large corpus, then evaluating zero-shot performance on a diverse set of downstream datasets. This setting has been recently studied [32,51,33,72], and is arguably more practical for real-world applications, as it brings the convenience towards the spirit of one-model-for-all. The well-known ImageNet-1K dataset [9] was originally proposed as a large dataset for model training and testing.…”
Section: Visual Recognition Benchmarks: Zero-shot and Transfer Learning
Citation type: mentioning
confidence: 99%
“…The success has quickly inspired many follow-up large-scale pre-training works [68,72,69,36,43,20,34,74]. Each of them developed their own evaluation experiments, covering a customized subset of tasks, and leaving the details of model adaptation process less accessible.…”
Section: Introduction
Citation type: mentioning
confidence: 99%