2021
DOI: 10.48550/arxiv.2106.05967
Preprint

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Abstract: Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this paper, we first study how biases in the dataset affect existing methods. Our results show that an approach like MoCo [22] works surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets. Second,…

Cited by 3 publications (8 citation statements). References 46 publications (103 reference statements).
“…The model can match positive pairs by attending only to the essential part of the representation, while ignoring other non-essential variations. As a result, different images with similar visual concepts are grouped together, inducing a latent space with rich semantic information [65,10,66]. This is evidenced by the results shown in Figure 1, where MoCo [25] achieves high performance on tasks that require a deeper semantic understanding of images.…”
Section: Semantic Correspondence Learning
confidence: 89%
“…2) On VOC, we follow Van Gansbeke et al. (2021) to predict semantic segmentation via nearest neighbor search from the labeled VOC training set. We also evaluate performance by fine-tuning models on the training set and testing on the validation set.…”
Section: Methods
confidence: 99%
“…Recent works can be categorized into three camps. 1) A straightforward approach is to leverage self-supervised image recognition and transfer the model to segmentation by increasing the location sensitivity (Wu et al., 2018; He et al., 2020; Chen et al., 2020; Wang et al., 2021c), adding a contrastive loss across views (Wang et al., 2021b), or by stronger augmentation and constrained cropping (Van Gansbeke et al., 2021; Selvaraju et al., 2021). 2) A pixel-wise cluster predictor can be learned by maximizing the mutual information between cluster predictions on augmented views of the same instance at corresponding pixels (Ji et al., 2019; Ouali et al., 2020).…”
Section: Related Work
confidence: 99%
“…Nearest-neighbor supervision Recently, researchers have exploited nearest-neighbor supervision to learn visual features (Dwibedi et al., 2021; Van Gansbeke et al., 2021). They find that using nearest neighbors as positive samples in the contrastive loss improves performance on multiple downstream tasks.…”
Section: Supervision
confidence: 99%
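The nearest-neighbor supervision described in the statement above can be sketched as follows. This is a minimal NumPy illustration of an InfoNCE-style loss where each key is replaced by its nearest neighbor from a support queue, not the cited papers' implementation; the embedding dimension, queue size, and temperature are illustrative assumptions:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def nn_contrastive_loss(queries, keys, support_queue, temperature=0.07):
    """InfoNCE-style loss with nearest-neighbor positives: each key is
    swapped for its nearest neighbor in a support queue before being
    contrasted against the queries (cf. the NNCLR idea)."""
    q = l2_normalize(queries)        # (N, D)
    k = l2_normalize(keys)           # (N, D)
    s = l2_normalize(support_queue)  # (Q, D)
    # For each key, find its nearest neighbor in the support queue.
    nn_idx = np.argmax(k @ s.T, axis=1)
    positives = s[nn_idx]                    # (N, D)
    logits = q @ positives.T / temperature   # (N, N); diagonal = positive pair
    # Cross-entropy where the i-th query must match the i-th neighbor.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.arange(len(q))
    return -log_prob[labels, labels].mean()

rng = np.random.default_rng(0)
loss = nn_contrastive_loss(rng.normal(size=(8, 16)),
                           rng.normal(size=(8, 16)),
                           rng.normal(size=(64, 16)))
```

Using a neighbor from the queue instead of the augmented key itself is what lets the loss group different images with similar visual concepts, rather than only different views of the same image.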
“…4(b). This intuitive idea is akin to the successful Multi-crop transformation (Caron et al., 2020; Van Gansbeke et al., 2021) in image SSL. We further extend it into the multi-modal setting.…”
Section: Multi-view Supervision
confidence: 99%
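The Multi-crop transformation referenced above can be sketched as follows. This is a hypothetical NumPy illustration of the sampling step only (a few large "global" views plus several small "local" views of the same image); the crop counts and sizes are assumed defaults, not the cited papers' exact settings:

```python
import numpy as np

def random_crop(img, size, rng):
    """Take a random size x size crop from an H x W x C image array."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def multi_crop(img, rng, n_global=2, n_local=4,
               global_size=160, local_size=64):
    """Multi-crop sampling: a small number of large 'global' crops plus
    several cheap low-resolution 'local' crops of the same image."""
    crops = [random_crop(img, global_size, rng) for _ in range(n_global)]
    crops += [random_crop(img, local_size, rng) for _ in range(n_local)]
    return crops

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
views = multi_crop(img, rng)  # 2 global + 4 local views
```

In the SSL methods cited, all views are then encoded and the local views are matched only against the global ones, which increases the number of positive pairs at little extra compute.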