Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Van Gansbeke, Wouter; Vandenhende, Simon; Georgoulis, Stamatios; Van Gool, Luc

doi:10.48550/arxiv.2106.05967

Cited by 3 publications (8 citation statements)

References 46 publications (103 reference statements)


“…The model can match positive pairs by attending only to the essential part of the representation, while ignoring other non-essential variations. As a result, different images with similar visual concepts are grouped together, inducing a latent space with rich semantic information [65,10,66]. This is evidenced by the results shown in Figure 1, where MoCo [25] achieve high performance on tasks that require a deeper semantic understanding of images.…”

Section: Semantic Correspondence Learning (mentioning)

confidence: 89%

Semantic-Aware Fine-Grained Correspondence

Hu, Wang, Zhang et al. 2022 (preprint)


Establishing visual correspondence across images is a challenging and essential task. Recently, an influx of self-supervised methods have been proposed to better learn representations for visual correspondence. However, we find that these methods often fail to leverage semantic information and over-rely on the matching of low-level features. In contrast, human vision is capable of distinguishing between distinct objects as a pretext to tracking. Inspired by this paradigm, we propose to learn semantic-aware fine-grained correspondence. Firstly, we demonstrate that semantic correspondence is implicitly available through a rich set of image-level self-supervised methods. We further design a pixel-level self-supervised learning objective which specifically targets fine-grained correspondence. For downstream tasks, we fuse these two kinds of complementary correspondence representations together, demonstrating that they boost performance synergistically. Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks, including video object segmentation, human pose tracking, and human part tracking. Code is available at https://github.com/Alxead/SFC.



“…2) On VOC, we follow Van Gansbeke et al (2021) to predict semantic segmentation via nearest neighbor search from the labeled VOC training set. We also evaluate performance by fine-tuning models on the training set and testing on the validation set.…”

Section: Methods (mentioning)

confidence: 99%
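The nearest-neighbor evaluation mentioned in this statement can be illustrated with a minimal sketch. It assumes each feature vector stands for one pixel or region and that similarity is cosine similarity; the statement specifies neither, and the function names `cosine` and `nn_label_transfer` are hypothetical, not taken from either paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def nn_label_transfer(query_feats, train_feats, train_labels):
    """Give each query feature the label of its most similar labeled feature.

    Assumption: one feature vector per pixel or region; the granularity is
    not stated in the citation statement.
    """
    preds = []
    for q in query_feats:
        best = max(range(len(train_feats)), key=lambda i: cosine(q, train_feats[i]))
        preds.append(train_labels[best])
    return preds


# Toy usage: two labeled features, two unlabeled queries.
train_feats = [[1.0, 0.0], [0.0, 1.0]]
train_labels = ["cat", "background"]
queries = [[0.9, 0.2], [0.1, 0.8]]
predictions = nn_label_transfer(queries, train_feats, train_labels)
```

No parameters are trained in this procedure, so the result depends only on the quality of the frozen features.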

“…Recent works can be categorized into three camps. 1) A straightforward approach is to leverage self-supervised image recognition and transfer the model to segmentation by increasing the location sensitivity (Wu et al, 2018;He et al, 2020;Chen et al, 2020;Wang et al, 2021c), adding an contrastive loss across views (Wang et al, 2021b), or by stronger augmentation and constrained cropping (Van Gansbeke et al, 2021;Selvaraju et al, 2021). 2) A pixel-wise cluster predictor can be learned by maximizing the mutual information between cluster predictions on augmented views of the same instance at corresponding pixels (Ji et al, 2019;Ouali et al, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens

Ke, Hwang, Yu 2022

Preprint


Recognizing an image and segmenting it into coherent regions are often treated as separate tasks. Human vision, however, has a general sense of segmentation hierarchy before recognition occurs. We are thus inspired to learn image recognition with hierarchical image segmentation based entirely on unlabeled images. Our insight is to learn fine-to-coarse features concurrently at superpixels, segments, and full image levels, enforcing consistency and goodness of feature induced segmentations while maximizing discrimination among image instances. Our model innovates vision transformers on three aspects. 1) We use adaptive segment tokens instead of fixed-shape patch tokens. 2) We create a token hierarchy by inserting graph pooling between transformer blocks, naturally producing consistent multi-scale segmentations while increasing the segment size and reducing the number of tokens. 3) We produce hierarchical image segmentation for free while training for recognition by maximizing image-wise discrimination. Our work delivers the first concurrent recognition and hierarchical segmentation model without any supervision. Validated on ImageNet and PASCAL VOC, it achieves better recognition and segmentation with higher computational efficiency.


“…Nearest-neighbor supervision Recently, researchers have exploited nearest-neighbor supervision to learn visual features (Dwibedi et al, 2021;Van Gansbeke et al, 2021). They find that using nearest-neighbor as positive samples in the contrastive loss improves the performances on multiple downstream tasks.…”

Section: Supervision (mentioning)

confidence: 99%
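The idea of using a nearest neighbor as the positive sample in a contrastive loss can be sketched as follows. This is an illustrative InfoNCE-style loss under stated assumptions, not the implementation of either cited work: the support set `bank`, the temperature value, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nearest_neighbor(query, bank):
    """Index of the bank embedding with the highest cosine similarity to query."""
    q = l2_normalize(query)
    sims = [dot(q, l2_normalize(b)) for b in bank]
    return max(range(len(bank)), key=lambda i: sims[i])


def nn_info_nce(anchor, bank, negatives, temperature=0.1):
    """InfoNCE-style loss whose positive is the anchor's nearest neighbor in bank.

    Assumption: `bank` is a hypothetical support set of embeddings and
    `negatives` holds embeddings of other images.
    """
    a = l2_normalize(anchor)
    positive = l2_normalize(bank[nearest_neighbor(anchor, bank)])
    pos_logit = dot(a, positive) / temperature
    logits = [pos_logit] + [dot(a, l2_normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - pos_logit


# Toy usage: the same anchor scored against a dissimilar and a similar negative.
anchor = [1.0, 0.0]
bank = [[0.0, 1.0], [0.9, 0.1]]
easy_loss = nn_info_nce(anchor, bank, negatives=[[-1.0, 0.0]])
hard_loss = nn_info_nce(anchor, bank, negatives=[[1.0, 0.0]])
```

The loss is small when the negatives are dissimilar to the anchor and grows when a negative is as close to the anchor as the retrieved neighbor.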

“…4(b). This intuitive idea is akin to the successful Multi-crop transformation (Caron et al, 2020;Van Gansbeke et al, 2021) in image SSL. We further extend it into the multi-modal setting.…”

Section: Multi-view Supervision (mentioning)

confidence: 99%

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Li, Liang, Zhao et al. 2021

Preprint


Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet50 while using 7.1× fewer data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, Scaling up the model and computing also works well in our framework. Our code, dataset and models are released at: https://github.com/Sense-GVT/ * The first three authors contribute equally. The order is determined by dice rolling.

