2022
DOI: 10.1007/978-3-031-19809-0_14
DisCo: Remedying Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning

Cited by 22 publications (12 citation statements)
References 17 publications
“…Importantly, RoB removes the regularisation terms that aim at preventing collapse from the loss, and uses identical-view predictions instead of cross-view predictions in the loss. [42], DisCo [14] or SimReg [30] have noticed that joint-embedding self-supervised learning methods such as SwAV [8], MoCo [11,21] or DINO [9] suffer from a drop in performance when applied on low-compute neural nets. These works have proposed to use Knowledge Distillation [23] to circumvent those difficulties.…”
Section: Related Work
confidence: 99%
“…CompRess [27] and SEED [13] use a memory queue like MoCo [21] to distill the knowledge of the teacher by minimizing the cross-entropy between the probability distributions of the teacher and student, obtained by comparing a sample to each point in the queue. DisCo [14] and BINGO [42] make use of contrastive learning, with BINGO additionally grouping samples into clusters of related samples. Finally, SimReg [30] proposes regression as a generic way to transfer feature representations from a teacher to a student.…”
Section: Related Work
confidence: 99%
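The queue-based distillation objective described in the statement above can be sketched as follows. This is a minimal NumPy illustration, not the actual implementation from CompRess or SEED; the function name, queue size, and temperature value are assumptions for the example. It computes each embedding's similarity distribution over a shared memory queue and takes the cross-entropy of the student's distribution against the teacher's soft targets:

```python
import numpy as np

def queue_distillation_loss(student_feat, teacher_feat, queue, tau=0.07):
    """Cross-entropy between teacher and student similarity distributions
    over a shared memory queue (a SEED/CompRess-style sketch).

    student_feat, teacher_feat: L2-normalized embeddings, shape (d,)
    queue: L2-normalized memory bank of past embeddings, shape (K, d)
    """
    # Temperature-scaled cosine similarities to every queue entry.
    s_sim = queue @ student_feat / tau
    t_sim = queue @ teacher_feat / tau

    # Teacher similarities -> soft target distribution over the queue.
    t_exp = np.exp(t_sim - t_sim.max())
    p_teacher = t_exp / t_exp.sum()

    # Student log-softmax over the queue (numerically stable logsumexp).
    log_p_student = s_sim - (s_sim.max()
                             + np.log(np.exp(s_sim - s_sim.max()).sum()))

    # Cross-entropy H(p_teacher, p_student); minimized when the student's
    # distribution over the queue matches the teacher's.
    return float(-(p_teacher * log_p_student).sum())
```

Because the loss is a cross-entropy against the teacher's distribution, it is minimized exactly when the student reproduces the teacher's relative similarities to the queue entries, which is what lets a small student inherit the structure of a large teacher's embedding space.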