2021
DOI: 10.48550/arxiv.2106.05974
Preprint

Scaling Vision with Sparse Mixture of Experts

Abstract: Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as …
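To make the routing idea in the abstract concrete, here is a minimal NumPy sketch of a sparsely-gated MoE layer applied to patch tokens: a small router scores the experts for each token, and only the top-k experts process that token. The token count, feature width, expert count, and single-matrix "experts" are illustrative assumptions, not the actual V-MoE architecture.

```python
# Minimal sketch of sparsely-gated top-k expert routing over patch tokens.
# All shapes and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, d_model = 16, 32      # e.g. 16 patch tokens of width 32 (assumed)
num_experts, k = 4, 2             # route each token to its top-2 experts (assumed)

# Router: a single linear layer producing one logit per expert.
w_router = rng.normal(size=(d_model, num_experts))
# "Experts": one linear map each, standing in for small MLPs.
w_experts = rng.normal(size=(num_experts, d_model, d_model))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens):
    gates = softmax(tokens @ w_router)            # (tokens, experts)
    topk = np.argsort(-gates, axis=-1)[:, :k]     # top-k expert indices per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for e in topk[t]:
            # Each token is processed only by its selected experts,
            # weighted by the gate value.
            out[t] += gates[t, e] * (tokens[t] @ w_experts[e])
        out[t] /= gates[t, topk[t]].sum()         # renormalise over the chosen experts
    return out

y = moe_layer(rng.normal(size=(num_tokens, d_model)))
print(y.shape)  # (16, 32)
```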

Cited by 13 publications (26 citation statements)
References 48 publications
“…When capacity factors are less than one, the model learns to not apply computation to certain tokens. This has shown promise in computer vision (Riquelme et al, 2021) and our language experiments (Appendix D). We envision future models expanding this through heterogeneous experts (e.g.…”
Section: Discussion (mentioning)
confidence: 91%
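As a rough illustration of the "capacity factors less than one" behaviour mentioned in the statement above, the sketch below gives each expert a fixed number of buffer slots derived from the capacity factor; once an expert's buffer fills, the remaining tokens routed to it receive no expert computation. The sizes and the left-to-right filling order are assumptions for illustration only, not the cited papers' exact procedure.

```python
# Sketch: a capacity factor below 1 forces some tokens to be dropped, i.e.
# they pass through without any expert computation. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)

num_tokens, num_experts = 12, 3
capacity_factor = 0.5                                  # deliberately below 1
capacity = int(np.ceil(capacity_factor * num_tokens / num_experts))  # slots per expert

router_logits = rng.normal(size=(num_tokens, num_experts))
assignment = router_logits.argmax(axis=-1)             # top-1 expert per token

slots_used = np.zeros(num_experts, dtype=int)
processed = np.zeros(num_tokens, dtype=bool)
for t in range(num_tokens):                            # naive left-to-right filling
    e = assignment[t]
    if slots_used[e] < capacity:
        slots_used[e] += 1
        processed[t] = True                            # token gets expert compute
    # else: buffer full -> token is skipped (residual connection only)

print(f"capacity per expert: {capacity}")
print(f"tokens processed: {processed.sum()} / {num_tokens}")
```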
“…Batch Prioritized Routing (BPR) from Riquelme et al (2021) was introduced in Vision Transformers (Dosovitskiy et al, 2020) for image classification. Our work explores BPR with top-1 routing in the context of language modeling.…”
Section: Batch Prioritized Routing For Lower Capacity Factors (mentioning)
confidence: 99%
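The statement above refers to Batch Prioritized Routing (BPR). A hedged sketch of the underlying idea follows: rank tokens by their routing score and let high-priority tokens claim expert slots first, so that when capacity runs out it is the least-confident tokens that get skipped rather than whichever tokens happen to come last. The numbers and the use of the top gate value as the priority signal are illustrative assumptions.

```python
# Sketch of Batch Prioritized Routing with top-1 routing: fill expert buffers
# in order of routing score instead of token order. Values are illustrative.
import numpy as np

rng = np.random.default_rng(2)

num_tokens, num_experts = 12, 3
capacity = 2                                     # slots per expert (assumed)

gates = rng.random(size=(num_tokens, num_experts))
gates /= gates.sum(axis=-1, keepdims=True)       # stand-in for softmax router output
expert = gates.argmax(axis=-1)                   # top-1 routing decision
priority = gates.max(axis=-1)                    # priority = top gate value (assumed)

slots_used = np.zeros(num_experts, dtype=int)
kept = np.zeros(num_tokens, dtype=bool)
for t in np.argsort(-priority):                  # visit tokens by descending priority
    e = expert[t]
    if slots_used[e] < capacity:
        slots_used[e] += 1
        kept[t] = True                           # high-priority tokens claim slots first

print("kept tokens (by index):", np.flatnonzero(kept))
```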
“…Most of them introduce locality into Transformer via introducing local attention mechanism [33], [42], [52], [53] or convolution [45]- [47]. Nowadays, the most recent supervised Transformers are exploring both the structural combination [37], [50] and scaling laws [36], [114]. In addition to supervised Transformers, self-supervised learning accounts for a substantial part of vision Transformers [61]- [66].…”
Section: Experimental Evaluation and Comparative Analysis (mentioning)
confidence: 99%
“…Sparse models only apply a small subset of the parameters to process each example, as dictated by trainable routers. This conditional computation approach was then extended in Riquelme et al (2021) to Vision Mixture of Experts (V-MoE). Whilst V-MoE successfully increases the number of parameters by an order of magnitude, the model still handles increasingly more tokens at higher resolutions.…”
Section: Introduction (mentioning)
confidence: 99%
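A small back-of-the-envelope sketch of the final point in the statement above (the model still handles more tokens at higher resolutions): with a fixed patch size, the number of patch tokens a Vision Transformer, and hence V-MoE, must route grows quadratically with image resolution. The 16-pixel patch size is the common ViT choice; the resolutions are arbitrary examples.

```python
# Token count vs. resolution for a fixed patch size (illustrative resolutions).
patch = 16
for res in (224, 384, 512):
    tokens = (res // patch) ** 2
    print(f"{res}x{res} image -> {tokens} patch tokens")
# 224x224 -> 196, 384x384 -> 576, 512x512 -> 1024
```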