Scaling Vision with Sparse Mixture of Experts

Riquelme, Carlos; Puigcerver, Joan; Mustafa, Basil; Neumann, Maxim; Jenatton, Rodolphe; Pinto, André Susano; Keysers, Daniel; Houlsby, Neil

doi:10.48550/arxiv.2106.05974

Cited by 13 publications

(26 citation statements)

References 48 publications


“…When capacity factors are less than one, the model learns to not apply computation to certain tokens. This has shown promise in computer vision (Riquelme et al, 2021) and our language experiments (Appendix D). We envision future models expanding this through heterogeneous experts (e.g.…”

Section: Discussion (mentioning)

confidence: 91%


ST-MoE: Designing Stable and Transferable Sparse Expert Models

Zoph, Bello, Kumar et al. 2022 (Preprint)


Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).


“…Batch Prioritized Routing (BPR) from Riquelme et al (2021) was introduced in Vision Transformers (Dosovitskiy et al, 2020) for image classification. Our work explores BPR with top-1 routing in the context of language modeling.…”

Section: Batch Prioritized Routing For Lower Capacity Factors (mentioning)

confidence: 99%

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Zoph, Bello, Kumar et al. 2022 (Preprint)

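The two ST-MoE statements above refer to capacity factors below one and to Batch Prioritized Routing (BPR) with top-1 routing. As a rough illustration only, the sketch below contrasts first-come-first-served allocation with priority-ordered allocation when each expert's buffer is too small for all tokens. It is a minimal sketch under stated assumptions, not code from either paper: the function name `route_top1`, the use of the top-1 router probability as the priority score, and the rounding rule for per-expert capacity are all assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(logits, capacity_factor, prioritized):
    """Assign each token to its top-1 expert, subject to a per-expert capacity.

    logits: one list of router logits per token (tokens x experts).
    Returns a list with the chosen expert index per token, or None when the
    token is dropped because the expert's buffer is already full.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    # Assumed capacity rule: tokens * capacity_factor / experts, rounded down,
    # with a floor of one slot per expert.
    capacity = max(1, int(num_tokens * capacity_factor / num_experts))

    probs = [softmax(row) for row in logits]
    choice = [p.index(max(p)) for p in probs]      # top-1 expert per token
    score = [p[c] for p, c in zip(probs, choice)]  # assumed priority score

    order = list(range(num_tokens))
    if prioritized:
        # Priority-ordered allocation: highest-scoring tokens claim slots first.
        order.sort(key=lambda t: -score[t])

    load = [0] * num_experts
    assignment = [None] * num_tokens
    for t in order:
        e = choice[t]
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
    return assignment


# Four tokens, two experts, capacity factor 0.5: one slot per expert.
example_logits = [
    [0.2, 0.0],  # weak preference for expert 0
    [3.0, 0.0],  # strong preference for expert 0
    [0.0, 0.1],  # weak preference for expert 1
    [0.0, 2.0],  # strong preference for expert 1
]
in_order = route_top1(example_logits, 0.5, prioritized=False)
by_priority = route_top1(example_logits, 0.5, prioritized=True)
```

In this toy input, position-ordered allocation keeps tokens 0 and 2 simply because they arrive first, while priority-ordered allocation keeps tokens 1 and 3, whose router scores are highest. Dropped tokens receive no expert computation, which is the behaviour the first statement describes for capacity factors below one.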

“…Most of them introduce locality into Transformer via introducing local attention mechanism [33], [42], [52], [53] or convolution [45]-[47]. Nowadays, the most recent supervised Transformers are exploring both the structural combination [37], [50] and scaling laws [36], [114]. In addition to supervised Transformers, self-supervised learning accounts for a substantial part of vision Transformers [61]-[66].…”

Section: Experimental Evaluation and Comparative Analysis (mentioning)

confidence: 99%

A Survey of Visual Transformers

Liu, Zhang, Wang et al. 2021 (Preprint)


Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently been done on adapting Transformer-like architectures to Computer Vision (CV) fields, which have demonstrated their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO and ADE20k as compared with modern Convolutional Neural Networks (CNNs). In this paper, we have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), where a taxonomy is proposed to organize these methods according to their motivations, structures, and usage scenarios. Because of the differences in training settings and oriented tasks, we have also evaluated these methods on different configurations for easy and intuitive comparison instead of only various benchmarks. Furthermore, we have revealed a series of essential but unexploited aspects that may empower Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.


“…Sparse models only apply a small subset of the parameters to process each example, as dictated by trainable routers. This conditional computation approach was then extended in Riquelme et al (2021) to Vision Mixture of Experts (V-MoE). Whilst V-MoE successfully increases the number of parameters by an order of magnitude, the model still handles increasingly more tokens at higher resolutions.…”

Section: Introduction (mentioning)

confidence: 99%

Learning to Merge Tokens in Vision Transformers

Renggli, Pinto, Houlsby et al. 2022 (Preprint, self-citation)


Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In order for large-scale models to remain practical in real-world systems, there is a need for reducing their computational overhead. In this work, we present the PATCHMERGER, a simple module that reduces the number of patches or tokens the network has to process by merging them between two consecutive intermediate layers. We show that the PATCHMERGER achieves a significant speedup across various model sizes while matching the original performance both upstream and downstream after fine-tuning.
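The abstract above says the module merges patches or tokens between two consecutive layers but does not state the merging rule. The sketch below shows one plausible way to reduce N tokens to a smaller fixed number M: each output token is a softmax-weighted average of all input tokens. Everything here is an assumption for illustration, including the function name `merge_tokens` and the `queries` parameter, which stands in for hypothetical learned weights; it should not be read as the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_tokens(tokens, queries):
    """Reduce N input tokens to M output tokens.

    tokens:  N vectors of dimension D (the tokens leaving one layer).
    queries: M vectors of dimension D (hypothetical learned parameters).
    Returns M vectors of dimension D, each a convex combination of the inputs.
    """
    dim = len(tokens[0])
    merged = []
    for q in queries:
        # Similarity of this output slot to every input token.
        scores = [sum(qi * ti for qi, ti in zip(q, tok)) for tok in tokens]
        weights = softmax(scores)
        # Weighted average of the input tokens, one coordinate at a time.
        merged.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return merged


# Four 2-dimensional tokens reduced to two tokens.
example_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
example_queries = [[0.0, 0.0], [5.0, 0.0]]
merged_tokens = merge_tokens(example_tokens, example_queries)
```

With an all-zero query the weights are uniform, so the first output is the plain mean of the inputs; a query aligned with the first coordinate pulls its output toward tokens that are large in that coordinate. Layers after the merge then process M tokens instead of N, which is where the compute saving described in the abstract would come from.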
