2020
DOI: 10.1109/tip.2020.3005508
Biased Mixtures of Experts: Enabling Computer Vision Inference Under Data Transfer Limitations

Abstract: We propose a novel mixture-of-experts class to optimize computer vision models in accordance with data transfer limitations at test time. Our approach postulates that the minimum acceptable amount of data allowing for highly accurate results can vary for different input space partitions. Therefore, we consider mixtures where experts require different amounts of data, and train a sparse gating function to divide the input space for each expert. By appropriate hyperparameter selection, our approach is able to bi…
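The abstract describes experts that consume different amounts of data plus a sparse gating function that partitions the input space among them. As a rough, hypothetical sketch of that idea (not the paper's implementation), the PyTorch snippet below routes each input to exactly one expert, where cheaper experts receive lower-resolution versions of the input; every name, resolution, and architecture choice here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasedMoESketch(nn.Module):
    """Hypothetical sketch: one expert per input resolution; a sparse gate
    picks a single expert, so cheaper experts transfer less input data."""
    def __init__(self, num_classes=10, resolutions=(8, 16, 32)):
        super().__init__()
        self.num_classes = num_classes
        self.resolutions = resolutions
        # tiny linear "experts"; real experts would be networks of varying cost
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Flatten(), nn.Linear(3 * r * r, num_classes))
            for r in resolutions
        )
        # the gate scores experts from a cheap 8x8 view of the input
        self.gate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, len(resolutions)))

    def forward(self, x):                              # x: (B, 3, 32, 32)
        cheap = F.adaptive_avg_pool2d(x, 8)
        expert_idx = self.gate(cheap).argmax(dim=-1)   # sparse: one expert per input
        out = x.new_zeros(x.size(0), self.num_classes)
        for i, (r, expert) in enumerate(zip(self.resolutions, self.experts)):
            sel = expert_idx == i
            if sel.any():
                # expert i only ever sees a resolution-r version of its inputs,
                # so routing to low-r experts transfers less data
                out[sel] = expert(F.adaptive_avg_pool2d(x[sel], r))
        return out, expert_idx
```

A hard argmax gate is non-differentiable, so training it requires a relaxation or auxiliary objective; the paper's actual training procedure and its biasing hyperparameters are not reproduced here.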

Cited by 5 publications (4 citation statements) · References 44 publications
“…Finally, we add a small amount of noise with standard deviation $\frac{1}{E}$ to the activations $Wx$, which we find improves performance. We empirically found this performed well but that the setup was robust to this parameter.…”
Section: Routing (mentioning)
Confidence: 95%
“…For each MoE layer in V-MoE, we use the routing function $g(x) = \mathrm{TOP}_k\left(\mathrm{softmax}\left(Wx + \epsilon\right)\right)$, where $\mathrm{TOP}_k$ is an operation that sets all elements of the vector to zero except the elements with the largest $k$ values, and $\epsilon$ is sampled independently $\epsilon \sim \mathcal{N}\left(0, \frac{1}{E^2}\right)$ entry-wise. In practice, we use $k = 1$ or $k = 2$.…”
Section: Routing (mentioning)
Confidence: 99%
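Written out, the routing rule quoted above is straightforward to implement: add entry-wise Gaussian noise of standard deviation $\frac{1}{E}$ to the activations $Wx$, take a softmax, and zero all but the $k$ largest entries. The sketch below (in the same illustrative PyTorch as above) does exactly that; tensor shapes and variable names are assumptions, not code from the citing V-MoE work.

```python
import torch

def noisy_topk_routing(x, W, k=2):
    """Sketch of g(x) = TOP_k(softmax(Wx + eps)), eps ~ N(0, 1/E^2) entry-wise."""
    E = W.size(0)                                     # E = number of experts
    logits = x @ W.t()                                # (batch, E) = Wx
    logits = logits + torch.randn_like(logits) / E    # noise with std 1/E
    probs = torch.softmax(logits, dim=-1)
    # TOP_k: keep only the k largest entries per row, zero out the rest
    topk_vals, topk_idx = probs.topk(k, dim=-1)
    gates = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    return gates                                      # sparse routing weights

# usage: route a batch of 4 tokens of dim 16 over E = 8 experts with k = 2
x = torch.randn(4, 16)
W = torch.randn(8, 16)
g = noisy_topk_routing(x, W, k=2)   # each row has exactly 2 non-zero entries
```

The noise only perturbs routing decisions near ties, which matches the quoted observation that performance is robust to its exact scale.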