Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning

Rosenbaum, Clemens M.; Klinger, Tim; Riemer, Matthew

doi:10.48550/arxiv.1711.01239

Cited by 38 publications

(54 citation statements)

References 12 publications


“…To grow the number of model parameters without proportionally increasing the computational cost, conditional computation [5,15,12] only activates some relevant parts of the model in an input-dependent fashion, like in decision trees [7]. In deep learning, the activation of portions of the model can use stochastic neurons [6] or reinforcement learning [4,17,53].…”

Section: Related Work (mentioning)

confidence: 99%

Scaling Vision with Sparse Mixture of Experts

Riquelme, Puigcerver, Mustafa et al. 2021

Preprint


Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade off performance and compute smoothly at test time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
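As a rough illustration of the sparse routing idea described in this abstract, the following sketch sends one input through only its k highest-scoring experts, so most expert parameters stay unused for that input. It is not the V-MoE implementation: the function `route_top_k`, the linear gate `gate_rows`, the toy `tanh` experts, and all sizes are assumptions made for illustration.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def route_top_k(token, gate_rows, experts, k=2):
    """Hypothetical sparse layer: evaluate only the k experts the gate prefers."""
    # One gate score per expert, turned into a probability distribution.
    probs = softmax([dot(row, token) for row in gate_rows])
    # Indices of the k experts with the highest gate probability.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)  # the other experts are never called
        out = [o + probs[i] * y for o, y in zip(out, expert_out)]
    return out, chosen

def make_expert(scale):
    """Toy expert (assumption): an element-wise tanh with its own scale."""
    return lambda x: [math.tanh(scale * v) for v in x]

random.seed(0)
dim, n_experts = 4, 5
gate_rows = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [make_expert(s) for s in (0.5, 1.0, 1.5, 2.0, 2.5)]
token = [random.gauss(0, 1) for _ in range(dim)]

output, chosen = route_top_k(token, gate_rows, experts, k=2)
print("experts used:", chosen)
```

With k=2 of 5 experts, only two expert functions run per input; increasing k trades more compute for a denser mixture.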



“…BASELayers (Lewis et al, 2021) circumvents this problem by treating the routing mechanism as a linear expert-to-task assignment problem, without the need of auxiliary loss. Routing networks (Rosenbaum et al, 2017) learn better task representations by clustering and disentangling parameters conditioned on input.…”

Section: Related Work (mentioning)

confidence: 99%

Tricks for Training Sparse Translation Models

Dua, Bhosale, Goswami et al. 2021

Preprint


Multi-task learning with an unbalanced data distribution skews model learning towards high resource tasks, especially when model capacity is fixed and fully shared across all tasks. Sparse scaling architectures, such as BASELayers, provide flexible mechanisms for different tasks to have a variable number of parameters, which can be useful to counterbalance skewed data distributions. We find that sparse architectures for multilingual machine translation can perform poorly out of the box, and propose two straightforward techniques to mitigate this: a temperature heating mechanism and dense pre-training. Overall, these methods improve performance on two multilingual translation benchmarks compared to standard BASELayers and Dense scaling baselines, and in combination, more than 2x model convergence speed.


“…Multi-task Learning. Due to its benefit with regards to data and computational efficiency, multi-task learning (MTL) has broad applications in vision, language, and robotics [11,28,22,44,38]. A number of MTL-friendly architectures have been proposed using task-specific modules [25,11], attention-based mechanisms [21] or activating different paths along the deep networks to tackle MTL [27,40].…”

Section: Related Work (mentioning)

confidence: 99%

Conflict-Averse Gradient Descent for Multi-task Learning

Liu, Liu, Jin et al. 2021

Preprint


The goal of multi-task learning is to enable more efficient learning than single task learning by sharing model structures for a diverse set of tasks. A standard multi-task learning objective is to minimize the average loss across all tasks. While straightforward, using this objective often results in much worse final performance for each task than learning them independently. A major challenge in optimizing a multi-task model is the conflicting gradients, where gradients of different task objectives are not well aligned so that following the average gradient direction can be detrimental to specific tasks' performance. Previous work has proposed several heuristics to manipulate the task gradients for mitigating this problem. But most of them lack convergence guarantee and/or could converge to any Pareto-stationary point. In this paper, we introduce Conflict-Averse Gradient descent (CAGrad) which minimizes the average loss function, while leveraging the worst local improvement of individual tasks to regularize the algorithm trajectory. CAGrad balances the objectives automatically and still provably converges to a minimum over the average loss. It includes the regular gradient descent (GD) and the multiple gradient descent algorithm (MGDA) in the multi-objective optimization (MOO) literature as special cases. On a series of challenging multi-task supervised learning and reinforcement learning tasks, CAGrad achieves improved performance over prior state-of-the-art multi-objective gradient manipulation methods. Code is available at https://github.com/Cranial-XIX/CAGrad.
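The conflicting-gradient problem this abstract refers to can be seen in a small toy case. The sketch below is not CAGrad; it only shows, for two hypothetical quadratic task losses with made-up targets, that a step along the average gradient can lower the average loss while raising the loss of one task.

```python
import math

def loss(theta, target):
    """Toy task loss (assumption): squared distance to a task-specific target."""
    return sum((t - c) ** 2 for t, c in zip(theta, target))

def grad(theta, target):
    """Gradient of the squared-distance loss with respect to theta."""
    return [2 * (t - c) for t, c in zip(theta, target)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

theta = [0.0, 0.0]          # shared parameters
target_1 = [1.0, 0.0]       # hypothetical optimum for task 1
target_2 = [-4.0, 1.0]      # hypothetical optimum for task 2
lr = 0.1

g1, g2 = grad(theta, target_1), grad(theta, target_2)
g_avg = [(a + b) / 2 for a, b in zip(g1, g2)]

# Plain gradient descent on the average loss.
theta_new = [t - lr * g for t, g in zip(theta, g_avg)]

cos_12 = cosine(g1, g2)     # negative value means the task gradients conflict
task1_before, task1_after = loss(theta, target_1), loss(theta_new, target_1)
avg_before = (loss(theta, target_1) + loss(theta, target_2)) / 2
avg_after = (loss(theta_new, target_1) + loss(theta_new, target_2)) / 2

print("cosine(g1, g2):", cos_12)
print("task 1 loss:", task1_before, "->", task1_after)
print("average loss:", avg_before, "->", avg_after)
```

Here the larger gradient of task 2 dominates the average, so the update moves away from the optimum of task 1. Methods in this line of work adjust the update direction to limit that kind of per-task regression.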

