Gradient Surgery for Multi-Task Learning

Yu, Tianhe; Kumar, Saurabh; Gupta, Abhishek; Levine, Sergey; Hausman, Karol; Finn, Chelsea

doi:10.48550/arxiv.2001.06782

Cited by 84 publications

(138 citation statements)

References 0 publications

Supporting

Mentioning

137

Contrasting


“…Reweighting has also become popular in multitask learning (Chen et al, 2018; Kendall et al, 2018), where different tasks must be balanced with each other for optimal training. Multitask learning also has popularized gradient comparison techniques (Yu et al, 2020; Chen et al, 2020), which we leverage heavily within this current work.…”

Section: Related Work (mentioning)

confidence: 99%

GradTail: Learning Long-Tailed Data Using Gradient-based Sample Weighting

Chen, Casser, Kretzschmar, et al. 2022

Preprint


We propose GradTail, an algorithm that uses gradients to improve model performance on the fly in the face of long-tailed training data distributions. Unlike conventional long-tail classifiers, which operate on converged (and possibly overfit) models, we demonstrate that an approach based on gradient dot product agreement can isolate long-tailed data early on during model training and improve performance by dynamically picking higher sample weights for that data. We show that such upweighting leads to model improvements for both classification and regression models, the latter of which are relatively unexplored in the long-tail literature, and that the long-tail examples found by gradient alignment are consistent with our semantic expectations.
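As a rough illustration of the gradient-agreement idea in the abstract above, the following sketch gives a higher weight to samples whose gradient disagrees with the batch-mean gradient. The weighting rule (`exp(-sharpness * cosine)`), the function name `agreement_weights`, and the toy gradients are assumptions made for illustration only; the abstract does not state GradTail's actual weighting function, so this should not be read as the authors' algorithm.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is zero."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot(u, v) / (nu * nv)


def agreement_weights(per_sample_grads, sharpness=1.0):
    """Hypothetical rule: weight_i is proportional to
    exp(-sharpness * cos(g_i, mean gradient)), rescaled so the
    weights average to 1. Low agreement gives a high weight."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    mean_grad = [sum(g[k] for g in per_sample_grads) / n for k in range(dim)]
    raw = [math.exp(-sharpness * cosine(g, mean_grad)) for g in per_sample_grads]
    scale = n / sum(raw)
    return [w * scale for w in raw]


# Toy per-sample gradients: three "head" samples that agree with each
# other and one "tail" sample pointing in a different direction.
grads = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [-0.5, 1.0]]
weights = agreement_weights(grads)
print(weights)
```

In this toy batch the fourth sample is orthogonal to the mean gradient while the others are closely aligned with it, so it receives the largest weight.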



“…Sharing parameters across tasks [Parisotto et al, 2015, Rusu et al, 2015, Teh et al, 2017] usually results in conflicting gradients from different tasks. One way to mitigate this is to explicitly model the similarity between gradients obtained from different tasks [Yu et al, 2020, Zhang and Yeung, 2014, Kendall et al, 2018, Lin et al, 2019, Sener and Koltun, 2018, Du et al, 2018]. On the other hand, researchers propose to utilize different modules for different tasks, thus reducing the interference of gradients from different tasks [Singh, 1992, Andreas et al, 2017, Rusu et al, 2016, Qureshi et al, 2019, Peng et al, 2019, Haarnoja et al, 2018, Sahni et al, 2017].…”

Section: Related Work (mentioning)

confidence: 99%

“…Multi-task learning is notoriously difficult [Caruana, 1997, Ruder, 2017], and Yu et al [2020] hypothesize that the optimization difficulties might be due to the gradients from different tasks conflicting with each other, thus hurting the learning process. In this work, we propose a multi-task bilevel learning framework for more effective multi-objective curricula DRL learning.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Multi-Objective Curricula for Deep Reinforcement Learning

Kang, Liu, Gupta, et al. 2021

Preprint


Various automatic curriculum learning (ACL) methods have been proposed to improve the sample efficiency and final performance of deep reinforcement learning (DRL). They are designed to control how a DRL agent collects data, which is inspired by how humans gradually adapt their learning processes to their capabilities. For example, ACL can be used for subgoal generation, reward shaping, environment generation, or initial state generation. However, prior work only considers curriculum learning following one of the aforementioned predefined paradigms. It is unclear which of these paradigms are complementary, and how the combination of them can be learned from interactions with the environment. Therefore, in this paper, we propose a unified automatic curriculum learning framework to create multi-objective but coherent curricula that are generated by a set of parametric curriculum modules. Each curriculum module is instantiated as a neural network and is responsible for generating a particular curriculum. In order to coordinate those potentially conflicting modules in unified parameter space, we propose a multi-task hyper-net learning framework that uses a single hyper-net to parameterize all those curriculum modules. In addition to existing hand-designed curricula paradigms, we further design a flexible memory mechanism to learn an abstract curriculum, which may otherwise be difficult to design manually. We evaluate our method on a series of robotic manipulation tasks and demonstrate its superiority over other state-of-the-art ACL methods in terms of sample efficiency and final performance.
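The citation statements from this paper describe gradients from different tasks "conflicting with each other". A common way to make that concrete, and the criterion used in the gradient-surgery line of work, is to call two task gradients conflicting when their dot product (equivalently, their cosine similarity) is negative. The sketch below checks this for a few toy vectors; the function names and the gradient values are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is zero)."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot(u, v) / (nu * nv)


def conflicting_pairs(task_grads):
    """Return index pairs (i, j), i < j, whose gradients have a
    negative dot product, i.e. point in opposing directions."""
    pairs = []
    for i in range(len(task_grads)):
        for j in range(i + 1, len(task_grads)):
            if dot(task_grads[i], task_grads[j]) < 0.0:
                pairs.append((i, j))
    return pairs


# Toy gradients of a shared parameter vector for three tasks.
g_task0 = [1.0, 0.5]
g_task1 = [-1.0, 0.25]
g_task2 = [0.5, 1.0]

pairs = conflicting_pairs([g_task0, g_task1, g_task2])
print(pairs)  # tasks 0 and 1 conflict, as do tasks 1 and 2
```

With real networks the vectors would be the flattened gradients of each task loss with respect to the shared parameters; the check itself is unchanged.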


“…We do so by alternating gradient updates on batches sampled from each dataset in turn. Further details are in Appendix E. • CoTrain + PCGrad: An extension of CoTrain, where we leverage the method PCGrad [72] to perform gradient projection and prevent destructive gradient interference between updates from D_PT and D_FT. Further details and variants we tried are in Appendix E.…”

Section: Problem Setup (mentioning)

confidence: 99%

“…CoTrain + PCGrad details: In our implementation, we computed gradient updates using a batch of data from D_PT and D_FT separately, averaging the losses across the set of binary tasks in each dataset (5000 for D_PT and 40 for D_FT). PCGrad [72] was then used to compute the final gradient update given these two averaged losses. We also experimented with: (1) computing the overall update using all 5040 tasks (rather than averaging), but this was too memory expensive; and (2) computing the overall update using an average over the 5000 PT tasks and each of the 40 FT tasks individually, but this was unstable and did not converge.…”

Section: E.2 Further Experimental Details, E.2.1 Baselines (mentioning)

confidence: 99%

Meta-Learning to Improve Pre-Training

Raghu, Lorraine, Kornblith, et al. 2021

Preprint


Pre-training (PT) followed by fine-tuning (FT) is an effective method for training neural networks, and has led to significant performance improvements in many domains. PT can incorporate various design choices such as task and data reweighting strategies, augmentation policies, and noise models, all of which can significantly impact the quality of representations learned. The hyperparameters introduced by these strategies therefore must be tuned appropriately. However, setting the values of these hyperparameters is challenging. Most existing methods either struggle to scale to high dimensions, are too slow and memory-intensive, or cannot be directly applied to the two-stage PT and FT learning process. In this work, we propose an efficient, gradient-based algorithm to meta-learn PT hyperparameters. We formalize the PT hyperparameter optimization problem and propose a novel method to obtain PT hyperparameter gradients by combining implicit differentiation and backpropagation through unrolled optimization. We demonstrate that our method improves predictive performance on two real-world domains. First, we optimize high-dimensional task weighting hyperparameters for multitask pre-training on protein-protein interaction graphs and improve AUROC by up to 3.9%. Second, we optimize a data augmentation neural network for self-supervised PT with SimCLR on electrocardiography data and improve AUROC by up to 1.9%.

The PT & FT paradigm introduces high-dimensional, complex PT hyperparameters, such as parameterized data augmentation policies used in contrastive representation learning [8,22] or the use of task, class, or instance weighting variables in multi-task PT to avoid negative transfer [70]. These hyperparameters can significantly affect the quality of pre-trained models [8], and thus finding techniques to set their values optimally is an important area of research.

Choosing optimal PT hyperparameter values is challenging, and existing methods do not work well. Simple approaches such as random or grid search are inefficient since evaluating a hyperparameter setting requires performing the full, two-stage PT & FT optimization, which may be prohibitively computationally expensive. Gradient-free approaches, such as Bayesian optimization or evolutionary algorithms [33,61,47], are also limited in how well they scale to this setting. Gradient-based…

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
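The CoTrain + PCGrad baseline quoted above applies PCGrad to two averaged losses, one per dataset. The sketch below illustrates the PCGrad projection step for that two-gradient case: when a gradient has a negative dot product with the other, its component along the other is removed, and the projected gradients are then summed. The vectors `g_pt` and `g_ft` are toy stand-ins for the gradients of the averaged D_PT and D_FT losses, and the helper names are hypothetical; this is a minimal sketch and not the implementation used in the paper.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_away(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), subtract the
    component of g_i along g_j; otherwise return g_i unchanged."""
    d = dot(g_i, g_j)
    if d >= 0.0:
        return list(g_i)
    coef = d / dot(g_j, g_j)  # g_j is non-zero whenever d < 0
    return [a - coef * b for a, b in zip(g_i, g_j)]


def pcgrad(grads, rng=None):
    """Project each gradient against the others in random order,
    then sum the projected gradients into one update direction."""
    rng = rng or random.Random(0)
    projected = []
    for i, g in enumerate(grads):
        g_pc = list(g)
        others = [j for j in range(len(grads)) if j != i]
        rng.shuffle(others)
        for j in others:
            g_pc = project_away(g_pc, grads[j])
        projected.append(g_pc)
    dim = len(grads[0])
    return [sum(g[k] for g in projected) for k in range(dim)]


# Toy gradients standing in for the averaged PT and FT losses.
g_pt = [1.0, 0.0]
g_ft = [-1.0, 1.0]  # dot(g_pt, g_ft) = -1.0, so the two conflict

update = pcgrad([g_pt, g_ft])
print(update)
```

With only two gradients the random ordering has no effect. It matters once more tasks are projected against each other, which is the memory-heavy variant the quoted appendix reports trying with all 5040 tasks.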
