Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks simultaneously. This paper presents a constrained formulation for multi-task RL in which the goal is to maximize the average performance of the policy across tasks subject to bounds on its performance in each task. We consider solving this problem both in the centralized setting, where information for all tasks is accessible to a single server, and in the decentralized setting, where a network of agents, each assigned one task and observing only local information, cooperate to solve the globally constrained objective using local communication. We first propose a primal-dual algorithm that provably converges to the globally optimal solution of this constrained formulation under exact gradient evaluations. When the gradient is unknown, we further develop a sample-based actor-critic algorithm that finds the optimal policy using online samples of states, actions, and rewards. Finally, we study the extension of the algorithm to the linear function approximation setting.
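To make the constrained formulation above concrete, one way to write it is given below; the notation ($N$ tasks, per-task value $J_i(\pi)$, bounds $b_i$, discount factor $\gamma$) is an illustrative choice and may differ from the notation adopted later in the paper.
\[
\max_{\pi}\;\; \frac{1}{N}\sum_{i=1}^{N} J_i(\pi)
\qquad \text{subject to} \qquad J_i(\pi) \ge b_i, \quad i = 1,\dots,N,
\]
where $J_i(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_i(s_t, a_t) \,\middle|\, \pi, \mathcal{M}_i\right]$ denotes the expected discounted cumulative reward of policy $\pi$ on the $i$-th MDP $\mathcal{M}_i$, and $b_i$ lower-bounds the acceptable performance on task $i$. Removing the constraints (e.g., setting every $b_i$ to $-\infty$) recovers the standard unconstrained average-performance objective.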
Introduction

Multi-task reinforcement learning (RL) aims to find a common policy that effectively solves a range of tasks simultaneously, where each task is a policy optimization problem defined over a Markov decision process (MDP). The MDPs can in general have different state spaces, reward functions, and transition kernels, but may be implicitly or explicitly correlated.

The most common mathematical formulation for multi-task RL is to maximize the average cumulative reward collected by a single policy across all MDPs (Zeng et al., 2021; Jiang et al., 2022; Junru et al., 2022). In this paper, we study a generalized formulation in which we maximize the average cumulative reward subject to constraints on the performance of the policy in each task. This formulation is a special case of the policy optimization problem for a constrained Markov decision process (CMDP) (Altman, 1999) and is a flexible framework that allows a more fine-grained specification of the optimal policy's performance in each task. In applications where the tasks exhibit major conflicts of interest and/or the magnitude of the rewards varies significantly across tasks (Kalashnikov et al., 2021; Guo et al., 2022), the optimal policy under the average-cumulative-reward formulation may excel in some tasks at the cost of compromised performance in others (Hayes et al., 2022). The constrained formulation provides a way to mitigate this task imbalance. Illustrative numerical simulations are given in Section 7.

Under the constrained multi-task formulation, we consider centralized and decentralized learning paradigms. "Centralized" in this context means that information for all tasks is available at a single server, while "decen-