Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks simultaneously. This paper presents a constrained formulation for multi-task RL in which the goal is to maximize the average performance of the policy across tasks subject to bounds on its performance in each task. We consider solving this problem both in the centralized setting, where information for all tasks is accessible to a single server, and in the decentralized setting, where a network of agents, each assigned one task and observing only local information, cooperate to solve the globally constrained objective using local communication. We first propose a primal-dual algorithm that provably converges to the globally optimal solution of this constrained formulation under exact gradient evaluations. When the gradient is unknown, we further develop a sample-based actor-critic algorithm that finds the optimal policy using online samples of states, actions, and rewards. Finally, we study the extension of the algorithm to the linear function approximation setting.
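To make the constrained formulation above concrete, one way to write it is given below; the notation ($N$ tasks, per-task value $J_i(\pi)$, bounds $b_i$, discount factor $\gamma$) is an illustrative choice and may differ from the notation adopted later in the paper.
\[
\max_{\pi}\;\; \frac{1}{N}\sum_{i=1}^{N} J_i(\pi)
\qquad \text{subject to} \qquad J_i(\pi) \ge b_i, \quad i = 1,\dots,N,
\]
where $J_i(\pi) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_i(s_t, a_t) \,\middle|\, \pi, \mathcal{M}_i\right]$ denotes the expected discounted cumulative reward of policy $\pi$ on the $i$-th MDP $\mathcal{M}_i$, and $b_i$ lower-bounds the acceptable performance on task $i$. Removing the constraints (e.g., setting every $b_i$ to $-\infty$) recovers the standard unconstrained average-performance objective.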
Introduction

Multi-task reinforcement learning (RL) aims to find a common policy that effectively solves a range of tasks simultaneously, where each task is a policy optimization problem defined over a Markov decision process (MDP). The MDPs can in general have different state spaces, reward functions, and transition kernels, but may be implicitly or explicitly correlated.

The most common mathematical formulation for multi-task RL is to maximize the average cumulative reward collected by a single policy across all MDPs (Zeng et al., 2021; Jiang et al., 2022; Junru et al., 2022). In this paper, we study a generalized formulation in which we maximize the average cumulative reward subject to constraints on the performance of the policy in each task. This formulation is a special case of the policy optimization problem for a constrained Markov decision process (CMDP) (Altman, 1999) and is a flexible framework that allows a more fine-grained specification of the optimal policy's performance in each task. In applications where the tasks exhibit major conflicts of interest and/or the magnitude of the rewards varies significantly across tasks (Kalashnikov et al., 2021; Guo et al., 2022), the optimal policy under the average-cumulative-reward formulation may excel in some tasks at the cost of compromised performance in others (Hayes et al., 2022). The constrained formulation provides a way to mitigate this task imbalance. Illustrative numerical simulations are given in Section 7.

Under the constrained multi-task formulation, we consider centralized and decentralized learning paradigms. "Centralized" in this context means that information for all tasks is available at a single server, while "decen-