We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDPs), and max-min trade-offs. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which systematically incorporates ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the algorithms designed within the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to several existing policy gradient-based approaches, both with exact gradients and in sample-based settings.

We study policy gradient-based approaches that optimize over a class of parameterized policies $\Pi = \{\pi_\theta : \theta \in \Theta\}$. In general, the optimization problems above may not be convex in $\theta$, not even for single-objective MDPs with the direct parameterization $\theta_{s,a} = \pi_\theta(a|s)$ [2]. Due to this nonconvexity, $O(1/T)$ global convergence of policy gradient-based methods was established only recently for single-objective MDPs with exact gradients [2,21]. These breakthrough results have motivated the study of policy optimization for multi-objective MDPs, e.g., smooth concave scalarization [5] and constrained MDPs (CMDPs) [11,31].

However, in the exact-gradient setting, previous approaches for multi-objective MDPs either suffer from a slow provable $O(1/\sqrt{T})$ global convergence rate [11] or require extra assumptions [37,33,18]. Compactness of $\Theta$ is assumed in [37], but this assumption rules out the very common softmax parameterization, where $\Theta = \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. NPG-based methods have been analyzed in [33,18] under an ergodicity assumption, but such an assumption is not required for NPG in single-objective MDPs [2], and therefore appears artificial.
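For concreteness, the two parameterizations referenced above can be written in their standard forms as follows; this is a brief sketch using notation assumed here rather than defined in this excerpt, with $\mathcal{S}$ and $\mathcal{A}$ the state and action sets and $\Delta(\mathcal{A})$ the probability simplex over $\mathcal{A}$:
$$
\text{direct:}\quad \pi_\theta(a \mid s) = \theta_{s,a},\ \ \theta_{s,\cdot} \in \Delta(\mathcal{A})\ \ \forall s \in \mathcal{S};
\qquad
\text{softmax:}\quad \pi_\theta(a \mid s) = \frac{\exp(\theta_{s,a})}{\sum_{a' \in \mathcal{A}} \exp(\theta_{s,a'})},\ \ \theta \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}.
$$
The direct parameterization constrains each $\theta_{s,\cdot}$ to the simplex, whereas the softmax parameterization places no constraint on $\theta$, which is why compactness assumptions on $\Theta$ exclude it.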