2021
DOI: 10.48550/arxiv.2105.14125
Preprint

Joint Optimization of Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm

Abstract: Many engineering problems have multiple objectives, and the overall aim is to optimize a non-linear function of these objectives. In this paper, we formulate the problem of maximizing a non-linear concave function of multiple long-term objectives. A policy-gradient based model-free algorithm is proposed for the problem. To compute an estimate of the gradient, a biased estimator is proposed. The proposed algorithm is shown to achieve convergence to within an $\epsilon$ of the global optima after sampling $\mathcal{O}\big(M^4\sigma^2/(1-\gamma)^8\ldots$
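
The setup the abstract describes (gradient ascent on a concave scalarization of several long-term returns, with the gradient estimated from sampled trajectories) can be made concrete with a short sketch. The snippet below is an illustration only, not the paper's algorithm: the tabular softmax policy, the scalarization $f(J)=\sum_i \log J_i$ (which assumes positive returns), the environment API returning a reward vector per step, and all hyperparameter values are assumptions not taken from the paper.

```python
# Minimal sketch (NOT the paper's exact algorithm): policy-gradient ascent on a
# concave scalarization f of several long-term discounted objectives, estimated
# from sampled trajectories. Assumed here (not specified by the paper): tabular
# softmax policy, f(J) = sum_i log J_i, positive rewards, env.step returning a
# reward vector, and placeholder hyperparameters.
import numpy as np


def softmax_policy(theta, s):
    """Action probabilities of a tabular softmax policy at state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()


def sample_trajectory(env, theta, gamma, horizon):
    """Roll out one trajectory; return the per-objective discounted returns and
    the accumulated score function sum_t grad_theta log pi(a_t | s_t)."""
    s = env.reset()
    returns = np.zeros(env.n_objectives)      # one discounted return per objective
    score = np.zeros_like(theta)
    discount = 1.0
    for _ in range(horizon):
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        score[s] += np.eye(len(p))[a] - p     # grad of log-softmax w.r.t. logits of s
        s, r_vec, done = env.step(a)          # r_vec: reward vector, one entry per objective
        returns += discount * np.asarray(r_vec)
        discount *= gamma
        if done:
            break
    return returns, score


def joint_policy_gradient(env, n_states, n_actions, gamma=0.99, lr=0.05,
                          batch=16, iters=200, horizon=200):
    """Maximize f(J(theta)) with f(J) = sum_i log J_i by stochastic gradient ascent."""
    theta = np.zeros((n_states, n_actions))
    for _ in range(iters):
        returns, scores = zip(*(sample_trajectory(env, theta, gamma, horizon)
                                for _ in range(batch)))
        J_hat = np.maximum(np.mean(returns, axis=0), 1e-6)  # sample estimate of the objectives
        w = 1.0 / J_hat                                     # f'(J_hat) for f = sum(log)
        # Chain rule: grad f(J) = sum_i w_i * grad J_i, with each grad J_i estimated by
        # REINFORCE as score * G_i. Plugging the sample estimate J_hat into f' makes the
        # overall gradient estimate biased, echoing the biased estimator mentioned above.
        grad = sum(score * float(w @ G) for G, score in zip(returns, scores)) / batch
        theta += lr * grad
    return theta
```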

Cited by 1 publication (4 citation statements)
References 22 publications
“…Global convergence of PG-based approaches in the multi-objective MDPs has been previously studied. For smooth concave scalarization, Bai et al. [5] showed an $O(1/\epsilon^4)$ sample complexity (to achieve $\epsilon$-optimal in expectation) of the policy-gradient method under sample-based scenarios. However, with exact gradients, we are unaware of works with fast $\tilde{O}(1/T)$ convergence.…”
Section: Related Work
Confidence: 99%
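
A quick reading of that rate, restated rather than quoted from either paper: an $O(1/\epsilon^4)$ sample complexity means that after $T$ sampled trajectories the achievable accuracy scales as

$$T = O\!\left(1/\epsilon^{4}\right) \;\Longleftrightarrow\; \epsilon = O\!\left(T^{-1/4}\right),$$

which is the sense in which the exact-gradient $\tilde{O}(1/T)$ rate discussed in the excerpt would be faster.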
“…has zero gradient at $\theta = \theta_k$. The update in (5) reduces to an NPG update on the unregularized value function $\tilde{V}^{\pi_\theta}_{r_k}(\rho)$. For single-objective MDPs, it reduces to the canonical NPG method.…”
Section: Notations
Confidence: 99%
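
For context, the canonical NPG update the excerpt refers to has the standard form (a textbook statement, not a quote from the citing paper), with step size $\eta$, initial-state distribution $\rho$, and Fisher information matrix $F_\rho$:

$$\theta_{k+1} = \theta_k + \eta\, F_\rho(\theta_k)^{\dagger}\, \nabla_\theta V^{\pi_{\theta_k}}(\rho), \qquad F_\rho(\theta) = \mathbb{E}_{s \sim d_\rho^{\pi_\theta},\, a \sim \pi_\theta(\cdot\mid s)}\!\left[\nabla_\theta \log \pi_\theta(a\mid s)\, \nabla_\theta \log \pi_\theta(a\mid s)^{\top}\right],$$

where $\dagger$ denotes the Moore–Penrose pseudoinverse.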