2021
DOI: 10.48550/arxiv.2110.11383
Preprint

Finite-Time Complexity of Online Primal-Dual Natural Actor-Critic Algorithm for Constrained Markov Decision Processes

Abstract: We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Mar…

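As a rough sketch of the setup described in the abstract (the notation below is illustrative and not taken from the paper itself), the constrained problem and the Lagrangian that a primal-dual method operates on can be written as

$$
\max_{\pi}\; V_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
V_{g_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, g_i(s_t, a_t)\Big] \ge b_i, \quad i = 1,\dots,m,
$$

$$
L(\pi, \lambda) = V_r(\pi) + \sum_{i=1}^{m} \lambda_i\,\big(V_{g_i}(\pi) - b_i\big), \qquad \lambda_i \ge 0.
$$

An online primal-dual natural actor-critic of the kind described above interleaves, along a single trajectory, a critic estimate of the relevant value functions, a natural-gradient ascent step on L in the policy parameters, and a projected descent step on the multipliers λ that keeps them in the nonnegative orthant.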
Cited by 1 publication (1 citation statement)
References 39 publications
“…It is relevant to examine whether similar techniques can improve convergence of single-time scale primal-dual algorithms and if constraint violation can be reduced to zero (Bai et al., 2022). Other open issues include addressing sample efficiency of policy gradient primal-dual algorithms in the presence of strategic exploration (Agarwal et al., 2020; Zanette et al., 2021; Zeng et al., 2021), reuse of off-policy samples, examining robustness against adversaries, as well as offline policy optimization for constrained MDPs.…”
Section: Discussion (mentioning)
confidence: 99%