2022
DOI: 10.48550/arxiv.2201.11965

Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints

Abstract: We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which play a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environm…
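For reference, variation budgets of this kind are usually defined as cumulative maxima of per-episode changes. A common form from the non-stationary RL literature is sketched below; the exact definition used in the paper may differ, and an analogous budget applies to the utility (constraint) functions.

\[
B_r \;=\; \sum_{t=1}^{T-1} \max_{s,a}\, \bigl| r_{t+1}(s,a) - r_t(s,a) \bigr|,
\qquad
B_p \;=\; \sum_{t=1}^{T-1} \max_{s,a}\, \bigl\| P_{t+1}(\cdot \mid s,a) - P_t(\cdot \mid s,a) \bigr\|_1 .
\]

The algorithm is then allowed to use these known budgets, for example when tuning restart or window lengths, as in the restart-based approaches discussed in the citation statements below.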

Cited by 4 publications (6 citation statements)
References 20 publications
“…Different from the general concave case (18), the bound (21) does not contain the constant error term O(ε). Thus, by choosing η_2 = T^{-1/2}, the average performance has the order O(T^{-1/2}).…”
Section: Assumption 4.1 (Parameterization)
Mentioning, confidence: 99%
“…CMDP: Our work is also pertinent to policy-based CMDP algorithms [10, 19-23]. In particular, [13] develops a natural policy gradient-based primal-dual algorithm and shows that it enjoys an O(T^{-1/2}) global convergence rate regarding both the optimality gap and the constraint violation under the soft-max parameterization.…”
Section: Related Work
Mentioning, confidence: 99%
“…CMDP: Our work is also pertinent to policy-based CMDP algorithms (Altman 1999; Borkar 2005; Achiam et al. 2017; Ding and Lavaei 2022; Chow et al. 2017; Efroni, Mannor, and Pirotta 2020). In particular, Ding et al. (2020) develops a natural policy gradient-based primal-dual algorithm and shows that it enjoys an O(T^{-1/2}) global convergence rate regarding both the optimality gap and the constraint violation under the standard soft-max parameterization.…”
Section: Related Work
Mentioning, confidence: 99%
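As a rough illustration of the primal-dual scheme these statements describe, the sketch below pairs gradient ascent on a soft-max policy (primal) with projected gradient descent on a Lagrange multiplier (dual) in a tiny tabular CMDP. The environment, step sizes, and constraint threshold are hypothetical placeholders; this is a toy sketch of the general idea, not the algorithm analyzed in the cited papers.

import numpy as np

# Toy tabular CMDP: maximize reward subject to E[total utility] >= b * horizon.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, T = 4, 3, 10, 2000
b = 0.5                                  # per-step constraint threshold (assumed)
eta_theta, eta_lam = 0.1, 0.05           # primal / dual step sizes (assumed)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transitions
r = rng.uniform(size=(n_states, n_actions))                       # reward
g = rng.uniform(size=(n_states, n_actions))                       # utility

theta = np.zeros((n_states, n_actions))  # soft-max policy parameters
lam = 0.0                                # Lagrange multiplier

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def rollout(pi):
    """Sample one episode; return (s, a) visit counts, total reward, total utility."""
    s, visits, R, G = 0, np.zeros((n_states, n_actions)), 0.0, 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        visits[s, a] += 1
        R += r[s, a]
        G += g[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return visits, R, G

for t in range(T):
    pi = policy(theta)
    visits, R, G = rollout(pi)
    # REINFORCE estimate of the Lagrangian gradient under soft-max parameterization:
    # sum_t grad log pi(a_t|s_t) has entry visits[s, a] - (visits to s) * pi[s, a].
    score = visits - visits.sum(axis=1, keepdims=True) * pi
    theta += eta_theta * (R + lam * (G - b * horizon)) * score   # primal ascent
    lam = max(0.0, lam - eta_lam * (G - b * horizon))            # dual projected descent

Taking both step sizes on the order of T^{-1/2} mirrors the tuning behind the O(T^{-1/2}) rates quoted above, although those guarantees apply to the cited algorithms and analyses, not to this sketch.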
“…Non-stationary RL has been mostly studied in the risk-neutral setting. When the variation budget is known a priori, a common strategy for adapting to the non-stationarity is to follow the forgetting principle, such as the restart strategy (Mao et al. 2020; Zhou et al. 2020; Zhao et al. 2020; Ding and Lavaei 2022), exponentially decayed weights (Touati and Vincent 2020), or a sliding window (Cheung, Simchi-Levi, and Zhu 2020; Zhong, Yang, and Szepesvári 2021). In this work, we focus on the restart method, mainly due to its simplicity and memory efficiency (Zhao et al. 2020), and generalize it to the risk-sensitive RL setting.…”
Section: Related Work
Mentioning, confidence: 99%
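To make the restart principle mentioned above concrete, here is a minimal sketch in which the learner is rebuilt from scratch on a fixed schedule derived from the known horizon T and variation budget B. The function names and the (T/B)^{2/3}-style epoch length are illustrative assumptions; the exact tuning and restart condition differ across the cited works.

import math

def restart_schedule(T, B):
    """Episodes at which the learner resets, spaced by a budget-dependent epoch length."""
    epoch_len = max(1, math.ceil((T / max(B, 1e-8)) ** (2 / 3)))   # illustrative tuning
    return set(range(0, T, epoch_len))

def run_with_restarts(T, B, new_learner, run_episode):
    """new_learner() builds a fresh learner; run_episode(learner, t) runs episode t."""
    restarts = restart_schedule(T, B)
    learner = None
    for t in range(T):
        if t in restarts:
            learner = new_learner()   # forget all past estimates (the forgetting principle)
        run_episode(learner, t)

# Example: with T = 1000 episodes and budget B = 5, the schedule restarts roughly
# every (1000 / 5)^(2/3) ≈ 35 episodes, so older, possibly stale data is discarded.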