2021
DOI: 10.48550/arxiv.2106.02684
Preprint

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Abstract: We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of Õ(√K) while allowing an Õ(√K) constraint violation in K episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with …
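For context, a standard formalization of the two quantities in the abstract, under the common convention that each episode's expected cumulative cost must stay below a budget (this notation is a reconstruction, not necessarily the paper's own):

\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V_r^{\pi^\ast}(s_1) - V_r^{\pi_k}(s_1) \right),
\qquad
\mathrm{Violation}(K) \;=\; \sum_{k=1}^{K} \left( V_c^{\pi_k}(s_1) - \bar{c} \right)_{+},
\]

where \(\pi^\ast\) is the best feasible policy, \(\pi_k\) is the policy played in episode \(k\), \(V_r^{\pi}\) and \(V_c^{\pi}\) are the reward and cost value functions of \(\pi\), \(\bar{c}\) is the cost budget, and \((x)_+ = \max(x, 0)\). The abstract's claim is that both quantities can be kept at \(\tilde{O}(\sqrt{K})\), and that the violation can be driven to zero when a strictly safe policy (one satisfying the constraint with positive slack) is known in advance.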

Cited by 4 publications (18 citation statements)
References 13 publications
“…The study of RL algorithms for CMDPs has received considerable attention due to the safety requirement (Altman, 1999; Paternain et al., 2019; Yu et al., 2019; Dulac-Arnold et al., 2019; García & Fernández, 2015). Our work is closely related to Lagrangian-based CMDP algorithms with optimistic policy evaluations (Efroni et al., 2020; Singh et al., 2020; Ding et al., 2021; Liu et al., 2021; Qiu et al., 2020)…”
Section: Related Work
confidence: 99%
“…Ding et al. (2021) generalize the above results to linear kernel CMDPs. Under some mild conditions and additional computation cost, Liu et al. (2021) propose two algorithms that learn policies with zero or bounded constraint violation for CMDPs. Beyond the stationary CMDP, Qiu et al. (2020) consider online CMDPs where only the rewards in the objective can vary over episodes…”
Section: Related Work
confidence: 99%
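To make the “Lagrangian-based” approach mentioned above concrete, here is a minimal illustrative sketch of a generic primal-dual update for a constrained MDP. It is not the algorithm of any paper cited here, and it omits the optimistic (bonus-based) policy evaluation those works rely on; all names and constants are hypothetical placeholders.

import numpy as np

def lagrangian_q(reward_q, cost_q, lam):
    # Primal objective: combine reward and penalized cost Q-estimates.
    return reward_q - lam * cost_q

def dual_update(lam, episode_cost, cost_budget, eta):
    # Projected dual ascent: raise the multiplier when the constraint
    # is violated, lower it (down to zero) otherwise.
    return max(0.0, lam + eta * (episode_cost - cost_budget))

# Toy usage for a single state with two actions (all numbers made up).
reward_q = np.array([1.0, 0.4])   # estimated reward Q-values
cost_q = np.array([0.9, 0.1])     # estimated cost Q-values
lam, cost_budget, eta = 0.5, 0.3, 0.1

action = int(np.argmax(lagrangian_q(reward_q, cost_q, lam)))  # greedy primal step
episode_cost = cost_q[action]     # stand-in for the observed episode cost
lam = dual_update(lam, episode_cost, cost_budget, eta)
print(f"action={action}, new multiplier={lam:.2f}")

In the cited works, the “optimistic” part replaces the plain Q-estimates above with upper confidence bounds, which is what enables the Õ(√K) regret and violation guarantees.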