2021
DOI: 10.48550/arxiv.2112.00885
Preprint

Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees

Abstract: We consider the problem of learning an episodic safe control policy that minimizes an objective function while satisfying necessary safety constraints, both during learning and deployment. We formulate this safety-constrained reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function. Here, we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all…
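The truncated abstract maps onto the standard finite-horizon CMDP program. The following is a sketch of that formulation; the horizon $H$, objective $\ell_h$, cost $c_h$, and budget $\bar{C}$ are assumed notation, not necessarily the paper's own:

$$\min_{\pi} \ \mathbb{E}^{\pi}\Big[\sum_{h=1}^{H} \ell_h(s_h, a_h)\Big] \quad \text{s.t.} \quad \mathbb{E}^{\pi}\Big[\sum_{h=1}^{H} c_h(s_h, a_h)\Big] \le \bar{C},$$

where episodes evolve as $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$ with $P_h$ unknown to the learner, and, per the abstract, the constraint must hold for the policies used during learning as well as the one finally deployed.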

Cited by 4 publications (5 citation statements) | References 15 publications
“…Slater's condition is commonly assumed in previous works to ensure that the LP problem has strong duality; see the proofs in (Paternain et al. 2019, 2022). Unlike (Ding et al. 2021; Efroni, Mannor, and Pirotta 2020), which assume δ is known, and (Achiam et al. 2017; Liu et al. 2021a; Bura et al. 2021), where a strictly feasible policy is given, our assumption is less restrictive. Let…”
Section: Assumption 1 (Slater's Condition)
Citation type: mentioning
confidence: 98%
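As a reading aid, the strict-feasibility (Slater) condition referenced in this quote is usually stated as follows; the slack symbol $\delta$ matches the quote, while the remaining notation is assumed:

$$\exists\, \bar{\pi} \text{ and } \delta > 0: \quad \mathbb{E}^{\bar{\pi}}\Big[\sum_{h=1}^{H} c_h(s_h, a_h)\Big] \le \bar{C} - \delta,$$

which guarantees strong duality for the LP form of the CMDP. The distinction the quote draws is whether $\delta$, or a strictly feasible policy $\bar{\pi}$ itself, must be known to the algorithm in advance.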
“…Two very recent works (Liu et al. 2021a; Wei, Liu, and Ying 2021) show that a sublinear regret bound and zero violation are possible for episodic CMDPs without simulators. In particular, (Liu et al. 2021a) proposes a model-based algorithm, (Wei, Liu, and Ying 2022) presents a model-free algorithm, and (Bura et al. 2021) proves that it is possible to achieve zero violation during training, given a safe baseline policy, using a model-based approach. Despite these significant developments, the following question is still open:…”
Section: Introduction
Citation type: mentioning
confidence: 99%
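For context, "sublinear regret" and "zero violation" in this quote are typically measured over $K$ episodes; the following is one common convention (the cited works differ in details), with $V^{\pi}$ and $C^{\pi}$ denoting the expected cumulative objective and cost of policy $\pi$:

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^{\pi_k}(s_1) - V^{\pi^*}(s_1) \Big), \qquad \mathrm{Violation}(K) = \sum_{k=1}^{K} \Big( C^{\pi_k}(s_1) - \bar{C} \Big)_{+},$$

where $\pi_k$ is the policy played in episode $k$ and zero violation means $\mathrm{Violation}(K) = 0$ throughout training.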
“…Safe RL, especially with expected cumulative constraints, has been extensively studied under model-free approaches (Wei, Liu, and Ying 2022b,a; Wei et al. 2023; Ghosh, Zhou, and Shroff 2022) and model-based approaches (Ding et al. 2021; Liu et al. 2021a; Bura et al. 2021; Singh, Gupta, and Shroff 2020; Chen, Jain, and Luo 2022). There are also many works (Liu, Jiang, and Li 2022; Wu et al. 2018; Caramanis, Dimitrov, and Morton 2014) that have studied knapsack constraints, wherein the learning process stops whenever the budget has run out.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
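The knapsack-style constraint mentioned here differs from an expected cumulative constraint in that the budget is hard. Writing $B$ for the total budget and $c_i$ for the per-step cost (assumed symbols), learning halts at the stopping time

$$\tau = \min\Big\{\, t : \sum_{i \le t} c_i > B \,\Big\},$$

so the budget is consumed across the whole run rather than enforced in expectation per episode.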
“…RL with instantaneous constraints addresses this need by introducing constraints that the agent must adhere to at every single time step during the learning process. Unlike constraints imposed on the entire trajectory or episode (Wei, Liu, and Ying 2022a,b; Ghosh, Zhou, and Shroff 2022; Ding et al. 2021; Liu et al. 2021a; Bura et al. 2021; Wei et al. 2023; Singh, Gupta, and Shroff 2020; Chen, Jain, and Luo 2022; Efroni, Mannor, and Pirotta 2020), instantaneous constraints demand strict compliance with the specified limits at each moment of decision-making, so that unsafe actions are avoided at every step. For instance, in autonomous vehicles, RL agents must consistently adhere to traffic rules and avoid dangerous maneuvers at all times to ensure safety.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
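The contrast drawn in this quote can be summarized as follows (notation assumed): episodic constraints bound an expectation, $\mathbb{E}^{\pi}\big[\sum_{h=1}^{H} c(s_h, a_h)\big] \le \bar{C}$, whereas instantaneous constraints require $c(s_h, a_h) \le \bar{c}$ at every step $h$ of every episode, with no averaging across time or trajectories.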
“…Control-based approaches, e.g., [6, 8, 24], adopt barrier functions to keep the agent away from unsafe regions and impose hard constraints. There are also other works that consider zero violation [16, 7] and step-wise constraints under an assumption on safe actions [3].…”
Section: Related Work
Citation type: mentioning
confidence: 99%
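As a sketch of the barrier-function idea in this quote (one common discrete-time form, not necessarily the one used in [6, 8, 24]): with a safe set $\mathcal{C} = \{ s : h(s) \ge 0 \}$, the controller is required to satisfy

$$h(s_{t+1}) \ge (1 - \eta)\, h(s_t) \quad \text{for some fixed } \eta \in (0, 1]$$

at every step, which keeps $\mathcal{C}$ forward invariant and thereby enforces the hard constraint of never entering the unsafe region.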