Safe Exploration for Constrained Reinforcement Learning with Provable Guarantees

Bura, Archana; HasanzadeZonuzy, Aria; Kalathil, Dileep; Shakkottai, Srinivas; Chamberland, Jean-Francois

doi:10.48550/arxiv.2112.00885

Cited by 4 publications

(5 citation statements)

References 15 publications


“…The Slater's condition is commonly assumed in previous works to ensure the LP problem has strong duality, see proofs in (Paternain et al 2019, 2022). Unlike (Ding et al 2021; Efroni, Mannor, and Pirotta 2020), which assume δ is known, and (Achiam et al 2017; Liu et al 2021a; Bura et al 2021), where a strictly feasible policy is given, our assumption is less restrictive. Let…”

Section: Assumption 1 (Slater's Condition) (mentioning)

confidence: 98%
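
For orientation, the condition discussed in the statement above can be sketched in generic notation. The symbols are assumptions chosen for illustration and are not taken from the citing paper: $V_r^{\pi}$ is the reward value of policy $\pi$, $V_c^{\pi}$ its cost value, and $\tau$ the constraint threshold.

```latex
\begin{aligned}
&\text{CMDP:} && \max_{\pi} \; V_r^{\pi} \quad \text{s.t.} \quad V_c^{\pi} \le \tau, \\
&\text{Slater's condition:} && \exists\, \bar{\pi},\ \delta > 0 \ \text{ such that } \ V_c^{\bar{\pi}} \le \tau - \delta .
\end{aligned}
```

In this notation, "δ is known" would mean the slack $\delta$ is given to the learner, while "a strictly feasible policy is given" would mean $\bar{\pi}$ itself is supplied.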

“…Two very recent works (Liu et al 2021a; Wei, Liu, and Ying 2021) show that sublinear regret bound and zero violation are possible for episodic CMDPs without simulators. In particular, (Liu et al 2021a) proposes a model-based algorithm and (Wei, Liu, and Ying 2022) presents a model-free algorithm, and (Bura et al 2021) proves that it is possible to achieve zero violation during training given a safe baseline policy based on a model-based approach. Despite these significant developments, the following question is still open:…”

Section: Introduction (mentioning)

confidence: 99%


A Provably-Efficient Model-Free Algorithm for Infinite-Horizon Average-Reward Constrained Markov Decision Processes

Wei, Liu, Ying. 2022. AAAI.

This paper presents a model-free reinforcement learning (RL) algorithm for infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs). Considering a learning horizon K, which is sufficiently large, the proposed algorithm achieves sublinear regret and zero constraint violation. The bounds depend on the number of states S, the number of actions A, and two constants which are independent of the learning horizon K.


“…Safe RL, especially those with expected cumulative constraints, has been extensively studied under model-free approaches (Wei, Liu, and Ying 2022b,a; Wei et al 2023; Ghosh, Zhou, and Shroff 2022), and model-based approaches (Ding et al 2021; Liu et al 2021a; Bura et al 2021; Singh, Gupta, and Shroff 2020; Ding et al 2021; Chen, Jain, and Luo 2022). There are also many works (Liu, Jiang, and Li 2022; Wu et al 2018; Caramanis, Dimitrov, and Morton 2014) that have studied the knapsack constraints, wherein the learning process stops whenever the budget has run out.…”

Section: Related Work (mentioning)

confidence: 99%

“…RL with instantaneous constraints addresses this need by introducing constraints that the agent must adhere to at every single time step during the learning process. Unlike constraints imposed on the entire trajectory or episode (Wei, Liu, and Ying 2022a,b; Ghosh, Zhou, and Shroff 2022; Ding et al 2021; Liu et al 2021a; Bura et al 2021; Wei et al 2023; Singh, Gupta, and Shroff 2020; Ding et al 2021; Chen, Jain, and Luo 2022; Efroni, Mannor, and Pirotta 2020), instantaneous constraints demand strict compliance with specified limitations at each moment of decision-making so that unsafe actions should be avoided at each step. For instance, in autonomous vehicles, RL agents must consistently adhere to traffic rules and avoid dangerous maneuvers in any time to ensure safety.…”

Section: Introduction (mentioning)

confidence: 99%

Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration

Wei, Liu, Ying. 2024. AAAI.

This paper studies safe Reinforcement Learning (safe RL) with linear function approximation and under hard instantaneous constraints where unsafe actions must be avoided at each step. Existing studies have considered safe RL with hard instantaneous constraints, but their approaches rely on several key assumptions: (i) the RL agent knows a safe action set for every state or knows a safe graph in which all the state-action-state triples are safe, and (ii) the constraint/cost functions are linear. In this paper, we consider safe RL with instantaneous hard constraints without assumption (i) and generalize (ii) to Reproducing Kernel Hilbert Space (RKHS). Our proposed algorithm, LSVI-AE, achieves O(√(d³H⁴K)) regret and O(H√(dK)) hard constraint violation when the cost function is linear and O(Hγ_K√K) hard constraint violation when the cost function belongs to RKHS. Here K is the learning horizon, H is the length of each episode, and γ_K is the information gain w.r.t. the kernel used to approximate cost functions. Our results achieve the optimal dependency on the learning horizon K, matching the lower bound we provide in this paper and demonstrating the efficiency of LSVI-AE. Notably, the design of our approach encourages aggressive policy exploration, providing a unique perspective on safe RL with general cost functions and no prior knowledge of safe actions, which may be of independent interest.


“…The control-based approaches, e.g., [6,8,24], adopt barrier functions to keep the agent away from unsafe regions and impose hard constraints. There are also other works which consider zero-violation [16,7], and step-wise constraints with the assumption on safe actions [3].…”

Section: Related Work (mentioning)

confidence: 99%

Provably Safe Reinforcement Learning with Step-wise Violation Constraints

Nuoya, Du, Huang. 2023. Preprint.

In this paper, we investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we consider stricter step-wise violation constraints and do not assume the existence of safe actions, making our formulation more suitable for safety-critical applications which need to ensure safety in all decision steps and may not always possess safe actions, e.g., robot control and autonomous driving. We propose a novel algorithm SUCBVI, which guarantees O(√(ST)) step-wise violation and O(√(H³SAT)) regret. Lower bounds are provided to validate the optimality in both violation and regret performance with respect to S and T. Moreover, we further study a novel safe reward-free exploration problem with step-wise violation constraints. For this problem, we design an (ε, δ)-PAC algorithm SRF-UCRL, which achieves nearly state-of-the-art sample complexity O((S²AH²/ε + H⁴SA/ε²)(log(1/δ) + S)), and guarantees O(√(ST)) violation during the exploration. The experimental results demonstrate the superiority of our algorithms in safety performance, and corroborate our theoretical results.
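
To get a feel for how the sample-complexity expression quoted in the abstract above scales, the following is a minimal Python sketch that evaluates (S²AH²/ε + H⁴SA/ε²)(log(1/δ) + S) with all constants and hidden logarithmic factors dropped. The function name and argument names are hypothetical, chosen only for illustration; this is not code from the paper and says nothing about the actual constants in the bound.

```python
import math


def srf_ucrl_bound_shape(S, A, H, eps, delta):
    """Evaluate the shape of the quoted bound, ignoring constants.

    Computes (S^2 * A * H^2 / eps + H^4 * S * A / eps^2) * (log(1/delta) + S).
    This is an illustrative helper (hypothetical name), not the paper's code.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not (0 < delta < 1):
        raise ValueError("delta must lie strictly between 0 and 1")
    # Term that scales as 1/eps.
    first = S**2 * A * H**2 / eps
    # Term that scales as 1/eps^2 and dominates for small eps.
    second = H**4 * S * A / eps**2
    return (first + second) * (math.log(1.0 / delta) + S)


if __name__ == "__main__":
    # Compare two accuracy levels for a small, made-up problem size.
    for eps in (0.1, 0.05):
        print(eps, srf_ucrl_bound_shape(S=10, A=4, H=5, eps=eps, delta=0.05))
```

Halving ε roughly quadruples the value once the 1/ε² term dominates, which is the usual behavior of PAC-style bounds.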
