2018
DOI: 10.48550/arxiv.1801.08757
Preprint

Safe Exploration in Continuous Action Spaces

Abstract: We address the problem of deploying a reinforcement learning (RL) agent on a physical system such as a datacenter cooling unit or robot, where critical constraints must never be violated. We show how to exploit the typically smooth dynamics of these systems and enable RL algorithms to never violate constraints during learning. Our technique is to directly add to the policy a safety layer that analytically solves an action correction formulation per each state. The novelty of obtaining an elegant closed-form solution…

Cited by 117 publications (174 citation statements)
References 16 publications
“…Recent methods in [12,13,24,25] address the safety problem by similar shielding schemes as proposed in this work. However, the major differences lie in how the safety state-action value function, safety critic, is trained and where the backup actions come from.…”
Section: Related Work (mentioning)
Confidence: 99%
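The shielding scheme described in the statement above can be summarized in a minimal sketch: a learned safety critic vets the action proposed by the task policy and falls back to a backup action when the critic flags it as unsafe. The names `safety_critic`, `backup_policy`, and `threshold` are illustrative assumptions, not identifiers from any of the cited papers.

```python
def shielded_action(state, task_policy, safety_critic, backup_policy, threshold=0.0):
    """Generic shielding sketch (illustrative, not any specific cited method).

    safety_critic(state, action) returns an estimated safety score for taking
    `action` in `state`; in this sketch, values above `threshold` are unsafe.
    """
    proposed = task_policy(state)
    if safety_critic(state, proposed) > threshold:
        # Proposed action deemed unsafe: substitute a known-safe backup action.
        return backup_policy(state)
    return proposed
```

The cited approaches differ mainly in how the safety critic is trained and how the backup action is produced, which is exactly the distinction the statement draws.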
“…However, the major differences lie in how the safety state-action value function, safety critic, is trained and where the backup actions come from. Dalal et al [24] assume the safety of systems can be ensured by adjusting the action in a single time step (no long-term effect). Thus, they learn a linear safety-signal model and formulate a quadratic program to find the closest control to the reference control such that the safety constraints are satisfied.…”
Section: Related Work (mentioning)
Confidence: 99%
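The quadratic program mentioned above admits a closed-form correction when at most one constraint is active, which is the core of the paper's safety layer. The sketch below assumes the per-constraint sensitivity vectors g_i(s) of the linear safety-signal model have already been learned; variable names are illustrative.

```python
import numpy as np

def safety_layer(a_ref, c, G, C):
    """Closed-form action correction in the spirit of Dalal et al. [24] (sketch).

    a_ref : proposed (reference) action, shape (d,)
    c     : current constraint-signal values c_i(s), shape (k,)
    G     : learned sensitivity vectors g_i(s), shape (k, d), so that the next
            constraint value is approximated by c_i(s) + G[i] @ a
    C     : constraint limits, shape (k,)

    Solves  min ||a - a_ref||^2  s.t.  c_i(s) + g_i(s)^T a <= C_i  for all i,
    using the closed form that holds when at most one constraint is active.
    """
    # Per-constraint multipliers, clipped at zero when the constraint is inactive.
    lambdas = np.maximum(0.0, (G @ a_ref + c - C) / (np.sum(G * G, axis=1) + 1e-8))
    # Correct along the sensitivity direction of the most-violated constraint.
    i_star = np.argmax(lambdas)
    return a_ref - lambdas[i_star] * G[i_star]
```

With a single constraint (k = 1), this reduces to projecting the proposed action onto the half-space defined by the learned linear constraint, which is why the correction can be applied per state at negligible cost.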
“…Safe RL Our work is broadly related to the safe reinforcement learning and control literature; we refer interested readers to (García and Fernández 2015; Brunke et al. 2021) for surveys on this topic. A popular class of approaches incorporates Lagrangian constraint regularization into the policy updates in policy-gradient algorithms (Achiam et al. 2017; Ray, Achiam, and Amodei 2019; Tessler, Mankowitz, and Mannor 2018; Dalal et al. 2018; Cheng et al. 2019; Zhang, Vuong, and Ross 2020; Chow et al. 2019). These methods build on model-free deep RL algorithms (Schulman et al. 2017b,a), which are often sample-inefficient, and do not guarantee that intermediate policies during training are safe.…”
Section: Related Work (mentioning)
Confidence: 99%
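As a rough illustration of the Lagrangian constraint regularization mentioned in that statement (a generic sketch, not the algorithm of any single cited paper), the cost constraint is folded into the reward via a multiplier that is updated by dual ascent; the function names and the simple update rule here are assumptions for illustration.

```python
def lagrangian_update(lmbda, avg_episode_cost, cost_limit, lr=0.01):
    """Dual-ascent step: raise the multiplier when the measured cost exceeds
    its limit, lower it otherwise, and keep it non-negative."""
    return max(0.0, lmbda + lr * (avg_episode_cost - cost_limit))

def penalized_reward(reward, cost, lmbda):
    # Scalarized signal fed to an off-the-shelf policy-gradient update.
    return reward - lmbda * cost
```

Because the multiplier only reacts after constraint costs have been incurred, intermediate policies during training are not guaranteed to be safe, which is the limitation the statement points out.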
“…Most of these works tackle CMDPs from the perspective of Safe RL, which seeks to minimize the total regret over the cost functions throughout training (Ray, Achiam, and Amodei 2019) and focus on the single-constraint case (Zhang, Vuong, and Ross 2020; Dalal et al. 2018; Calian et al. 2020) or aggregate various types of events under a single constraint (Stooke, Achiam, and Abbeel 2020; Ray, Achiam, and Amodei 2019). In this work, we focus our attention on the potential of CMDPs for precise and intuitive behavior specification and work on the problem of satisfying many constraints simultaneously.…”
Section: Related Work – Constrained Reinforcement Learning (mentioning)
Confidence: 99%