2018
DOI: 10.48550/arxiv.1801.08757
Preprint

Safe Exploration in Continuous Action Spaces

Abstract: We address the problem of deploying a reinforcement learning (RL) agent on a physical system such as a datacenter cooling unit or robot, where critical constraints must never be violated. We show how to exploit the typically smooth dynamics of these systems and enable RL algorithms to never violate constraints during learning. Our technique is to directly add to the policy a safety layer that analytically solves an action correction formulation per each state. The novelty of obtaining an elegant closed-form solution…

Cited by 117 publications (174 citation statements)
References 16 publications
“…Recent methods in [12,13,24,25] address the safety problem by similar shielding schemes as proposed in this work. However, the major differences lie in how the safety state-action value function, safety critic, is trained and where the backup actions come from.…”
Section: Related Work (mentioning)
Confidence: 99%
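The shielding scheme described in the statement above can be summarized in a minimal sketch: a learned safety critic vets the action proposed by the task policy and falls back to a backup action when the critic flags it as unsafe. The names `safety_critic`, `backup_policy`, and `threshold` are illustrative assumptions, not identifiers from any of the cited papers.

```python
def shielded_action(state, task_policy, safety_critic, backup_policy, threshold=0.0):
    """Generic shielding sketch (illustrative, not any specific cited method).

    safety_critic(state, action) returns an estimated safety score for taking
    `action` in `state`; in this sketch, values above `threshold` are unsafe.
    """
    proposed = task_policy(state)
    if safety_critic(state, proposed) > threshold:
        # Proposed action deemed unsafe: substitute a known-safe backup action.
        return backup_policy(state)
    return proposed
```

The cited approaches differ mainly in how the safety critic is trained and how the backup action is produced, which is exactly the distinction the statement draws.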
“…However, the major differences lie in how the safety state-action value function, safety critic, is trained and where the backup actions come from. Dalal et al [24] assume the safety of systems can be ensured by adjusting the action in a single time step (no long-term effect). Thus, they learn a linear safety-signal model and formulate a quadratic program to find the closest control to the reference control such that the safety constraints are satisfied.…”
Section: Related Work (mentioning)
Confidence: 99%
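The quadratic program mentioned above admits a closed-form correction when at most one constraint is active, which is the core of the paper's safety layer. The sketch below assumes the per-constraint sensitivity vectors g_i(s) of the linear safety-signal model have already been learned; variable names are illustrative.

```python
import numpy as np

def safety_layer(a_ref, c, G, C):
    """Closed-form action correction in the spirit of Dalal et al. [24] (sketch).

    a_ref : proposed (reference) action, shape (d,)
    c     : current constraint-signal values c_i(s), shape (k,)
    G     : learned sensitivity vectors g_i(s), shape (k, d), so that the next
            constraint value is approximated by c_i(s) + G[i] @ a
    C     : constraint limits, shape (k,)

    Solves  min ||a - a_ref||^2  s.t.  c_i(s) + g_i(s)^T a <= C_i  for all i,
    using the closed form that holds when at most one constraint is active.
    """
    # Per-constraint multipliers, clipped at zero when the constraint is inactive.
    lambdas = np.maximum(0.0, (G @ a_ref + c - C) / (np.sum(G * G, axis=1) + 1e-8))
    # Correct along the sensitivity direction of the most-violated constraint.
    i_star = np.argmax(lambdas)
    return a_ref - lambdas[i_star] * G[i_star]
```

With a single constraint (k = 1), this reduces to projecting the proposed action onto the half-space defined by the learned linear constraint, which is why the correction can be applied per state at negligible cost.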
“…Safe RL Our work is broadly related to the safe reinforcement learning and control literature; we refer interested readers to (García and Fernández 2015; Brunke et al. 2021) for surveys on this topic. A popular class of approaches incorporates Lagrangian constraint regularization into the policy updates in policy-gradient algorithms (Achiam et al. 2017; Ray, Achiam, and Amodei 2019; Tessler, Mankowitz, and Mannor 2018; Dalal et al. 2018; Cheng et al. 2019; Zhang, Vuong, and Ross 2020; Chow et al. 2019). These methods build on model-free deep RL algorithms (Schulman et al. 2017b,a), which are often sample-inefficient, and do not guarantee that intermediate policies during training are safe.…”
Section: Related Work (mentioning)
Confidence: 99%
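As a rough illustration of the Lagrangian constraint regularization mentioned in that statement (a generic sketch, not the algorithm of any single cited paper), the cost constraint is folded into the reward via a multiplier that is updated by dual ascent; the function names and the simple update rule here are assumptions for illustration.

```python
def lagrangian_update(lmbda, avg_episode_cost, cost_limit, lr=0.01):
    """Dual-ascent step: raise the multiplier when the measured cost exceeds
    its limit, lower it otherwise, and keep it non-negative."""
    return max(0.0, lmbda + lr * (avg_episode_cost - cost_limit))

def penalized_reward(reward, cost, lmbda):
    # Scalarized signal fed to an off-the-shelf policy-gradient update.
    return reward - lmbda * cost
```

Because the multiplier only reacts after constraint costs have been incurred, intermediate policies during training are not guaranteed to be safe, which is the limitation the statement points out.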
“…Most of these works tackle CMDPs from the perspective of Safe RL, which seeks to minimize the total regret over the cost functions throughout training (Ray, Achiam, and Amodei 2019) and focus on the single-constraint case (Zhang, Vuong, and Ross 2020; Dalal et al. 2018; Calian et al. 2020) or aggregate various types of events under a single constraint (Stooke, Achiam, and Abbeel 2020; Ray, Achiam, and Amodei 2019). In this work, we focus our attention on the potential of CMDPs for precise and intuitive behavior specification and work on the problem of satisfying many constraints simultaneously.…”
Section: Related Work – Constrained Reinforcement Learning (mentioning)
Confidence: 99%