There is a need for controllers that are more flexible in their design process. Additionally, complex systems are often difficult to model and validate. Consequently, controllers that can be derived using data-driven methods are preferred in such cases. Advanced sensor-based controllers exist, such as incremental dynamic inversion controllers [1], to alleviate the burden of modelling complex systems. However, these methods suffer from state reconstruction dependencies and synchronisation issues with the sensor data. Another approach to data-driven controller design resides in machine learning paradigms such as Reinforcement Learning (RL). The fundamental principle of RL is the representation of the world as an agent confronted with a choice of action. The agent learns a control policy by interacting with the environment and gaining experience of the dynamics over time. This basic principle is simple yet effective. However, as task complexity increases (i.e., as the state and action spaces grow), RL agents tend to struggle to learn a policy reliably [2]. Curriculum Learning (CurL), introduced in [2], provides a structured approach that enables learning on more complex applications by dividing the initial task into sub-tasks [3, 4]. This facilitates the agent's learning process and increases the likelihood of successfully finding a control policy [5]. Given the examples cited previously, particularly in transport applications where stringent (safety) requirements apply, the safety of the learning process and the correct operation of the controller are of crucial importance. Unlike RL methods, which in their simplest forms generally lack consideration of safety [6], Safe Learning (SL) provides a framework to this end [7].

The research outlined in this paper proposes a safe curriculum learning architecture that builds on the research presented in [8]. Here, the dependency on knowledge about an uncertain model for the safety algorithm is removed by complementing the paradigm in [8] with a system identification capability.

First, a brief introduction to the fields of RL, Curriculum Learning, Safe Learning, and system identification is provided in Sections II.A, II.B, II.C and II.D, respectively. This is followed by a detailed presentation of the approach chosen in this research, outlined in Section III. Finally, the proposed paradigm is tested through two experiments. Initially, a Mass-Spring-Damper (MSD) system is used to verify the architecture; the results are presented in Section IV.A. In Section IV.B, the results of the safe curriculum architecture applied to a quadrotor are outlined. The paper closes with a discussion of the experimental results, followed by conclusions and recommendations for further research.
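To make the agent-environment interaction described above concrete, the sketch below implements a tabular Q-learning loop on a small illustrative grid environment. The environment, hyperparameter values, and episode count are purely illustrative assumptions introduced here for exposition; they are not part of the architecture proposed in this paper.

```python
import numpy as np

# Minimal illustrative environment: a 1-D grid in which the agent moves
# left or right and is rewarded for reaching the rightmost state.
class GridEnv:
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.state = int(np.clip(self.state + (1 if action == 1 else -1),
                                 0, self.n_states - 1))
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# Tabular Q-learning: the agent improves its policy purely from
# interaction experience, without a model of the dynamics.
env = GridEnv()
q = np.zeros((env.n_states, 2))
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # illustrative hyperparameters

for episode in range(200):
    s, done = env.reset(), False
    while not done:
        # Epsilon-greedy action selection.
        a = np.random.randint(2) if np.random.rand() < epsilon else int(np.argmax(q[s]))
        s_next, r, done = env.step(a)
        # Temporal-difference update toward the bootstrapped target.
        q[s, a] += alpha * (r + gamma * np.max(q[s_next]) * (not done) - q[s, a])
        s = s_next

print(np.argmax(q, axis=1))  # greedy policy learned from experience
```

In this sketch the policy is obtained solely from interaction data, mirroring the model-free character of RL described above; the safety and curriculum aspects discussed in the remainder of the paper are deliberately omitted.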
II. Safe Curriculum Learning Framework

The core principles in safe curriculum learning are derived from three research fields: reinforcement learning, curriculum learning and safe learning. Inherently, the fundamentals originate from th...