2022
DOI: 10.1007/s10994-022-06187-8

Safety-constrained reinforcement learning with a distributional safety critic

Abstract: Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation while neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor-Critic (WCSAC). […]
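The contrast drawn in the abstract can be written out explicitly. Below is a minimal formulation sketch, assuming standard constrained-MDP notation with reward r, safety cost c, cost budget d, discount γ, and risk level α; the tail constraint is shown as a CVaR-style bound, which is the flavor of constraint this line of work targets, not necessarily the exact objective of the paper.

```latex
% Expectation-constrained RL: only the mean of the cumulative safety cost is bounded.
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_t \gamma^t r(s_t,a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_t \gamma^t c(s_t,a_t)\Big] \le d .

% Tail-constrained variant: bound the conditional value-at-risk of the cost return,
% i.e. the average of the worst (1-\alpha) fraction of outcomes, so rare but
% prohibitively large safety costs are also controlled.
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\Big[\textstyle\sum_t \gamma^t r(s_t,a_t)\Big]
\quad \text{s.t.} \quad
\mathrm{CVaR}_{\alpha}\Big[\textstyle\sum_t \gamma^t c(s_t,a_t)\Big] \le d .
```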

Cited by 18 publications (26 citation statements). References 19 publications.
“…Our approach adopts the policy reuse strategy that directly leverages a guide policy to sample trajectories, which facilitates rapid adaptation to a new task (Rosman et al., 2016; Fernández and Veloso, 2006). This strategy leads to better initial trajectories and improves the jump-start by providing a strong initial point for the learning algorithm (Yang et al., 2022). The guide policy can take the form of a rule-based policy, expert policy, or well-trained policy (Ayeelyan et al., 2022).…”
Section: Policy Reuse
confidence: 99%
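A minimal sketch of the policy-reuse idea described in the quote above, assuming a classic Gym-style environment API; `guide_policy`, `learner_policy`, and the mixing coefficient `beta` are illustrative names rather than the cited authors' interface.

```python
import random

def rollout_with_guide(env, guide_policy, learner_policy, beta, max_steps=1000):
    """Collect one episode while mixing a guide policy with the learning policy.

    With probability `beta` the action comes from the guide policy, otherwise
    from the learner; annealing `beta` toward 0 over training hands control to
    the learner while the guide supplies strong initial trajectories.
    """
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        if random.random() < beta:
            action = guide_policy(state)    # reuse the rule-based/expert/pre-trained guide
        else:
            action = learner_policy(state)  # act with the policy being learned
        next_state, reward, done, info = env.step(action)
        trajectory.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return trajectory
```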
“…These approaches mainly differ in terms of the way they parameterize the return distribution and the distance metric that is used to measure the difference between two distributions. In this work, we follow the authors of [Yang et al. 2023] and utilize the implicit quantile network (IQN) [Dabney et al. 2018a] to approximate the cost return distribution.…”
Section: Distributional Reinforcement Learning
confidence: 99%
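As a rough illustration of what an IQN-style critic for the cost return distribution looks like, here is a PyTorch sketch: sampled quantile fractions are embedded with cosine features and multiplied into the state-action embedding, so a single network represents the whole quantile function. Layer sizes, the cosine embedding dimension, and all names are assumptions for illustration, not the architecture used in the cited work.

```python
import math
import torch
import torch.nn as nn

class ImplicitQuantileCostCritic(nn.Module):
    """IQN-style critic sketch for the cost return distribution."""

    def __init__(self, obs_dim, act_dim, hidden=256, n_cos=64):
        super().__init__()
        self.n_cos = n_cos
        self.body = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.tau_embed = nn.Sequential(nn.Linear(n_cos, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, obs, act, n_taus=32):
        batch = obs.shape[0]
        h = self.body(torch.cat([obs, act], dim=-1))              # (B, hidden)
        taus = torch.rand(batch, n_taus, 1, device=obs.device)    # quantile fractions in (0, 1)
        i = torch.arange(1, self.n_cos + 1, device=obs.device).float()
        cos = torch.cos(taus * i * math.pi)                       # (B, N, n_cos) cosine features
        phi = self.tau_embed(cos)                                 # (B, N, hidden)
        q = self.head(h.unsqueeze(1) * phi).squeeze(-1)           # (B, N) cost-return quantiles
        return q, taus.squeeze(-1)
```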
“…Worst-case soft actor-critic (WCSAC) [Yang et al. 2023] is a soft actor-critic (SAC) [Haarnoja et al. 2018a; Haarnoja et al. 2018b] based algorithm that uses a distributional safety critic to produce risk-averse behavior. To this end, the upper tail of the estimated distribution is used.…”
Section: Risk-averse Safe Reinforcement Learning
confidence: 99%
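The "upper tail of the estimated distribution" can be collapsed into a scalar risk measure such as CVaR and then used in place of the expected cost in the actor update. The sketch below shows one plausible way to estimate CVaR from IQN-style quantile samples; it illustrates the general idea rather than the exact estimator used in WCSAC.

```python
import torch

def cvar_from_quantiles(quantile_values, taus, alpha=0.9):
    """Estimate CVaR_alpha of the cost return from sampled quantiles.

    `quantile_values` (B, N) are predicted cost-return quantiles at the
    fractions `taus` (B, N).  For costs, risk aversion means focusing on the
    upper tail: we average the quantiles whose fraction exceeds `alpha`,
    i.e. the worst (1 - alpha) share of outcomes.
    """
    tail_mask = (taus >= alpha).float()                  # select the worst-case tail
    tail_weight = tail_mask.sum(dim=1).clamp(min=1.0)    # avoid division by zero
    cvar = (quantile_values * tail_mask).sum(dim=1) / tail_weight
    return cvar                                          # (B,) risk-averse cost estimate
```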
“…Apart from supervised and self-supervised learning, approaches such as meta-learning (Kirsch, van Steenkiste, and Schmidhuber 2020; Guo, Wu, and Lee 2022), transfer learning (Guo et al. 2019; Vrbančič and Podgorelec 2020) and curriculum learning (Bengio et al. 2009; Park and Park 2022; Hu et al. 2022) have also demonstrated their capacity to adapt pre-trained policies to novel environments through retraining (Packer et al. 2019). Integrating techniques such as safe RL into the training process is recommended to allow the agent to avoid hazardous states, thereby ensuring consistent performance and enhancing training efficiency (Yang et al. 2021, 2022). When confronted with an expanded state space, the latter category of solutions, which retrains learned policies, is often favored.…”
Section: Introduction
confidence: 99%