2021
DOI: 10.48550/arxiv.2112.07701
Preprint

Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning

Abstract: Reinforcement Learning (RL) agents in the real world must satisfy safety constraints in addition to maximizing a reward objective. Model-based RL algorithms hold promise for reducing unsafe real-world actions: they may synthesize policies that obey all constraints using simulated samples from a learned model. However, imperfect models can result in real-world constraint violations even for actions that are predicted to satisfy all constraints. We propose Conservative and Adaptive Penalty (CAP), a model-based safe…
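To make the idea in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of a conservative constraint check for model-based safe RL: imagined rollouts through a learned model ensemble are charged their predicted cost plus a penalty proportional to model disagreement, so plans that only look safe because the model is uncertain get rejected. The `ensemble`/`step` interface, the disagreement-based uncertainty measure, and the fixed weight `kappa` are assumptions made for illustration.

```python
# Sketch of a conservative, uncertainty-penalized cost estimate for
# model-based safe RL. All interfaces below are illustrative assumptions.
import numpy as np

def conservative_cost(ensemble, state, actions, kappa):
    """Predicted cumulative cost of an action sequence, inflated by model uncertainty.

    ensemble: list of learned dynamics models, each with
              .step(state, action) -> (next_state, cost)
    kappa:    penalty weight (adapted over training in CAP; fixed here for brevity)
    """
    total = 0.0
    for action in actions:
        next_states, costs = zip(*[m.step(state, action) for m in ensemble])
        mean_cost = float(np.mean(costs))
        # Disagreement between ensemble members stands in for model uncertainty.
        uncertainty = float(np.std(np.stack(next_states), axis=0).sum())
        total += mean_cost + kappa * uncertainty
        # Continue the imagined rollout from the mean predicted next state.
        state = np.mean(np.stack(next_states), axis=0)
    return total

def is_plan_admissible(ensemble, state, actions, cost_budget, kappa=1.0):
    # Accept a candidate plan only if its conservative cost fits the budget.
    return conservative_cost(ensemble, state, actions, kappa) <= cost_budget
```

The penalty weight `kappa` is exactly what CAP adapts during training; a fixed value is used here only to keep the sketch short.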

Cited by 3 publications (4 citation statements)
References 27 publications

“…Through extensive experiments, we have shown that SMODICE significantly outperforms prior state-of-the-art methods in all three settings. We believe that the generality of SMODICE invites many future work directions, including offline model-based RL (Yu et al., 2020; Kidambi et al., 2020), safe RL (Ma et al., 2021b), and extending it to visual domains.…”
Section: Discussion
confidence: 99%
“…In the weakest form, Kurenkov & Kolesnikov (2021) introduce an online evaluation budget that lets them find the best set of hyperparameters for an offline RL algorithm given limited online evaluation resources. Ma et al. (2021) propose a conservative and adaptive penalty that penalizes unknown behavior more at the beginning of training and less toward the end, leading to safer policies during training. Both Pong et al. (2021) and Zhao et al. (2021) propose methods for effective online learning phases that follow the offline learning phase.…”
Section: Related Work
confidence: 99%
“…In this setting, an unsafe area is defined as the set of states for which the ensemble of models has the highest uncertainty. A different strategy is to inflate the cost function with an uncertainty-aware penalty function, as done in CAP [21]. This cost change can be applied to any MBRL algorithm and its conservativeness is automatically tuned through the use of a PI controller.…”
Section: Related Work
confidence: 99%
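The PI-controller tuning mentioned in this excerpt can be sketched as follows. The gains, the per-episode update schedule, and the variable names are assumptions for illustration, not the implementation from the CAP paper.

```python
# Sketch of PI-controlled adaptation of the uncertainty-penalty weight kappa,
# driven by the constraint violation observed in the real environment.
class PenaltyPIController:
    def __init__(self, k_p=0.1, k_i=0.01):
        self.k_p = k_p        # proportional gain (illustrative value)
        self.k_i = k_i        # integral gain (illustrative value)
        self.integral = 0.0   # accumulated cost-budget error
        self.kappa = 0.0      # current penalty weight

    def update(self, episode_cost, cost_budget):
        # error > 0: the real rollout exceeded the cost budget, so planning
        # should become more conservative (larger kappa); error < 0 lets the
        # penalty relax again as the model and policy improve.
        error = episode_cost - cost_budget
        self.integral += error
        self.kappa = max(0.0, self.k_p * error + self.k_i * self.integral)
        return self.kappa
```

A controller like this removes the need to hand-tune the penalty weight per task: conservativeness rises automatically after observed violations and decays once the constraint is reliably satisfied.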
“…1: the agent, thanks to the model of its dynamics, can perform "mental simulation" and select the best plan to attain its goal while avoiding unsafe zones. This contrasts with many of the methods in the literature that address the problem of finding a safe course of action through Lagrangian optimization or by penalizing the reward function [36,21,9]. GuSS avoids unsafe situations by discarding trajectories that are deemed unsafe using the model predictions.…”
Section: Introduction
confidence: 99%
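The "mental simulation" strategy described here can be summarized in a short sketch: candidate plans are rolled out in the learned model, any rollout that enters an unsafe state is discarded, and the best surviving plan is executed. The `model.predict`, `is_unsafe`, and `reward_fn` interfaces are hypothetical placeholders rather than the GuSS code.

```python
# Sketch of safe plan selection by filtering imagined rollouts.
def select_safe_plan(model, state, candidate_plans, is_unsafe, reward_fn):
    """Return the highest-return candidate plan whose simulated rollout stays safe."""
    best_plan, best_return = None, float("-inf")
    for plan in candidate_plans:
        s, total_return, safe = state, 0.0, True
        for action in plan:
            s = model.predict(s, action)   # imagined next state
            if is_unsafe(s):               # discard plans that enter unsafe zones
                safe = False
                break
            total_return += reward_fn(s, action)
        if safe and total_return > best_return:
            best_plan, best_return = plan, total_return
    return best_plan                       # None if every candidate was unsafe
```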