2019
DOI: 10.1609/aaai.v33i01.33014967

Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

Abstract: We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to c…
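As background (not part of the quoted abstract): safe policy improvement of this kind is usually stated as returning a policy that, with high probability, loses at most an admissible margin relative to the baseline. A sketch of that criterion, with notation assumed here rather than taken from the paper:

\Pr\big( \rho(\pi, M^*) \,\geq\, \rho(\pi_b, M^*) - \zeta \big) \,\geq\, 1 - \delta

where \rho(\pi, M^*) is the expected return of policy \pi in the true environment M^*, \pi_b is the baseline policy, \zeta is the admissible performance loss, and \delta is the allowed failure probability.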

Cited by 12 publications (6 citation statements)
References 9 publications
“…Interestingly, a very recent paper that is not in our analyzed corpus studied the problem of applying fairness constraints on node ranks in a graph [12]. Paper 19 is on safe policy improvement in RL [27]. Safe policies in RL refer to policies that maximize expected return in problems where ensuring certain safety constraints is important alongside satisfactory performance [8]. In the context of ML fairness, safe policies can potentially be policies that satisfy equitable performance guarantees for sensitive demographic subgroups.…”
Section: Discussion
confidence: 99%
“…In this section, we study a particular algorithm family called SPIBB for Safe Policy Improvement with Baseline Bootstrapping [23,28,31,40,41,34]. It consists in allowing policy change only when the change is sufficiently supported by the dataset.…”
Section: Offline RL: SPIBB Theory
confidence: 99%
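The bootstrapping rule described in this statement can be sketched in a few lines for the tabular case. The code below is illustrative only: names such as n_wedge and the single-state formulation are assumptions, not taken from the cited papers.

```python
import numpy as np

def spibb_policy_for_state(q, pi_b, counts, n_wedge):
    """Illustrative Pi_b-SPIBB-style rule for one state.

    q:       estimated action values in this state, shape (n_actions,)
    pi_b:    baseline policy probabilities in this state, shape (n_actions,)
    counts:  dataset counts N(s, a) for this state, shape (n_actions,)
    n_wedge: count threshold below which the baseline is kept ("bootstrapped")
    """
    pi = np.zeros_like(pi_b)
    bootstrapped = counts < n_wedge      # too little data: keep the baseline probabilities
    pi[bootstrapped] = pi_b[bootstrapped]
    free_mass = 1.0 - pi.sum()           # probability mass the update is allowed to move
    well_estimated = np.flatnonzero(~bootstrapped)
    if well_estimated.size > 0:
        best = well_estimated[np.argmax(q[well_estimated])]
        pi[best] += free_mass            # act greedily only where the data supports it
    else:
        pi = pi_b.copy()                 # nothing is well estimated: fall back to the baseline
    return pi
```

In a full algorithm this per-state rule would sit inside policy iteration on a model estimated from the dataset; the sketch only conveys why the method is conservative exactly where the data is scarce.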
“…From there, they re-derive AWR. Interestingly, this approximation relies on a construction, given explicitly in their equation (41), which corresponds precisely to our definition of π in their specific setting.…”
Section: Experience Replays in AWR
confidence: 99%
“…In order to maximize the performance of this simple model, we instead consider the nearest abstract Cartesian state according to the abstraction function that gave the best performance in experiments (in this case: {x, y}). The Factored Batch RL approach is a method that has been used as a baseline in other offline RL work (Simão and Spaan 2019). It uses complete knowledge of the transition dynamics in the form of a dynamic Bayesian network and calculates the parameters of the model with the offline data.…”
Section: Tabular Experiments
confidence: 99%
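To make the baseline described above concrete, the following is a rough sketch of estimating per-factor transition probabilities from offline data when the dynamic Bayesian network structure is given. Function and variable names (estimate_factored_transitions, parents, and the dataset layout) are illustrative assumptions, not the cited implementation.

```python
from collections import defaultdict

def estimate_factored_transitions(dataset, parents):
    """Estimate P(x_i' | parents_i(s, a)) for every state factor i.

    dataset: iterable of (state, action, next_state), states given as tuples of factor values
    parents: dict mapping factor index i -> indices of its parents in the (state, action) tuple;
             the DBN structure is assumed known, as in the Factored Batch RL baseline.
    """
    counts = {i: defaultdict(lambda: defaultdict(int)) for i in parents}
    for state, action, next_state in dataset:
        sa = tuple(state) + (action,)
        for i, pa in parents.items():
            parent_vals = tuple(sa[j] for j in pa)
            counts[i][parent_vals][next_state[i]] += 1

    # Normalize counts into one conditional probability table per factor.
    probs = {}
    for i, table in counts.items():
        probs[i] = {
            parent_vals: {v: c / sum(child.values()) for v, c in child.items()}
            for parent_vals, child in table.items()
        }
    return probs
```

A full model would then combine these per-factor tables into a joint transition distribution by multiplying factors, which is what makes the factored representation more data-efficient than a flat tabular model.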