2020
DOI: 10.48550/arxiv.2006.06547
Preprint

Avoiding Side Effects in Complex Environments

Abstract: Reward function specification can be difficult, even in simple environments. Realistic environments contain millions of states. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoids side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway's Game of Life. By preserving…
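For orientation, the AUP penalty in its earlier toy-environment formulation takes roughly the following form (a sketch reconstructed from the prior AUP papers rather than quoted from this preprint; the scaled-up SafeLife version may learn its auxiliary value functions differently and normalize the penalty differently):

    % Primary reward R minus a penalty for shifting the agent's ability
    % to optimize N auxiliary reward functions R_i; \varnothing denotes
    % the no-op action and \lambda controls penalty strength.
    R_{\mathrm{AUP}}(s, a) = R(s, a) - \frac{\lambda}{N} \sum_{i=1}^{N}
        \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|

Here the Q_{R_i} are action-value functions for the auxiliary goals, so the penalty grows when an action changes how attainable those goals remain relative to doing nothing.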

Cited by 3 publications (4 citation statements); references 9 publications.

“…That is, RL algorithms often discover ways to achieve high returns by unexpected, unintended means. In general, there is nuance in how we might want agents to behave, such as obeying social norms, that are difficult to account for and communicate effectively through an engineered reward function (Amodei et al, 2016;Shah et al, 2019;Turner et al, 2020). A popular way to avoid reward engineering is through imitation learning, during which a learner distills information about its objectives or tries to directly follow an expert (Schaal, 1997;Ng et al, 2000;Abbeel & Ng, 2004;Argall et al, 2009).…”
Section: Introduction (mentioning)
confidence: 99%
“…Our work focuses on expanding the concept of side effects to more complex environments, generated by the SafeLife suite [Wainwright and Eckersley, 2020], which provides more complex environment dynamics and tasks that cannot be solved in a tabular fashion. Turner et al [2020] recently extended their approach to environments in the SafeLife suite, suggesting that attainable utility preservation can be used as an alternative to the SafeLife side metric described in Wainwright and Eckersley [2020] and Section 2. The primary differentiating feature of SARL is that it is metric agnostic, for both the reward and side effect measure, making it orthogonal and complementary to the work by Turner et al [2020].…”
Section: Side Effects in Reinforcement Learning Environments (mentioning)
confidence: 99%
“…However, today's RL systems require substantial manual human effort to engineer rewards for each task, which comes with two fundamental drawbacks. The human effort required to design rewards is impractical to scale across numerous and diverse task categories, and the engineered rewards can often be exploited by the RL agent to produce unintended and potentially unsafe control policies [8,9,10]. Moreover, it becomes increasingly difficult to design reward functions for the kinds of complex tasks with compositional structure often encountered in real-world settings.…”
Section: Introduction (mentioning)
confidence: 99%