2020
DOI: 10.48550/arxiv.2006.06547
Preprint

Avoiding Side Effects in Complex Environments

Abstract: Reward function specification can be difficult, even in simple environments. Realistic environments contain millions of states. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoids side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway's Game of Life. By preserving…
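For orientation, the AUP penalty in its earlier toy-environment formulation takes roughly the following form (a sketch reconstructed from the prior AUP papers rather than quoted from this preprint; the scaled-up SafeLife version may learn its auxiliary value functions differently and normalize the penalty differently):

    % Primary reward R minus a penalty for shifting the agent's ability
    % to optimize N auxiliary reward functions R_i; \varnothing denotes
    % the no-op action and \lambda controls penalty strength.
    R_{\mathrm{AUP}}(s, a) = R(s, a) - \frac{\lambda}{N} \sum_{i=1}^{N}
        \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|

Here the Q_{R_i} are action-value functions for the auxiliary goals, so the penalty grows when an action changes how attainable those goals remain relative to doing nothing.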

Cited by 3 publications (4 citation statements); references 9 publications.

“…That is, RL algorithms often discover ways to achieve high returns by unexpected, unintended means. In general, there is nuance in how we might want agents to behave, such as obeying social norms, that are difficult to account for and communicate effectively through an engineered reward function (Amodei et al, 2016;Shah et al, 2019;Turner et al, 2020). A popular way to avoid reward engineering is through imitation learning, during which a learner distills information about its objectives or tries to directly follow an expert (Schaal, 1997;Ng et al, 2000;Abbeel & Ng, 2004;Argall et al, 2009).…”
Section: Introduction (mentioning)
confidence: 99%
“…Our work focuses on expanding the concept of side effects to more complex environments, generated by the SafeLife suite [Wainwright and Eckersley, 2020], which provides more complex environment dynamics and tasks that cannot be solved in a tabular fashion. Turner et al [2020] recently extended their approach to environments in the SafeLife suite, suggesting that attainable utility preservation can be used as an alternative to the SafeLife side metric described in Wainwright and Eckersley [2020] and Section 2. The primary differentiating feature of SARL is that it is metric agnostic, for both the reward and side effect measure, making it orthogonal and complementary to the work by Turner et al [2020].…”
Section: Side Effects in Reinforcement Learning Environments (mentioning)
confidence: 99%
“…However, today's RL systems require substantial manual human effort to engineer rewards for each task, which comes with two fundamental drawbacks. The human effort required to design rewards is impractical to scale across numerous and diverse task categories, and the engineered rewards can often be exploited by the RL agent to produce unintended and potentially unsafe control policies [8,9,10]. Moreover, it becomes increasingly difficult to design reward functions for the kinds of complex tasks with compositional structure often encountered in real-world settings.…”
Section: Introduction (mentioning)
confidence: 99%