Interactive Learning from Policy-Dependent Human Feedback

MacGlashan, James; Ho, Mark K.; Loftin, Robert; Peng, Bei; Roberts, David C. S.; Taylor, Matthew E.; Littman, Michael L.

doi:10.48550/arxiv.1701.06049

Cited by 12 publications

(17 citation statements)

References 0 publications

“…Human-in-the-Loop Policy Learning: Human-in-the-Loop Policy Learning allows a human to provide additional supervision during the policy learning process. One paradigm is Reinforcement Learning (RL) with human feedback [40], where a human provides rewards during agent training [8,11,24,25,34,36], but this suffers from the same limitations as IRL due to the need for extensive agent interaction.…”

Section: Related Work (mentioning)

confidence: 99%

Human-in-the-Loop Imitation Learning using Remote Teleoperation

Mandlekar, Xu, Roberto et al. 2020

Preprint

Imitation Learning is a promising paradigm for learning complex robot manipulation skills by reproducing behavior from human demonstrations. However, manipulation tasks often contain bottleneck regions that require a sequence of precise actions to make meaningful progress, such as a robot inserting a pod into a coffee machine to make coffee. Trained policies can fail in these regions because small deviations in actions can lead the policy into states not covered by the demonstrations. Intervention-based policy learning is an alternative that can address this issue: it allows human operators to monitor trained policies and take over control when they encounter failures. In this paper, we build a data collection system tailored to 6-DoF manipulation settings, that enables remote human operators to monitor and intervene on trained policies. We develop a simple and effective algorithm to train the policy iteratively on new data collected by the system that encourages the policy to learn how to traverse bottlenecks through the interventions. We demonstrate that agents trained on data collected by our intervention-based system and algorithm outperform agents trained on an equivalent number of samples collected by non-interventional demonstrators, and further show that our method outperforms multiple state-of-the-art baselines for learning from the human interventions on a challenging robot threading task and a coffee making task. Additional results and videos at https://sites.google.com/stanford.edu/iwr

“…Our work relates closely to the growing literature of interactive reinforcement learning (RL), or human-centered RL [2,21,22,23,24,25,26,27,28,29], in which agents learn from interactions with humans in addition to, or instead of, predefined environmental rewards. In the EMPATHIC framework, we use the term implicit human feedback to refer to any multi-modal evaluative signals humans naturally emit during social interactions, including facial expressions, tone of voice, head gestures, hand gestures and other body-language and vocalization modalities not aimed at explicit communication.…”

Section: Related Work (mentioning)

confidence: 99%

The EMPATHIC Framework for Task Learning from Implicit Human Feedback

Cui, Zhang, Allievi et al. 2020

Preprint

Reactions such as gestures, facial expressions, and vocalizations are an abundant, naturally occurring channel of information that humans provide during interactions. A robot or other agent could leverage an understanding of such implicit human feedback to improve its task performance at no cost to the human. This approach contrasts with common agent teaching methods based on demonstrations, critiques, or other guidance that need to be attentively and intentionally provided. In this paper, we first define the general problem of learning from implicit human feedback and then propose to address this problem through a novel data-driven framework, EMPATHIC. This two-stage method consists of (1) mapping implicit human feedback to relevant task statistics such as rewards, optimality, and advantage; and (2) using such a mapping to learn a task. We instantiate the first stage and three second-stage evaluations of the learned mapping. To do so, we collect a dataset of human facial reactions while participants observe an agent execute a sub-optimal policy for a prescribed training task. We train a deep neural network on this data and demonstrate its ability to (1) infer relative reward ranking of events in the training task from prerecorded human facial reactions; (2) improve the policy of an agent in the training task using live human facial reactions; and (3) transfer to a novel domain in which it evaluates robot manipulation trajectories.

“…Preference learning. Much recent work has learned preferences from different sources of data, such as demonstrations (Ziebart et al, 2010; Ramachandran and Amir, 2007; Ho and Ermon, 2016; Fu et al, 2017; Finn et al, 2016), comparisons (Christiano et al, 2017; Sadigh et al, 2017; Wirth et al, 2017), ratings (Daniel et al, 2014), human reinforcement signals (Knox and Stone, 2009; Warnell et al, 2017; MacGlashan et al, 2017), proxy rewards (Hadfield-Menell et al, 2017), etc. We suggest preference learning with a new source of data: the state of the environment when the robot is first deployed.…”

Section: Related Work (mentioning)

confidence: 99%

Preferences Implicit in the State of the World

Shah, Dmitrii, Alexander et al. 2019

Preprint

Reinforcement learning (RL) agents optimize only the features specified in a reward function and are indifferent to anything left out inadvertently. This means that we must not only specify what to do, but also the much larger space of what not to do. It is easy to forget these preferences, since these preferences are already satisfied in our environment. This motivates our key insight: when a robot is deployed in an environment that humans act in, the state of the environment is already optimized for what humans want. We can therefore use this implicit preference information from the state to fill in the blanks. We develop an algorithm based on Maximum Causal Entropy IRL and use it to evaluate the idea in a suite of proof-of-concept environments designed to show its properties. We find that information from the initial state can be used to infer both side effects that should be avoided as well as preferences for how the environment should be organized. Our code can be found at https://github.com/HumanCompatibleAI/rlsp.
