2021
DOI: 10.1177/02783649211041652

Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences

Abstract: Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward…

Cited by 50 publications (26 citation statements) | References 32 publications
“…Prior work has explored learning from expert behaviour and preferences (Ibarz et al, 2018; Palan et al, 2019; Bıyık et al, 2022; Koppol et al, 2020), or other multi-modal data sources (Tung et al, 2018; Jeon et al, 2020). One motivation is that different data sources may provide complementary reward information (Koppol et al, 2020), decreasing ambiguity.…”
Section: Related Work
confidence: 99%
“…Hence it is natural to integrate preference and action demonstration via a joint IRL framework (Palan et al, 2019; Bıyık et al, 2020), with a nice insight that these two sources of information are complementary under the IRL framework: "demonstrations provide a high-level initialization of the human's overall reward functions, while preferences explore specific, fine-grained aspects of it" (Bıyık et al, 2020). Therefore they use demonstrations to initialize a reward distribution, and refine the reward function with preference queries (Palan et al, 2019; Bıyık et al, 2020). Ibarz et al (2018) takes a different approach to combine demonstration and preference information, by using human demonstrations to pre-train the agent.…”
Section: Learning From Human Preference
confidence: 99%
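The excerpt above describes the two-stage idea: demonstrations give a coarse initialization of the reward, and pairwise preference queries then refine it. The sketch below is a minimal stand-in, not the cited papers' method; it assumes a linear reward w^T phi(xi), and the feature dimension, step sizes, and synthetic demonstrations and queries are illustrative assumptions.

```python
# Minimal sketch (not the cited papers' implementation) of combining
# demonstrations and preferences: demonstrations give a coarse initialization
# of a linear reward w^T phi(xi), and pairwise preference queries then refine
# w via a Bradley-Terry / logistic likelihood. All dimensions, step sizes, and
# synthetic data below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # number of reward features (assumed)
w_true = rng.normal(size=d)  # hidden "human" reward weights

def segment_features(n=20):
    """Random trajectory feature sums standing in for phi(xi)."""
    return rng.normal(size=(n, d))

# --- Step 1: initialize from demonstrations --------------------------------
# Treat each demonstration as (approximately) better than a random alternative
# and fit w by logistic regression on those comparisons (a crude surrogate for
# the Bayesian posterior over w that the cited works maintain).
demos = segment_features(10) + 0.5 * w_true   # demos correlate with w_true
alts = segment_features(10)
w = np.zeros(d)
for _ in range(200):
    diff = demos - alts                       # demo preferred over alternative
    p = 1.0 / (1.0 + np.exp(-diff @ w))
    w += 0.1 * ((1.0 - p)[:, None] * diff).mean(axis=0)

# --- Step 2: refine with preference queries --------------------------------
# Each query compares two segments; the (simulated) human prefers the one with
# higher true reward. Update w with the Bradley-Terry gradient.
for _ in range(300):
    a = segment_features(1)[0]
    b = segment_features(1)[0]
    diff = (a - b) if (a - b) @ w_true > 0 else (b - a)
    p = 1.0 / (1.0 + np.exp(-diff @ w))
    w += 0.05 * (1.0 - p) * diff

cos = w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true))
print(f"alignment with true reward after demos + preferences: {cos:.2f}")
```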
“…Under this framework, it is possible to develop a unified learning paradigm that accepts multiple types of human guidance. We start to notice efforts towards this goal (Abel et al, 2017; Waytowich et al, 2018; Goecks et al, 2019; Woodward et al, 2020; Najar et al, 2020; Bıyık et al, 2020).…”
Section: A Unified Learning Framework
confidence: 99%
“…Influential recent research has focused on reward learning from preferences over pairs of fixed-length trajectory segments. Nearly all of this recent work assumes that human preferences arise probabilistically from only the sum of rewards over a segment, i.e., the segment's partial return [9–16]. That is, these works assume that people tend to prefer trajectory segments that yield greater rewards during the segment.…”
Section: Introduction
confidence: 99%
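The partial-return assumption described in this excerpt is typically written as a Bradley-Terry (logistic) model over the difference of summed segment rewards. Below is a minimal sketch of that likelihood; the reward values, segment length, and rationality parameter beta are illustrative assumptions, not taken from any of the cited works.

```python
# Sketch of the partial-return preference model: the probability that a human
# prefers segment sigma_1 over sigma_2 is modeled as a logistic function of the
# difference of their summed rewards. Values below are placeholders.
import numpy as np

def partial_return(step_rewards):
    """Sum of rewards over a fixed-length trajectory segment."""
    return float(np.sum(step_rewards))

def preference_probability(seg1_rewards, seg2_rewards, beta=1.0):
    """P(sigma_1 preferred over sigma_2) under the partial-return assumption.

    beta is an (assumed) rationality/temperature parameter: larger beta means
    a near-deterministic preference for the higher-return segment.
    """
    delta = partial_return(seg1_rewards) - partial_return(seg2_rewards)
    return 1.0 / (1.0 + np.exp(-beta * delta))

# Example: the segment with slightly higher summed reward is preferred more
# often than not, but not with certainty.
seg1 = [0.0, 1.0, 0.5]   # partial return 1.5
seg2 = [0.5, 0.5, 0.0]   # partial return 1.0
print(preference_probability(seg1, seg2))   # ~0.62
```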