2017
DOI: 10.48550/arxiv.1706.03741
Preprint

Deep reinforcement learning from human preferences

Abstract: For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than 1% of our agent's interactions with the environment.
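
The approach summarized in the abstract can be sketched in a few lines. The code below is illustrative rather than taken from the paper's implementation: it fits a reward model so that the trajectory segment with the higher predicted return is the one more likely to be preferred (a Bradley-Terry style comparison over segment returns). The network architecture, tensor shapes, and function names are assumptions.

```python
# Minimal sketch of preference-based reward learning: a reward model is trained
# so that the segment a human prefers tends to have the higher predicted return.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # Per-step reward estimate; output shape (..., T)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def preference_loss(model: RewardModel, seg_a, seg_b, pref: torch.Tensor) -> torch.Tensor:
    """seg_a, seg_b: (obs, act) tensors of shape (batch, T, dim).
    pref: float tensor of shape (batch,), 1.0 if segment A was preferred, 0.0 if B."""
    ret_a = model(*seg_a).sum(dim=-1)  # predicted return of segment A
    ret_b = model(*seg_b).sum(dim=-1)  # predicted return of segment B
    # P(A preferred) = exp(ret_a) / (exp(ret_a) + exp(ret_b)) = sigmoid(ret_a - ret_b)
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, pref)
```

The learned reward can then stand in for the missing environment reward when training a standard RL agent.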

Cited by 76 publications (79 citation statements)
References 14 publications (23 reference statements)

“…PREFERENCES: A reward model learned using preference comparisons, similar to the approach from Christiano et al (2017). We synthesize preference labels based on ground-truth return.…”
Section: Methods (mentioning, confidence: 99%)
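
The "synthesize preference labels based on ground-truth return" step quoted above can be sketched as follows; the function name and array-based interface are assumptions for illustration, not the cited authors' code.

```python
# Synthetic preference label: instead of querying a human, compare the
# ground-truth returns of the two segments. Names and shapes are assumptions.
import numpy as np


def synthetic_preference(rewards_a: np.ndarray, rewards_b: np.ndarray) -> float:
    """rewards_a, rewards_b: per-step ground-truth rewards of two segments.
    Returns 1.0 if segment A has the higher return, 0.0 if B does, 0.5 on ties."""
    ret_a, ret_b = rewards_a.sum(), rewards_b.sum()
    if ret_a > ret_b:
        return 1.0
    if ret_b > ret_a:
        return 0.0
    return 0.5
```
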
“…Rewards can be learned from supervision provided by explicit labels. For example, human-labeled preferences between demonstrations can be used to learn reward functions (Christiano et al, 2017). Rewards can also be labeled on a per-timestep basis, allowing for learning via supervised regression (Cabi et al, 2019).…”
Section: Learning Reward Functions (mentioning, confidence: 99%)
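
For the per-timestep alternative mentioned in the quote (rewards labeled at every step), reward learning reduces to ordinary supervised regression. A minimal sketch with assumed tensor shapes and an assumed model interface:

```python
# Per-timestep reward labels allow fitting the reward model by plain regression
# rather than preference comparisons. Illustrative only.
import torch
import torch.nn as nn


def regression_loss(model: nn.Module, obs: torch.Tensor, act: torch.Tensor,
                    reward_labels: torch.Tensor) -> torch.Tensor:
    """obs, act: (batch, dim) tensors; reward_labels: (batch,) per-step labels."""
    pred = model(torch.cat([obs, act], dim=-1)).squeeze(-1)
    return nn.functional.mse_loss(pred, reward_labels)
```
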
“…Apart from the vast array of work from the Imitation Learning literature (Zheng et al 2021), which is largely motivated by the difficulty of designing reward functions and seeks to learn a policy from expert demonstrations instead, several other approaches have been specifically aimed at the reward specification problem. Some of these approaches introduce a human in the loop to either guide the agent towards the desired behavior (Christiano et al 2017) or to prevent it from making catastrophic errors while exploring the environment (Saunders et al 2017). While our approach of using CMDPs for behavior specification also seeks to make better use of human knowledge, we focus on tasks where the human input can be provided as a zero-shot interaction ahead of training time by simply specifying indicator cost functions and their corresponding thresholds rather than requiring human feedback during the training process.…”
Section: Reward Specification (mentioning, confidence: 99%)
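
The zero-shot specification idea quoted above, indicator cost functions with thresholds interpreted as a constrained MDP, can be illustrated with a small sketch; the environment `info` fields, names, and threshold value are assumptions, not the cited work's interface.

```python
# Behavior specification ahead of training: each constraint is an indicator cost
# (1.0 when the undesired event occurs at a step) plus a threshold on its total.
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Constraint:
    cost_fn: Callable[[dict], float]  # indicator cost evaluated on a step's info dict
    threshold: float                  # maximum allowed total cost per episode


# Example: "spend at most 5 steps per episode in the unsafe region" (hypothetical)
constraints: Dict[str, Constraint] = {
    "unsafe_region": Constraint(
        cost_fn=lambda info: 1.0 if info.get("in_unsafe_region", False) else 0.0,
        threshold=5.0,
    ),
}


def episode_costs(step_infos: Iterable[dict]) -> Dict[str, float]:
    """Accumulate each indicator cost over an episode's step info dicts."""
    step_infos = list(step_infos)
    return {
        name: sum(c.cost_fn(info) for info in step_infos)
        for name, c in constraints.items()
    }
```

A constrained RL algorithm would then keep each accumulated cost below its threshold while maximizing the task reward, with no human feedback needed during training.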
“…E.g., until now, many open-sourced RL datasets [19,22,27] are shared as a set of state/action pairs comparable to example/label pairs in supervised learning. Although this format is convenient for some algorithms [36], it discards the temporal information initially present in the data and prevents building methods exploiting such information [34,24,25,9,14]. Even when temporal information is present in a dataset, what constitutes a step, a transition, and an episode is not always consistent.…”
Section: Introduction (mentioning, confidence: 99%)
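
The distinction drawn in the quote, flat state/action pairs versus data that keeps temporal structure, can be made concrete with a small sketch of an episode-structured container; the field names are assumptions and do not refer to any particular open-sourced dataset format.

```python
# An episode-structured layout keeps step ordering and episode boundaries,
# while the flat view below discards them. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Transition:
    observation: Any
    action: Any
    reward: float
    next_observation: Any
    terminal: bool  # did the episode end at this step?


@dataclass
class Episode:
    transitions: List[Transition] = field(default_factory=list)

    def as_pairs(self) -> List[Tuple[Any, Any]]:
        # The "flat" view: (observation, action) pairs with ordering and
        # episode boundaries discarded, as in supervised-learning-style datasets.
        return [(t.observation, t.action) for t in self.transitions]
```
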