2022
DOI: 10.48550/arxiv.2206.02231
Preprint

Models of human preference for learning reward functions

Abstract: The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments. These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling preferences instead as arising from a different statistic: each s…

Cited by 5 publications (9 citation statements)
References 14 publications

Citation statements:
“…[0, 1]) specifies the probability of an action given a state. Q^π_r and V^π_r refer respectively to the state-action value function and state value function for a policy, π, under r, and are defined as follows.…” (Footnote in the quoted text: Appendix B.2 of Knox et al (2022) includes discounting.)
Section: Preliminaries: Preference Models For Learning Reward Functions
confidence: 99%
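The excerpt cuts off before the definitions themselves. For reference, a standard undiscounted finite-horizon form of these value functions is given below; this is an illustrative assumption, not necessarily the exact equations in the citing paper.

```latex
V^{\pi}_{r}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{T} r(s_t, a_t) \;\middle|\; s_0 = s \right],
\qquad
Q^{\pi}_{r}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{T} r(s_t, a_t) \;\middle|\; s_0 = s,\; a_0 = a \right].
```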
“…The right pair of segments instead illustrates the effect of differing start states, equivalent partial return, and the same end states (dark blue), and it permits an identical analysis. Knox et al (2022) showed that regret-based preferences have the desirable theoretical property of identifiability and that partial return does not. Further, the regret model better fit the dataset of human preferences they collected.…”
Section: Goal
confidence: 99%
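To make the contrast concrete, here is a minimal Python sketch of the two segment statistics discussed in this statement, assuming deterministic transitions and access to optimal state values at a segment's first and last states. The function names, the simplified regret form, and the numbers are illustrative rather than the exact formulation in Knox et al (2022).

```python
import numpy as np

def partial_return(rewards):
    """Sum of rewards along a trajectory segment."""
    return float(np.sum(rewards))

def regret(rewards, v_start, v_end):
    """Illustrative deterministic-transition regret of a segment: how far the
    segment falls short of acting optimally from its start state, using the
    optimal state values at its first and last states."""
    return v_start - (partial_return(rewards) + v_end)

def preference_prob(stat_1, stat_2, beta=1.0):
    """Logistic (Boltzmann) preference model: probability that segment 1 is
    preferred, where a larger statistic means a better segment."""
    return 1.0 / (1.0 + np.exp(-beta * (stat_1 - stat_2)))

# Two segments with equal partial return but different start states:
seg_a = dict(rewards=[0.0, 0.0, 1.0], v_start=1.0, v_end=0.0)
seg_b = dict(rewards=[1.0, 0.0, 0.0], v_start=3.0, v_end=0.0)

p_partial = preference_prob(partial_return(seg_a["rewards"]),
                            partial_return(seg_b["rewards"]))   # 0.5: indistinguishable
p_regret = preference_prob(-regret(**seg_a), -regret(**seg_b))  # ~0.88: prefers segment a
print(p_partial, p_regret)
```

Under partial return the two segments are indistinguishable, while the regret statistic separates them because their start states differ, mirroring the situation described in the quoted passage.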
“…These works typically assume that noisy human preferences over a pair of trajectories are correlated with the difference in their utilities (i.e., the reward acts as a latent term predictive of preference). Many contemporary methods estimate the latent reward by minimizing the cross-entropy loss between the reward-based predictions and the human-provided preferences (i.e., finding the reward that maximizes the likelihood of the observed preferences) [16,20,27,26,33].…”
Section: Related Work
confidence: 99%
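As a concrete illustration of the recipe described above, here is a minimal PyTorch sketch of that cross-entropy objective. The reward_net, tensor shapes, and names are placeholders assumed for the example, not taken from any specific cited method.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_net, seg_1, seg_2, prefs):
    """Cross-entropy between reward-based preference predictions and human labels.

    seg_1, seg_2: (batch, T, obs_dim) tensors holding paired trajectory segments.
    prefs: (batch,) float labels, 1.0 if segment 1 was preferred, 0.0 otherwise.
    """
    # Predicted per-step rewards, summed into an estimated return per segment.
    ret_1 = reward_net(seg_1).squeeze(-1).sum(dim=1)
    ret_2 = reward_net(seg_2).squeeze(-1).sum(dim=1)
    # Bradley-Terry style prediction: P(segment 1 preferred) = sigmoid(ret_1 - ret_2).
    logits = ret_1 - ret_2
    # Minimizing this loss maximizes the likelihood of the observed preferences.
    return F.binary_cross_entropy_with_logits(logits, prefs)

# Example with placeholder shapes and a linear per-step reward model (all hypothetical):
reward_net = torch.nn.Linear(8, 1)
seg_1, seg_2 = torch.randn(16, 20, 8), torch.randn(16, 20, 8)
prefs = torch.randint(0, 2, (16,)).float()
preference_loss(reward_net, seg_1, seg_2, prefs).backward()
```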
“…[24] propose VAE to learn user preferences for spatial arrangement based on just the final state, while our approach models temporal preference from demonstrations. Preference-based RL learns rewards based on human preferences [57,58,59,60], but do not generalize to unseen preferences. On complex long-horizon tasks, modeling human preferences enables faster learning than RL, even with carefully designed rewards [61].…”
Section: Preferences and Prompt Training
confidence: 99%