2023
DOI: 10.48550/arxiv.2301.11270
Preprint

Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons

Abstract: We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstr…
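For context, the following is a minimal sketch of the standard BTL setup the abstract refers to, with notation assumed here rather than taken from the paper: a linear reward $r_\theta(s,a) = \theta^\top \phi(s,a)$ induces a preference probability between two compared actions, and the MLE fits $\theta$ to the observed comparisons.

```latex
% A minimal sketch (notation assumed, not quoted from the paper):
% under the BTL model with linear reward r_\theta(s,a) = \theta^\top \phi(s,a),
% the probability that action a^1 is preferred to a^0 given state s is
\[
  \mathbb{P}_\theta\!\left(a^1 \succ a^0 \mid s\right)
  = \frac{\exp\!\left(\theta^\top \phi(s, a^1)\right)}
         {\exp\!\left(\theta^\top \phi(s, a^0)\right) + \exp\!\left(\theta^\top \phi(s, a^1)\right)}.
\]
% The MLE whose convergence the abstract discusses maximizes the log-likelihood
% of the n observed pairwise comparisons:
\[
  \hat{\theta}_{\mathrm{MLE}}
  \in \arg\max_{\theta}\; \sum_{i=1}^{n}
      \log \mathbb{P}_\theta\!\left(a^{1}_i \succ a^{0}_i \mid s_i\right).
\]
```

The PL model generalizes this comparison probability to $K$-wise rankings; roughly, the pessimistic MLE variant discussed in the abstract additionally acts conservatively in directions that the comparison data cover poorly.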

Cited by 2 publications (1 citation statement). References 34 publications.
“…The AI feedback focuses on controlling the outputs to be less harmful by explaining its objections to dangerous queries. Moreover, recently a preliminary theoretical analysis of the RLAIF [51] justifies the empirical success of RLHF and provides new insights for specialized RLHF algorithm design for language models.…”
Section: Reinforcement Learning From Human Feedback (citation type: mentioning)
Confidence: 98%