2021 · Preprint
DOI: 10.48550/arxiv.2111.04850

Dueling RL: Reinforcement Learning with Trajectory Preferences

Abstract: We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only as a 1-bit (0/1) preference over a pair of trajectories rather than absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model; this, however, depends on how accurately a system designer can express an appropriate reward function, which is often a non-trivial task. The main novelty of our fram…
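To make the 1-bit trajectory feedback concrete, here is a minimal Python sketch of a preference oracle, assuming a Bradley-Terry (logistic) comparison model over latent trajectory returns. The function names and the reward function are hypothetical illustrations, not the paper's construction.

```python
import numpy as np

def trajectory_return(traj, reward_fn):
    """Sum of per-step latent rewards along a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in traj)

def preference_feedback(traj_0, traj_1, reward_fn, rng):
    """Return a 1-bit preference: 1 if traj_1 is preferred over traj_0.

    Assumed Bradley-Terry model: traj_1 wins with probability
    sigmoid(return(traj_1) - return(traj_0)).
    """
    gap = trajectory_return(traj_1, reward_fn) - trajectory_return(traj_0, reward_fn)
    p_1_wins = 1.0 / (1.0 + np.exp(-gap))
    return int(rng.random() < p_1_wins)

# Hypothetical usage: the latent reward favors action 1, so traj_1 usually wins.
rng = np.random.default_rng(0)
reward_fn = lambda s, a: float(a)
traj_0 = [(0, 0), (1, 0)]
traj_1 = [(0, 1), (1, 1)]
bit = preference_feedback(traj_0, traj_1, reward_fn, rng)
```

Note that the agent only ever observes `bit`; the scalar returns remain latent, which is what distinguishes PbRL from reward-based RL.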

Cited by 3 publications (8 citation statements) · References 14 publications
“…Then the α-Eluder dimension of F_T is at most O(dr^2 log(rLS_h/α)). Therefore, our results subsume the setting of logistic preference functions (Pacchiano et al., 2021) as a special case.…”
Section: General Function Approximation (mentioning)
Confidence: 69%
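For context, the "logistic preference functions" referred to in this statement are commonly written as a Bradley-Terry link applied to trajectory feature differences. The rendering below is a plausible sketch; the exact parameterization in Pacchiano et al. (2021) may differ.

```latex
% Bradley-Terry / logistic preference model over trajectory features.
% phi : trajectories -> R^d is a known feature map, theta* the unknown parameter.
P(\tau^1 \succ \tau^0) = \sigma\!\big(\langle \theta^\ast,\, \phi(\tau^1) - \phi(\tau^0) \rangle\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
```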
“…The function spaces are general sets of functions, which may be either finitely parameterized or nonparametric. This setting is more general than the previous theoretical results for PbRL (Novoseller et al., 2020; Xu et al., 2020b; Pacchiano et al., 2021). Our contributions are summarized as follows:…”
Section: Introduction (mentioning)
Confidence: 85%
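Where the statement above contrasts general function classes with finitely parameterized ones, the following minimal Python sketch shows the simplest parameterized instance: fitting a linear-logistic preference model to 1-bit comparison labels by gradient ascent on the log-likelihood. All names are illustrative; this is not the estimator from any of the cited papers.

```python
import numpy as np

def fit_preference_model(X, y, lr=0.1, n_iters=500):
    """Maximum-likelihood fit of a linear-logistic preference model.

    X : (n, d) array of feature differences phi(tau_1) - phi(tau_0).
    y : (n,) array of 1-bit labels (1 if tau_1 was preferred).
    Returns the estimated parameter vector theta.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))  # predicted win probabilities
        theta += lr * (X.T @ (y - p)) / n       # gradient ascent on log-likelihood
    return theta

# Hypothetical usage: recover a planted parameter from noisy 1-bit labels.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(1000, 2))
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)
theta_hat = fit_preference_model(X, y)
```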