On the other hand, in the ranking literature, most of the theoretical work focuses on the tabular case where the rewards for different actions are uncorrelated (Feige et al., 1994; Shah et al., 2015; Shah and Wainwright, 2017; Heckel et al., 2018; Mao et al., 2018; Jang et al., 2017; Chen et al., 2013; Chen and Suh, 2015; Rajkumar and Agarwal, 2014; Negahban et al., 2018; Hajek et al., 2014; Heckel et al., 2019). A majority of the empirical literature focuses on the framework of learning to rank via MLE under general function approximation, especially when the reward is parameterized by a neural network (Liu et al., 2009; Xia et al., 2008; Cao et al., 2007; Christiano et al., 2017a; Ouyang et al., 2022; Brown et al., 2019; Shin et al., 2023; Busa-Fekete et al., 2014; Wirth et al., 2016, 2017; Christiano et al., 2017b; Abdelkareem et al., 2022). The related idea of RL with AI feedback (Bai et al., 2022b) also learns a reward model from preferences, except that the preferences are labeled by another AI model instead of by humans.
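For concreteness, the MLE objective underlying these learning-to-rank and preference-based reward-learning works fits a neural reward model to pairwise preferences via the Bradley-Terry likelihood, $P(y_1 \succ y_2) = \sigma(r_\theta(y_1) - r_\theta(y_2))$. The sketch below is purely illustrative (PyTorch, toy feature inputs, random placeholder data; `RewardModel` and `preference_nll` are hypothetical names), not a reproduction of any cited implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy neural reward model r_theta: feature vector -> scalar reward.
    Stands in for the neural parameterizations used in the cited works."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_nll(reward_model: nn.Module,
                   preferred: torch.Tensor,
                   rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of observed preferences under the
    Bradley-Terry model: P(preferred > rejected) = sigmoid(r(pref) - r(rej))."""
    margin = reward_model(preferred) - reward_model(rejected)
    return -nn.functional.logsigmoid(margin).mean()

# One MLE gradient step on random stand-ins for labeled preference pairs.
model = RewardModel(input_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
preferred = torch.randn(32, 16)  # features of the preferred item in each pair
rejected = torch.randn(32, 16)   # features of the rejected item
loss = preference_nll(model, preferred, rejected)
loss.backward()
optimizer.step()
```

Whether the preference labels come from humans (Christiano et al., 2017a; Ouyang et al., 2022) or from another AI model (Bai et al., 2022b), the fitting step is this same maximum-likelihood objective; only the source of the labels differs.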