BRPO: Batch Residual Policy Optimization

Sohn, Sungryull; Chow, Yinlam; Ooi, Jayden; Nachum, Ofir; Lee, Honglak; Chi, Ed; Boutilier, Craig

doi:10.48550/arxiv.2002.05522

Cited by 2 publications

(3 citation statements)

References 0 publications


“…In this work, we demonstrate the benefit of using a state-dependent behavior regularization term. This draws the connection to other methods (Laroche, Trichelair, and Des Combes 2019; Sohn et al 2020), that bootstrap the learned policy with the behavior policy only when the current state-action pair's uncertainty is high, allowing the learned policy to differ from the behavior policy for the largest improvements. However, these methods measure the uncertainty by the visitation frequency of state-action pairs in the dataset, which is computationally expensive and nontrivial to apply in continuous control settings.…”

Section: Related Work (mentioning)

confidence: 86%

“…The second problem is that, even though we can estimate an accurate Q function, the behavior regularization term may be too restrictive, which will hinder the performance of the learned policy. An ideal behavior regularization term should be state-dependent (Sohn et al 2020). This will make the policy less conservative, exploit large policy changes at high confidence states without risking poor performance at low confidence states.…”

Section: Offline Reinforcement Learning (mentioning)

confidence: 99%
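The idea of a state-dependent regularization term in the statement above can be illustrated with a small sketch. This is not the method of either paper: the function name `regularized_objective`, the per-state `confidence` score in [0, 1], and the linear weighting `alpha(s) = max_weight * (1 - confidence(s))` are all hypothetical choices made only to show how a penalty can relax at high-confidence states.

```python
import numpy as np

def regularized_objective(q_values, divergence, confidence, max_weight=1.0):
    """Per-state objective Q(s) - alpha(s) * D(s).

    Hypothetical illustration: alpha(s) shrinks linearly as the
    confidence at state s grows, so the policy is penalized less
    for deviating from the behavior policy where data is reliable.
    """
    q_values = np.asarray(q_values, dtype=float)
    divergence = np.asarray(divergence, dtype=float)
    # Clip confidence into [0, 1] so the weight stays non-negative.
    confidence = np.clip(np.asarray(confidence, dtype=float), 0.0, 1.0)
    alpha = max_weight * (1.0 - confidence)  # state-dependent weight (assumed form)
    return q_values - alpha * divergence

# Two states with the same Q-value and divergence but different confidence.
scores = regularized_objective([1.0, 1.0], [0.5, 0.5], [1.0, 0.0])
```

With a state-independent weight both states would receive the same penalty; here only the low-confidence state is penalized.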


Offline Reinforcement Learning with Soft Behavior Regularization

Xu, Zhan, et al. 2021

Preprint

Most prior approaches to offline reinforcement learning (RL) utilize behavior regularization, typically augmenting existing off-policy actor critic algorithms with a penalty measuring divergence between the policy and the offline data. However, these approaches lack guaranteed performance improvement over the behavior policy. In this work, starting from the performance difference between the learned policy and the behavior policy, we derive a new policy learning objective that can be used in the offline setting, which corresponds to the advantage function value of the behavior policy multiplied by a state-marginal density ratio. We propose a practical way to compute the density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike state-independent regularization used in prior approaches, this soft regularization allows more freedom of policy deviation at high confidence states, leading to better performance and stability. We thus term our resulting algorithm Soft Behavior-regularized Actor Critic (SBAC). Our experimental results show that SBAC matches or outperforms the state-of-the-art on a set of continuous control locomotion and manipulation tasks.
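The abstract does not state the formula it starts from. As a rough guide, and assuming it refers to the standard performance difference lemma, the relationship can be written as below. The symbols are assumptions introduced here: $\pi$ is the learned policy, $\pi_b$ the behavior policy, $J$ the expected discounted return, $\gamma$ the discount factor, $d^{\pi}$ the normalized discounted state distribution, and $A^{\pi_b}$ the advantage function of the behavior policy. The second line requires $d^{\pi_b}(s) > 0$ wherever $d^{\pi}(s) > 0$.

```latex
\begin{align}
J(\pi) - J(\pi_b)
  &= \frac{1}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi}}\,
     \mathbb{E}_{a \sim \pi(\cdot \mid s)}
     \left[ A^{\pi_b}(s, a) \right] \\
  &= \frac{1}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi_b}}
     \left[ \frac{d^{\pi}(s)}{d^{\pi_b}(s)}\,
     \mathbb{E}_{a \sim \pi(\cdot \mid s)}
     \left[ A^{\pi_b}(s, a) \right] \right]
\end{align}
```

The second line moves the state expectation onto the behavior policy's distribution, which is what offline data can sample from, at the cost of the state-marginal density ratio $d^{\pi}(s)/d^{\pi_b}(s)$ mentioned in the abstract.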


“…This can lead the policy to exploit approximation errors of the dynamics model and be disastrous for learning. In model-free settings, similar data distribution shift problems are typically remedied by regularizing policy updates explicitly with a divergence from the observed data distribution [26,30,61], which, however, can overly limit policies' expressivity [57].…”

Section: Introduction (mentioning)

confidence: 99%

Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

Matsushima, Furuta, Matsuo, et al. 2020

Preprint

Self-citation

Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become prohibitive to update the data-collection policy more than a few times during learning. With this view, we propose a novel concept of deployment efficiency, measuring the number of distinct data-collection policies that are used during policy learning. We observe that naïvely applying existing model-free offline RL algorithms recursively does not lead to a practical deployment-efficient and sample-efficient algorithm. We propose a novel model-based algorithm, Behavior-Regularized Model-ENsemble (BREMEN), that can effectively optimize a policy offline using 10-20 times less data than prior works. Furthermore, the recursive application of BREMEN is able to achieve impressive deployment efficiency while maintaining the same or better sample efficiency, learning successful policies from scratch on simulated robotic environments with only 5-10 deployments, compared to typical values of hundreds to millions in standard RL baselines. Code and pre-trained models are available at https://github.com/matsuolab/BREMEN.
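The distinction the abstract draws between deployment efficiency and sample efficiency can be made concrete with a minimal sketch. The log format, a list of `(policy_id, num_transitions)` pairs, and the function name `efficiency_summary` are hypothetical and are not taken from the BREMEN code base.

```python
def efficiency_summary(collection_log):
    """Summarize a data-collection log.

    collection_log: iterable of (policy_id, num_transitions) pairs,
    one entry per batch of environment interaction (assumed format).
    Returns (deployments, samples): the number of distinct
    data-collection policies and the total transitions gathered.
    """
    policy_ids = set()
    samples = 0
    for policy_id, num_transitions in collection_log:
        policy_ids.add(policy_id)   # each distinct policy counts as one deployment
        samples += num_transitions  # every transition counts toward sample cost
    return len(policy_ids), samples

# Two batches from the same policy count as a single deployment.
log = [("pi_0", 1000), ("pi_0", 500), ("pi_1", 1000)]
deployments, samples = efficiency_summary(log)
```

Under this reading, an algorithm can use many samples while remaining deployment-efficient, as long as those samples come from few distinct policies.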
