2020
DOI: 10.48550/arxiv.2002.05522
Preprint

BRPO: Batch Residual Policy Optimization

Cited by 2 publications (3 citation statements)
References 0 publications
“…In this work, we demonstrate the benefit of using a state-dependent behavior regularization term. This draws a connection to other methods (Laroche, Trichelair, and Des Combes 2019; Sohn et al. 2020) that bootstrap the learned policy with the behavior policy only when the current state-action pair's uncertainty is high, allowing the learned policy to differ from the behavior policy where the largest improvements are possible. However, these methods measure uncertainty by the visitation frequency of state-action pairs in the dataset, which is computationally expensive and nontrivial to apply in continuous control settings.…”
Section: Related Work (mentioning)
Confidence: 86%
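
The count-based bootstrapping idea described in this excerpt can be summarized in a short sketch: act with the learned policy only where the dataset provides enough evidence for the proposed state-action pair, and fall back to the behavior policy elsewhere. This is a minimal illustration assuming discrete states and actions; `build_visit_counts`, `gated_action`, `pi_learned`, `pi_behavior`, and `min_count` are hypothetical names, not taken from the cited papers.

```python
# Sketch of count-based uncertainty gating between a learned policy and the
# behavior policy (all names here are hypothetical, for illustration only).
from collections import Counter

def build_visit_counts(dataset):
    """Count (state, action) visitation frequencies in an offline dataset."""
    counts = Counter()
    for state, action, _reward, _next_state in dataset:
        counts[(state, action)] += 1
    return counts

def gated_action(state, pi_learned, pi_behavior, counts, min_count=10):
    """Use the learned policy only when its proposed action is well supported by the data."""
    proposed = pi_learned(state)
    if counts[(state, proposed)] >= min_count:  # low uncertainty: trust the learned policy
        return proposed
    return pi_behavior(state)                   # high uncertainty: bootstrap with the behavior policy
```

The difficulty the excerpt raises for continuous control shows up directly in this sketch: exact (state, action) counts are unavailable in continuous spaces, so the uncertainty signal would have to come from something like a learned density model or ensemble disagreement instead.
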
“…The second problem is that, even though we can estimate an accurate Q-function, the behavior regularization term may be too restrictive, which hinders the performance of the learned policy. An ideal behavior regularization term should be state-dependent (Sohn et al. 2020). This makes the policy less conservative, exploiting large policy changes at high-confidence states without risking poor performance at low-confidence states.…”
Section: Offline Reinforcement Learning (mentioning)
Confidence: 99%
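
A state-dependent regularization term of the kind this excerpt advocates can be sketched as a per-state weight on a divergence penalty inside the actor loss. The following is only an illustrative sketch assuming a PyTorch-style setup; `policy`, `behavior_policy`, `q_net`, and `weight_net` are hypothetical modules, not the architecture of BRPO or any cited method.

```python
# Sketch of a state-dependent behavior-regularization term in an actor loss.
# A per-state weight w(s) relaxes the constraint at high-confidence states and
# tightens it at low-confidence ones, instead of using one global coefficient.
import torch

def actor_loss(states, policy, behavior_policy, q_net, weight_net):
    actions, log_pi = policy.sample(states)          # reparameterized actions and their log-probs
    q_values = q_net(states, actions)

    log_beta = behavior_policy.log_prob(states, actions)
    divergence = log_pi - log_beta                   # single-sample estimate of KL(pi || beta) at each state

    w = weight_net(states).squeeze(-1)               # state-dependent weight w(s) >= 0
    return (-q_values + w * divergence).mean()       # maximize Q, penalize deviation where confidence is low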
“…This can lead the policy to exploit approximation errors of the dynamics model and be disastrous for learning. In model-free settings, similar data distribution shift problems are typically remedied by regularizing policy updates explicitly with a divergence from the observed data distribution [26,30,61], which, however, can overly limit policies' expressivity [57].…”
Section: Introduction (mentioning)
Confidence: 99%
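
For the explicit divergence regularization this excerpt refers to, a common model-free recipe is to add a behavior-cloning-style penalty to the actor update so the policy stays close to the observed data distribution. This is a hedged sketch rather than any specific cited method; `policy`, `q_net`, and the fixed coefficient `alpha` are assumptions made for illustration.

```python
# Sketch of divergence-regularized policy optimization against an offline dataset.
# The MSE term acts as a simple divergence surrogate toward the observed actions.
import torch
import torch.nn.functional as F

def regularized_policy_loss(batch_states, batch_actions, policy, q_net, alpha=2.5):
    policy_actions = policy(batch_states)             # deterministic actions from the learned policy
    q_values = q_net(batch_states, policy_actions)

    # Penalize deviation from the actions actually observed in the dataset.
    bc_term = F.mse_loss(policy_actions, batch_actions)

    # Maximize Q while staying close to the data distribution.
    return -q_values.mean() + alpha * bc_term
```

A single global `alpha` applies the same constraint at every state, which is exactly the kind of restrictiveness the state-dependent schemes in the earlier excerpts aim to avoid.
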