2022
DOI: 10.48550/arxiv.2202.11566
Preprint
Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

Abstract: Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by out-of-distribution (OOD) actions. Previous methods tackle this problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyo…

Cited by 9 publications (19 citation statements)
References 6 publications
“…Pessimism in batch RL serves as the main tool for conservative policy optimization: it quantifies the uncertainty of the estimation and discourages the learned policy from visiting less-explored state-action pairs in the batch data. The success of pessimism-type algorithms has been demonstrated in many applications (e.g., Kumar et al, 2019; Bai et al, 2022). In terms of data coverage, instead of the full-coverage assumption (i.e., d^{π_b}_T is uniformly bounded away from 0) required by many classic RL algorithms, algorithms with a proper degree of pessimism only require that the in-class optimal policy π^* is covered by the behavior policy, which is thus more desirable.…”
Section: Policy Learning in Batch RL
confidence: 99%
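The coverage contrast drawn in this statement can be written out explicitly. The following is a minimal formalization under standard occupancy-measure notation; it is an illustration rather than a condition quoted from the cited paper, and the constants c and C are assumptions for the example (d^{π}_T denotes the discounted state-action occupancy of policy π under transition kernel T, π_b the behavior policy, π^* the in-class optimal policy):

% Full coverage, as assumed by many classic batch-RL algorithms:
\forall (s,a): \quad d^{\pi_b}_{T}(s,a) \ge c > 0

% Partial (single-policy) coverage, sufficient under a proper degree of pessimism:
\sup_{(s,a)} \frac{d^{\pi^*}_{T}(s,a)}{d^{\pi_b}_{T}(s,a)} \le C < \infty

The second condition only constrains the state-action pairs that the in-class optimal policy actually visits, which is why pessimistic algorithms can succeed on datasets that explore only a small part of the state-action space.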
“…There is a growing number of results under partial coverage following the principle of pessimism in offline RL (Yu et al, 2020;Kidambi et al, 2020). In comparison to works that focus on tabular (Rashidinejad et al, 2021;Shi et al, 2022;Yin and Wang, 2021) or linear models (Jin et al, 2020;Chang et al, 2021;Zhang et al, 2022;Nguyen-Tang et al, 2022;Bai et al, 2022), our emphasis is on general function approximation (Jiang and Huang, 2020;Uehara and Sun, 2022;Xie et al, 2021;Zhan et al, 2022;Rashidinejad et al, 2022;Zanette and Wainwright, 2022). Among them, we specifically focus on model-free methods.…”
Section: Related Work
confidence: 99%
“…To summarize, the algorithm alternates between collecting on-policy data samples in AMG and updating the function approximators. In detail, the latter procedure includes updating the Q-value with (11), updating the optimized policy with (14), and updating the target Q-value as well as the reference policy with the moving-average rule. The complete algorithm is listed in Appendix D.…”
Section: Offline Reinforcement Learning with Pessimism-Modulated Dyna...
confidence: 99%
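The alternation described in this statement can be sketched at the level of control flow. The concrete updates referred to as (11) and (14) are not reproduced on this page, so the Python outline below only shows the assumed structure: collect on-policy samples, take a critic step and a policy step, then move the target Q-value and the reference policy with a Polyak (moving-average) rule. The names collect_rollout, critic_loss_fn, and policy_loss_fn, and the coefficient tau, are placeholders for illustration.

import copy
import torch

def polyak_update(online, target, tau=0.005):
    # Moving-average rule: target <- (1 - tau) * target + tau * online.
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

def train_loop(critic, policy, critic_loss_fn, policy_loss_fn,
               collect_rollout, optim_q, optim_pi, n_iters=1000):
    target_critic = copy.deepcopy(critic)      # target Q-value
    reference_policy = copy.deepcopy(policy)   # reference policy
    for _ in range(n_iters):
        batch = collect_rollout(policy)        # on-policy samples (the AMG rollouts in the quoted paper)

        optim_q.zero_grad()
        critic_loss_fn(critic, target_critic, batch).backward()            # stands in for Eq. (11)
        optim_q.step()

        optim_pi.zero_grad()
        policy_loss_fn(policy, reference_policy, critic, batch).backward() # stands in for Eq. (14)
        optim_pi.step()

        polyak_update(critic, target_critic)       # moving-average target-Q update
        polyak_update(policy, reference_policy)    # moving-average reference-policy update
    return policy

The moving-average rule keeps the target Q-value and the reference policy slowly tracking their online counterparts, which is the standard way to stabilize this kind of alternating update.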
“…Learning within the data manifold limits the degree to which the policy can improve, and recent works attempt to relax this restriction. Along the model-free line, EDAC [13] and PBRL [14] quantify the uncertainty of the Q-value via a neural-network ensemble and penalize the Q-value according to the degree of uncertainty. In this way, OOD state-action pairs remain reachable if they pose low uncertainty on the Q-value.…”
Section: Related Work
confidence: 99%
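To make the mechanism described in this statement concrete, here is a minimal sketch of ensemble-based uncertainty penalization in the spirit of PBRL/EDAC. It is an illustration, not the authors' implementation: the ensemble size, network widths, the penalty coefficient beta, and the use of the ensemble standard deviation as the uncertainty measure are assumptions for the example.

import torch
import torch.nn as nn

class QEnsemble(nn.Module):
    # K independent Q-networks; their disagreement serves as an uncertainty proxy.
    def __init__(self, state_dim, action_dim, k=5, hidden=256):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(k)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        return torch.stack([m(x) for m in self.members], dim=0)  # (k, batch, 1)

def pessimistic_target(q_ensemble, reward, next_state, next_action, gamma=0.99, beta=1.0):
    # Bellman target with an uncertainty penalty: high ensemble disagreement
    # (typical for OOD actions) lowers the backup value.
    with torch.no_grad():
        qs = q_ensemble(next_state, next_action)   # (k, batch, 1)
        q_mean = qs.mean(dim=0)
        q_std = qs.std(dim=0)                      # epistemic-uncertainty proxy
        return reward + gamma * (q_mean - beta * q_std)

Because the penalty scales with the ensemble's disagreement rather than with distance to the dataset, state-action pairs outside the data manifold are still usable whenever the ensemble agrees on their value, which is exactly the property the quoted statement highlights.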