2021
DOI: 10.48550/arxiv.2106.06431
Preprint

Offline Reinforcement Learning as Anti-Exploration

Abstract: Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward, instead of add…
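To make the core idea concrete, here is a minimal, hedged sketch (not the paper's actual implementation): a novelty bonus b(s, a) is estimated for each candidate state-action pair and subtracted from the reward, so that actions poorly covered by the dataset look less attractive to any standard off-policy learner trained on the penalized rewards. The paper uses a learned prediction model for the bonus; the sketch below substitutes a nearest-neighbour distance purely for brevity, and names such as `knn_bonus` and `alpha` are illustrative assumptions.

```python
import numpy as np

def anti_exploration_reward(rewards, bonuses, alpha=0.5):
    """Subtract (rather than add) a prediction/novelty bonus from the reward."""
    return rewards - alpha * bonuses

def knn_bonus(dataset_sa, query_sa, k=1):
    """Stand-in novelty estimate: mean distance of each query (state, action)
    pair to its k nearest neighbours among the logged (state, action) pairs."""
    dists = np.linalg.norm(dataset_sa[None, :, :] - query_sa[:, None, :], axis=-1)
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)

# Toy usage with random data standing in for a logged dataset.
rng = np.random.default_rng(0)
dataset_sa = rng.normal(size=(256, 6))   # logged (state, action) pairs
query_sa = rng.normal(size=(8, 6))       # candidate (state, action) pairs
rewards = rng.normal(size=8)
print(anti_exploration_reward(rewards, knn_bonus(dataset_sa, query_sa)))
```

In this sketch the pessimism lives entirely in the reward signal, so any standard off-policy critic or actor-critic update can be applied to the penalized rewards unchanged.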

Cited by 4 publications (4 citation statements)
References 25 publications
“…Recently, there have been many offline RL methods proposed to implement such a penalty, by introducing support constraints [Fujimoto, Meger, and Precup 2019, Ghasemipour, Schuurmans, and Gu 2021] or policy regularization [Kostrikov, Nair, and Levine 2021, Fujimoto and Gu 2021, Peng et al 2019, Wu, Tucker, and Nachum 2019, Kumar et al 2019a]. Alternatively, some offline RL methods introduce uncertainty estimation [Rezaeifar et al 2021, Wu et al 2021, Ma, Jayaraman, and Bastani 2021, Bai et al 2022] or value conservation [Kumar et al 2020, Lyu et al 2022] to overcome the potential overestimation for OOD state-actions. In the same spirit, model-based offline RL methods similarly employ distribution-correcting regularization [Hishinuma and Senda 2021], uncertainty estimation [Kidambi et al 2020], and value conservation [Yu et al 2021] to eliminate the OOD issues.…”
Section: Related Work (mentioning)
confidence: 99%
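As a complement to the quoted overview, here is a minimal sketch of the uncertainty/conservatism family it describes, under the assumption that an ensemble of Q-estimates is available and its disagreement serves as the uncertainty term (the names `q_ensemble_next` and `beta` are illustrative, not taken from any cited method):

```python
import numpy as np

def pessimistic_td_target(r, gamma, q_ensemble_next, beta=1.0, done=False):
    """TD target penalized by ensemble disagreement at the next state-action.

    q_ensemble_next: shape (n_ensemble,), each member's estimate of Q(s', a')
    for the action the current policy proposes at s'.
    """
    q_next = q_ensemble_next.mean() - beta * q_ensemble_next.std()
    return r + (0.0 if done else gamma) * q_next

# Wide disagreement (a likely OOD action) yields a markedly lower target.
print(pessimistic_td_target(1.0, 0.99, np.array([5.0, 1.0, 3.0])))  # heavily penalized
print(pessimistic_td_target(1.0, 0.99, np.array([3.1, 2.9, 3.0])))  # close to the mean
```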
“…Previous model-free offline RL algorithms typically rely on policy constraints to restrict the learned policy from producing the OOD actions. In particular, previous works add behavior cloning (BC) loss in policy training (Fujimoto et al., 2019; Fujimoto & Gu, 2021; Ghasemipour et al., 2021), measure the divergence between the behavior policy and the learned policy (Kumar et al., 2019; Kostrikov et al., 2021), apply advantage-weighted constraints to balance BC and advantages (Siegel et al., 2020; Wang et al., 2020b), penalize the prediction-error of a variational auto-encoder (Rezaeifar et al., 2021), and learn latent actions (or primitives) from the offline data (Zhou et al., 2020; Ajay et al., 2021). Nevertheless, such methods may cause overly conservative value functions and are easily affected by the behavior policy (Nair et al., 2020; Lee et al., 2021b).…”
Section: Related Work (mentioning)
confidence: 99%
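For the behavior-cloning-style constraint mentioned in this statement (in the spirit of Fujimoto & Gu, 2021), a minimal sketch of the policy objective follows: a value term is traded off against the squared deviation from the logged action. The weighting `lmbda` and the function name are illustrative choices, not the cited papers' exact formulations.

```python
import numpy as np

def constrained_policy_loss(q_values, policy_actions, dataset_actions, lmbda=2.5):
    """Loss to minimize: maximize Q while staying close to the dataset actions."""
    bc_term = np.mean(np.sum((policy_actions - dataset_actions) ** 2, axis=-1))
    return -lmbda * np.mean(q_values) + bc_term

# Toy usage on a batch of three transitions with 2-D actions.
q_values = np.array([1.2, 0.7, 1.5])
policy_actions = np.array([[0.1, -0.2], [0.0, 0.3], [0.5, 0.5]])
dataset_actions = np.array([[0.2, -0.1], [0.0, 0.2], [0.4, 0.6]])
print(constrained_policy_loss(q_values, policy_actions, dataset_actions))
```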
“…Our work follows the latter line (also known as the principle of pessimism), which has garnered significant attention recently. In fact, pessimism has been incorporated into recent development of various offline RL approaches, such as policy-based approaches (Rezaeifar et al., 2021; Xie et al., 2021a; Zanette et al., 2021), model-based approaches (Jin et al., 2021; Kidambi et al., 2020; Rashidinejad et al., 2021; Xie et al., 2021b; Yin and Wang, 2021; Yin et al., 2022; Yu et al., 2020, 2021b), and model-free approaches (Kumar et al., 2020; Yan et al., 2022; Yu et al., 2021a).…”
Section: Related Work (mentioning)
confidence: 99%
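The pessimism principle referenced here is often realized as acting on a lower confidence bound of the value estimate; a brief, hedged sketch (with illustrative names `q_mean`, `q_std`, `beta`) is given below:

```python
import numpy as np

def lcb_action_value(q_mean, q_std, beta=1.0):
    """Lower-confidence-bound value used for pessimistic action selection."""
    return q_mean - beta * q_std

# A high-mean but highly uncertain action loses to a slightly worse,
# well-supported one once the uncertainty penalty is applied.
q_mean = np.array([2.0, 1.8])
q_std = np.array([1.5, 0.1])
print(np.argmax(lcb_action_value(q_mean, q_std)))  # -> 1
```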