2022
DOI: 10.1609/aaai.v36i7.20783

Offline Reinforcement Learning as Anti-exploration

Abstract: Offline Reinforcement Learning (RL) aims at learning an optimal control policy from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it as in exploration.
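As a rough illustration of the bonus-subtraction idea described in the abstract, the sketch below uses an RND-style prediction error as the bonus and subtracts it from the reward. This is a minimal sketch under assumptions: the paper's actual bonus model may differ, and the names (`AntiExplorationBonus`, `bonus_scale`, network sizes) are illustrative, not the authors' implementation.

```python
# Hedged sketch: anti-exploration via bonus subtraction.
# Assumption: the prediction-based bonus is an RND-style error between a fixed
# random target network and a predictor trained on the offline dataset.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class AntiExplorationBonus(nn.Module):
    """Prediction-based bonus b(s, a): large on state-action pairs that are
    poorly predicted, i.e. unlike the offline dataset."""
    def __init__(self, obs_dim, act_dim, emb_dim=32):
        super().__init__()
        self.target = mlp(obs_dim + act_dim, emb_dim)      # fixed, random
        self.predictor = mlp(obs_dim + act_dim, emb_dim)   # trained on the dataset
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return ((self.predictor(x) - self.target(x)) ** 2).mean(dim=-1)

def penalized_reward(reward, bonus, bonus_scale=1.0):
    # Exploration would *add* the bonus; offline RL subtracts it so that
    # out-of-distribution actions look unattractive to the learned policy.
    return reward - bonus_scale * bonus
```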

Cited by 12 publications (12 citation statements). References 37 publications.
“…We describe a theorem that answers the first question: soft policy iteration [17] with a penalized value function Q is equivalent to policy iteration regularized by KL(π(·|s) ‖ π_p(·|s)), where π_p(a|s) = softmax(−p(s, a)). This theorem is a generalized version of the theorem shown in [39], which does not require unnecessary assumptions on the penalty function.…”
Section: Theoretic Background on Direct Q-Penalization
confidence: 94%
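The quoted equivalence can be checked numerically in the finite-action case: the soft-greedy policy for the penalized value Q − p coincides with the maximizer of E_π[Q] − KL(π ‖ π_p) when π_p = softmax(−p). The snippet below is a minimal sanity check at temperature 1, assuming discrete actions and randomly drawn Q and p; it illustrates the statement, it does not prove the theorem.

```python
# Hedged numerical check (discrete actions, temperature 1) of the equivalence:
# the soft-greedy policy w.r.t. the penalized value Q - p equals the maximizer
# of E_pi[Q] - KL(pi || pi_p) with pi_p = softmax(-p).
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
q = rng.normal(size=5)   # Q(s, .) for one state, 5 actions
p = rng.uniform(size=5)  # penalty p(s, .)

# Soft policy improvement on the penalized value function.
pi_penalized = softmax(q - p)

# KL-regularized improvement toward the prior pi_p(a|s) = softmax(-p(s, a)):
# argmax_pi E_pi[Q] - KL(pi || pi_p) has the closed form softmax(Q + log pi_p).
pi_p = softmax(-p)
pi_kl = softmax(q + np.log(pi_p))

assert np.allclose(pi_penalized, pi_kl)
print(pi_penalized)
```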
“…One straightforward solution to the over-estimation problem is directly penalizing the Q estimate [39, 6] with a penalty function p(s, a):…”
Section: Methods
confidence: 99%
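One plausible reading of "directly penalizing the Q estimate" is to subtract the penalty from the bootstrapped value inside the TD target, so that poorly covered actions receive pessimistic values. The sketch below assumes that placement; the function name `penalized_td_target` and the weight `alpha` are illustrative, and other methods instead penalize the reward or the current-state Q value.

```python
# Hedged sketch of direct Q-penalization: the bootstrapped value in the TD
# target is reduced by the penalty p(s', a'). The exact placement of the
# penalty (target vs. reward) varies across methods; this is one assumed variant.
import torch

def penalized_td_target(reward, done, next_q, next_penalty,
                        gamma=0.99, alpha=1.0):
    """reward, done, next_q, next_penalty: tensors of shape [batch]."""
    pessimistic_next_value = next_q - alpha * next_penalty
    return reward + gamma * (1.0 - done) * pessimistic_next_value

# Example usage with dummy tensors:
r = torch.tensor([1.0, 0.5])
d = torch.tensor([0.0, 1.0])
q_next = torch.tensor([2.0, 3.0])
p_next = torch.tensor([0.1, 2.0])   # large penalty => heavily discounted value
print(penalized_td_target(r, d, q_next, p_next))
```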