2021
DOI: 10.48550/arxiv.2110.03375
Preprint

Learning Pessimism for Robust and Efficient Off-Policy Reinforcement Learning

Abstract: Popular off-policy deep reinforcement learning algorithms compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose a novel learnable penalty to enact such pessimism, based on a new way to quantify the critic's epistemic uncertainty. Furthermore, we propose to learn the penalty alongside the critic with dual TD-learning, a strategy to estimate and minimize the bias magnitude in the target returns. Our metho…
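The abstract describes a pessimistic TD target whose penalty is learned rather than fixed. The sketch below (PyTorch) is a rough illustration only: it penalizes the ensemble-mean bootstrap value by a learnable scalar `beta` times the ensemble standard deviation, used here as a stand-in for the critic's epistemic uncertainty. Both the uncertainty proxy and all names are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: pessimistic TD target with a learnable penalty `beta`.
# The ensemble standard deviation stands in for the paper's epistemic-uncertainty
# measure; it is not the exact quantity proposed by the authors.
import torch

def pessimistic_td_target(rewards, dones, next_q_ensemble, beta, gamma=0.99):
    """rewards, dones: (B,) float tensors; next_q_ensemble: (E, B); beta: learnable scalar tensor."""
    mean_q = next_q_ensemble.mean(dim=0)         # expected bootstrap value across the ensemble
    uncertainty = next_q_ensemble.std(dim=0)     # proxy for the critic's epistemic uncertainty
    pessimistic_q = mean_q - beta * uncertainty  # larger beta -> more pessimism
    return rewards + gamma * (1.0 - dones) * pessimistic_q
```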

Cited by 2 publications (7 citation statements) | References 21 publications
“…Moreover, we implement the critic's ensemble as a single neural network, using linear non-fully-connected layers evenly splitting the nodes and dropping the weight connections between the splits. Practically, when evaluated under the same hardware, this results in our algorithm running more than 2.4 times faster than the implementation from Chen et al (2021) while having a similar algorithmic complexity (see (Cetin and Celiktutan 2021)). We show that GPL significantly improves the performance and robustness of off-policy RL, concretely surpassing prior algorithms and setting new state-of-the-art results.…”
Section: Methods (mentioning)
confidence: 89%
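The passage above describes implementing the whole critic ensemble as one network whose linear layers are split evenly between members, with no weights crossing the splits. Below is a minimal sketch of such a layer, assuming the ensemble dimension is carried as the leading batch dimension; the layer shapes and initialization are assumptions, not the authors' implementation.

```python
# Minimal sketch of a "split" linear layer: each ensemble member owns its own
# weight slice and no connections cross the splits. Shapes and init are assumptions.
import torch
import torch.nn as nn

class SplitLinear(nn.Module):
    def __init__(self, ensemble_size: int, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(ensemble_size, in_features, out_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(ensemble_size, 1, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (ensemble_size, batch, in_features) -> (ensemble_size, batch, out_features);
        # one batched matrix multiply evaluates every ensemble member in a single pass.
        return torch.baddbmm(self.bias, x, self.weight)
```

Evaluating all members with a single batched matrix multiply, rather than looping over separate critic networks in Python, is the kind of design that would account for the reported speed-up on the same hardware.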
“…An intuition for our results is that the relative difference between the off-policy and on-policy action-value predictions should always push β to counteract new bias stemming from model errors in the policy gradient action maximization, and thus improve over non-adaptive methods which are also affected by initial bias. We further validate dual TD-learning in (Cetin and Celiktutan 2021), comparing with optimizing β by minimizing the squared norm of the bias and by using the bandit-based optimization from TOP (Moskovitz et al 2021). We also note that integrating GPL adds non-trivial complexity by introducing an entirely new optimization step which could be unnecessary for low-dimensional and easy-exploration problems.…”
Section: Dual TD-learning (mentioning)
confidence: 92%
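The statement above gives the intuition behind dual TD-learning: if off-policy value predictions drift above an on-policy estimate, the penalty β should grow to counteract the bias, and relax otherwise. The sketch below is a hedged illustration of such an update rule; the bias estimator, learning rate, and clamping are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a dual-TD-style penalty update. The bias estimate compares
# off-policy critic predictions against an on-policy return estimate; the update
# rule and hyperparameters are illustrative assumptions.
import torch

def update_penalty(beta: torch.Tensor, q_off_policy: torch.Tensor,
                   q_on_policy: torch.Tensor, lr: float = 1e-3) -> torch.Tensor:
    bias = (q_off_policy - q_on_policy).mean()   # > 0 indicates overestimation
    # Increase pessimism when overestimating, relax it when underestimating.
    return (beta + lr * bias).clamp(min=0.0)
```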
“…In contrast, our method relies directly on the bias estimation to adjust the control hyperparameter. GPL-SAC [28] proposed a new overestimation control technique based on uncertainty estimates together with an adaptation technique. The adaptation technique represents a particular case (k = 1) of our approach.…”
Section: Process of η Adaptation (mentioning)
confidence: 99%