2016
DOI: 10.1609/aaai.v30i1.10303

Increasing the Action Gap: New Operators for Reinforcement Learning

Abstract: This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space …
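The consistent Bellman operator described in the abstract can be sketched in a few lines of tabular code. The update below is a minimal illustration, assuming a finite MDP with a known reward matrix R and transition tensor P; the function name and array layout are illustrative, not from the paper.

    import numpy as np

    def consistent_bellman_update(Q, R, P, gamma):
        """One sweep of the consistent Bellman operator over a tabular Q-function.

        Q: (S, A) action values; R: (S, A) expected rewards r(x, a);
        P: (S, A, S) transition probabilities P(x' | x, a); gamma: discount factor.

        T_C Q(x, a) = r(x, a) + gamma * E_{x'}[ max_b Q(x', b)
                          - 1{x' = x} * (max_b Q(x, b) - Q(x, a)) ],
        i.e. the usual optimality backup minus a correction on self-transitions,
        which is what increases the action gap.
        """
        S, A = Q.shape
        V = Q.max(axis=1)                          # V(x) = max_b Q(x, b)
        T = R + gamma * (P @ V)                    # standard Bellman optimality backup
        p_self = P[np.arange(S), :, np.arange(S)]  # (S, A): P(x | x, a), self-transition prob.
        return T - gamma * p_self * (V[:, None] - Q)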

Cited by 167 publications (238 citation statements)
References 24 publications
“…In this context, action-gaps play an important role, because when action-gaps are small, accurately approximating action-values and recovering the highest-valued action can become intractable. Approximating the value function becomes easier if action-gaps are large (Bellemare et al. 2016b). However, Bellemare et al. present new Bellman operators to increase the action-gap.…”
Section: Discussionmentioning
confidence: 99%
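For reference, the action gap these statements refer to is the value difference at a state between the best action and the runner-up; in standard notation (a restatement for the reader, not text from the citing papers):

    % Action gap of Q* at state x (a* is the greedy action)
    \mathrm{gap}(x) \;=\; Q^{*}\!\bigl(x, a^{*}(x)\bigr) \;-\; \max_{a \neq a^{*}(x)} Q^{*}(x, a),
    \qquad a^{*}(x) \;=\; \operatorname*{arg\,max}_{a} Q^{*}(x, a).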
“…Subsequently we present bounds on the action-gaps of the value function. If action-gaps are large, value function approximation becomes easier (Bellemare et al. 2016b). However, if action-gaps collapse, then function approximation methods may not be able to recover the optimal action, because they lack the necessary "resolution" to distinguish the optimal action from sub-optimal actions.…”
Section: Introductionmentioning
confidence: 99%
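The "resolution" argument can be made concrete with a toy simulation (illustrative only, not from the cited work), assuming i.i.d. Gaussian estimation noise on the action values of a two-action state: the smaller the gap, the more often the greedy action is wrong.

    import numpy as np

    rng = np.random.default_rng(0)
    noise_std = 0.1  # assumed std. dev. of the value-estimation error

    def wrong_greedy_rate(gap, trials=100_000):
        """Fraction of trials in which noisy estimates pick the wrong greedy action
        in a two-action state whose true values differ by `gap`."""
        q_true = np.array([gap, 0.0])
        q_hat = q_true + rng.normal(0.0, noise_std, size=(trials, 2))
        return np.mean(q_hat.argmax(axis=1) != 0)

    for gap in (0.01, 0.1, 0.5):
        print(f"gap={gap:.2f}: wrong greedy action in {wrong_greedy_rate(gap):.1%} of trials")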
“…Many recent studies have shown that (deep) reinforcement learning (RL) algorithms can achieve great progress when making use of regularization, though they may be derived from different motivations, such as robust policy optimization (Schulman et al. 2015, 2017) or efficient exploration (Haarnoja et al. 2017, 2018a). According to the reformulation in , Advantage Learning (AL) (Bellemare et al. 2016) can also be viewed as a variant of the Bellman optimality operator imposed by an implicit Kullback-Leibler (KL) regularization between two consecutive policies. And this KL penalty can help to reduce the policy search space for stable and efficient optimization.…”
Section: Introductionmentioning
confidence: 99%
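For context, the Advantage Learning operator discussed above has the gap-increasing form below (standard notation for the operator; the KL-regularized reformulation itself is not reproduced here):

    % Advantage Learning (AL): subtract a fraction of the local advantage
    % from the standard Bellman optimality backup T, with alpha in [0, 1).
    (\mathcal{T}_{\mathrm{AL}} Q)(x, a)
      \;=\; (\mathcal{T} Q)(x, a) \;-\; \alpha \bigl[ \max_{b} Q(x, b) - Q(x, a) \bigr].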
“…Besides being transformed into an implicit KL-regularized update, this operator can directly increase the gap between the optimal and suboptimal actions, called the action gap. Bellemare et al. (2016) show that increasing this gap is beneficial; in particular, a large gap can mitigate the undesirable effects of estimation errors in the approximate value function.…”
Section: Introductionmentioning
confidence: 99%
“…Under the assumption of Gaussian rewards, in (Lee, Defourny, and Powell 2013) the authors propose a Q-learning variant that corrects the positive bias of ME by subtracting a term that depends on the number of actions and the variance of the rewards. Since the positive bias of the maximum operator increases when there are multiple actions with an expected value close to the maximum one, a modified Bellman operator that reduces the bias by increasing the action gap (that is, the difference between the best action value and the second best) was proposed in (Bellemare et al. 2016).…”
Section: Introductionmentioning
confidence: 99%
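The positive bias of the maximum over noisy estimates is easy to verify numerically; the sketch below is illustrative only (it is not the correction from Lee, Defourny, and Powell 2013), assuming Gaussian rewards and sample-mean value estimates for actions that all share the same true value.

    import numpy as np

    rng = np.random.default_rng(1)
    true_value, reward_std, n_samples, trials = 0.0, 1.0, 10, 50_000

    for n_actions in (2, 4, 8):
        # Each entry is the sample mean of n_samples Gaussian rewards, drawn directly
        # from its sampling distribution N(true_value, reward_std / sqrt(n_samples)).
        means = rng.normal(true_value, reward_std / np.sqrt(n_samples),
                           size=(trials, n_actions))
        bias = means.max(axis=1).mean() - true_value
        print(f"{n_actions} actions: E[max estimate] - max true value ≈ {bias:.3f}")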