2016
DOI: 10.1609/aaai.v30i1.10303

Increasing the Action Gap: New Operators for Reinforcement Learning

Abstract: This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space …
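The consistent Bellman operator described in the abstract can be sketched in a few lines of tabular code. The update below is a minimal illustration, assuming a finite MDP with a known reward matrix R and transition tensor P; the function name and array layout are illustrative, not from the paper.

    import numpy as np

    def consistent_bellman_update(Q, R, P, gamma):
        """One sweep of the consistent Bellman operator over a tabular Q-function.

        Q: (S, A) action values; R: (S, A) expected rewards r(x, a);
        P: (S, A, S) transition probabilities P(x' | x, a); gamma: discount factor.

        T_C Q(x, a) = r(x, a) + gamma * E_{x'}[ max_b Q(x', b)
                          - 1{x' = x} * (max_b Q(x, b) - Q(x, a)) ],
        i.e. the usual optimality backup minus a correction on self-transitions,
        which is what increases the action gap.
        """
        S, A = Q.shape
        V = Q.max(axis=1)                          # V(x) = max_b Q(x, b)
        T = R + gamma * (P @ V)                    # standard Bellman optimality backup
        p_self = P[np.arange(S), :, np.arange(S)]  # (S, A): P(x | x, a), self-transition prob.
        return T - gamma * p_self * (V[:, None] - Q)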

Cited by 167 publications (238 citation statements)
References 24 publications
“…In this context, action-gaps play an important role, because when action-gaps are small, accurately approximating action-values and recovering the highest-valued action can become intractable. Approximating the value function becomes easier if action-gaps are large (Bellemare et al. 2016b). However, Bellemare et al. present new Bellman operators to increase the action-gap.…”
Section: Discussionmentioning
confidence: 99%
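For reference, the action gap these statements refer to is the value difference at a state between the best action and the runner-up; in standard notation (a restatement for the reader, not text from the citing papers):

    % Action gap of Q* at state x (a* is the greedy action)
    \mathrm{gap}(x) \;=\; Q^{*}\!\bigl(x, a^{*}(x)\bigr) \;-\; \max_{a \neq a^{*}(x)} Q^{*}(x, a),
    \qquad a^{*}(x) \;=\; \operatorname*{arg\,max}_{a} Q^{*}(x, a).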
“…Subsequently we present bounds on the action-gaps of the value function. If action-gaps are large, value function approximation becomes easier (Bellemare et al. 2016b). However, if action-gaps collapse, then function approximation methods may not be able to recover the optimal action, because they lack the necessary "resolution" to distinguish the optimal action from sub-optimal actions.…”
Section: Introductionmentioning
confidence: 99%
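The "resolution" argument can be made concrete with a toy simulation (illustrative only, not from the cited work), assuming i.i.d. Gaussian estimation noise on the action values of a two-action state: the smaller the gap, the more often the greedy action is wrong.

    import numpy as np

    rng = np.random.default_rng(0)
    noise_std = 0.1  # assumed std. dev. of the value-estimation error

    def wrong_greedy_rate(gap, trials=100_000):
        """Fraction of trials in which noisy estimates pick the wrong greedy action
        in a two-action state whose true values differ by `gap`."""
        q_true = np.array([gap, 0.0])
        q_hat = q_true + rng.normal(0.0, noise_std, size=(trials, 2))
        return np.mean(q_hat.argmax(axis=1) != 0)

    for gap in (0.01, 0.1, 0.5):
        print(f"gap={gap:.2f}: wrong greedy action in {wrong_greedy_rate(gap):.1%} of trials")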
“…Many recent studies have shown that (deep) reinforcement learning (RL) algorithms can achieve great progress when making use of regularization, though they may be derived from different motivations, such as robust policy optimization (Schulman et al. 2015, 2017) or efficient exploration (Haarnoja et al. 2017, 2018a). According to the reformulation in , Advantage Learning (AL) (Bellemare et al. 2016) can also be viewed as a variant of the Bellman optimality operator imposed by an implicit Kullback-Leibler (KL) regularization between two consecutive policies. And this KL penalty can help to reduce the policy search space for stable and efficient optimization.…”
Section: Introductionmentioning
confidence: 99%
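For context, the Advantage Learning operator discussed above has the gap-increasing form below (standard notation for the operator; the KL-regularized reformulation itself is not reproduced here):

    % Advantage Learning (AL): subtract a fraction of the local advantage
    % from the standard Bellman optimality backup T, with alpha in [0, 1).
    (\mathcal{T}_{\mathrm{AL}} Q)(x, a)
      \;=\; (\mathcal{T} Q)(x, a) \;-\; \alpha \bigl[ \max_{b} Q(x, b) - Q(x, a) \bigr].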
“…Besides being transformed into an implicit KL-regularized update, this operator can directly increase the gap between the optimal and suboptimal actions, called the action gap. Bellemare et al. (2016) show that increasing this gap is beneficial; in particular, a large gap can mitigate the undesirable effects of estimation errors in the approximate value function.…”
Section: Introductionmentioning
confidence: 99%
“…Under the assumption of Gaussian rewards, in (Lee, Defourny, and Powell 2013) the authors propose a Q-learning variant that corrects the positive bias of ME by subtracting a term that depends on the number of actions and the variance of the rewards. Since the positive bias of the maximum operator increases when there are multiple actions with an expected value close to the maximum one, a modified Bellman operator that reduces the bias by increasing the action gap (that is, the difference between the best action value and the second best) was proposed in (Bellemare et al. 2016).…”
Section: Introductionmentioning
confidence: 99%
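The positive bias of the maximum over noisy estimates is easy to verify numerically; the sketch below is illustrative only (it is not the correction from Lee, Defourny, and Powell 2013), assuming Gaussian rewards and sample-mean value estimates for actions that all share the same true value.

    import numpy as np

    rng = np.random.default_rng(1)
    true_value, reward_std, n_samples, trials = 0.0, 1.0, 10, 50_000

    for n_actions in (2, 4, 8):
        # Each entry is the sample mean of n_samples Gaussian rewards, drawn directly
        # from its sampling distribution N(true_value, reward_std / sqrt(n_samples)).
        means = rng.normal(true_value, reward_std / np.sqrt(n_samples),
                           size=(trials, n_actions))
        bias = means.max(axis=1).mean() - true_value
        print(f"{n_actions} actions: E[max estimate] - max true value ≈ {bias:.3f}")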