2018
DOI: 10.48550/arxiv.1804.03334
Preprint

TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent

Abstract: In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning. The performance of TD methods often depends on well chosen step-sizes, yet few algorithms have been developed for setting the step-size automatically for TD learning. An important limitation of current methods is that they adapt a single step-size shared by all the weights of the learning system. A vector step-size enables greater optimization by specifying parameters on a per-feature basis. Furthermore, adap…
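The abstract describes adapting a vector of step-sizes, one per feature, by stochastic meta-descent on top of TD learning. Below is a minimal NumPy sketch of such an IDBD-style per-feature update for linear TD(λ); the function name, variable names, and the exact form of the meta-trace update are illustrative assumptions rather than the paper's verbatim algorithm.

```python
import numpy as np

def tidbd_lambda_step(w, z, h, beta, x, x_next, reward, gamma, lam, theta):
    """One linear TD(lambda) update with IDBD-style per-feature step-sizes (sketch).

    w    : weight vector
    z    : eligibility trace
    h    : meta-trace of recent weight updates
    beta : per-feature log step-size (alpha_i = exp(beta_i))
    theta: meta-step-size
    """
    delta = reward + gamma * np.dot(w, x_next) - np.dot(w, x)   # TD error
    beta += theta * delta * x * h                                # meta-descent on log step-sizes
    alpha = np.exp(beta)                                         # per-feature step-sizes
    z = gamma * lam * z + x                                      # accumulating eligibility trace
    w += alpha * delta * z                                       # TD weight update
    # meta-trace: decays toward zero, accumulates correlated updates (assumed form)
    h = h * np.clip(1.0 - alpha * x * z, 0.0, None) + alpha * delta * z
    return w, z, h, beta
```

The intuition behind this kind of meta-descent is that a feature's step-size grows when successive weight updates for that feature are correlated and shrinks when they oscillate.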

Cited by 3 publications (5 citation statements) | References 6 publications
“…The normalization of NOSID seems to make the method more robust to initial α values that are too high, as well as maintaining performance across a wider range of µ values. On the other hand, it does not shift the useful range of µ values up as much, again potentially indicating more variation in optimal µ value across problems. Additionally, the normalization procedure of NOSID makes use of a hard maximum rather than a running maximum in normalizing the β update, which is not well suited for non-stationary state representations where the appropriate normalization may significantly change over time.…”
Section: Mountain Car
confidence: 96%
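The distinction the citing authors draw between a hard maximum and a running maximum is easy to see in code. The sketch below is a generic illustration, not NOSID's actual procedure: a hard maximum only ever grows, so an early spike fixes the normalizer permanently, whereas a decaying running maximum tracks the recent scale and can recover under non-stationarity. The decay constant is a hypothetical choice.

```python
import numpy as np

def normalize_hard_max(update, state):
    """Normalize by the largest magnitude ever observed (hard maximum)."""
    state["max"] = max(state.get("max", 1e-8), np.abs(update).max())
    return update / state["max"]

def normalize_running_max(update, state, decay=0.99):
    """Normalize by a decaying running maximum that can shrink after a spike."""
    prev = state.get("max", 1e-8)
    state["max"] = max(decay * prev, np.abs(update).max())
    return update / state["max"]
```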
“…In addition, we extend the approach to AC(λ) and to vector-valued step-sizes, as well as a "mixed" version which utilizes a combination of scalar and vector step-sizes. Also related are TIDBD and its extension AutoTIDBD [5,6], to our knowledge the only prior work to investigate learning of vector step-sizes for RL. The authors focus on TD(λ) for prediction, and explore both vector and scalar step-sizes.…”
Section: Related Work
confidence: 99%
“…We applied AdaGain, AMSGrad, RMSprop, SMD, and TIDBD (Kearney et al. 2018), a recent extension of the IDBD algorithm, to adapt the step-sizes of linear TD(λ) on Baird's counterexample. As before, the meta-parameters were extensively swept and the best performing parameters were used to generate the results for comparison.…”
Section: Experiments in Synthetic Tasks
confidence: 99%
“…Meta-descent applied to the step-size was first introduced for online least-mean squares methods (Jacobs 1988; Sutton 1992b; 1992a; Almeida et al. 1998; Mahmood et al. 2012), including the linear complexity method IDBD (Sutton 1992b). IDBD was later extended to more general losses (Schraudolph 1999) and to support (semi-gradient) temporal difference methods (Dabney and Barto 2012; Dabney 2014; Kearney et al. 2018). These methods are well-suited to non-stationary problems, and have been shown to ignore irrelevant features.…”
Section: Introduction
confidence: 99%
“…Step-size schedules leading to optimal convergence rates and robust guarantees are well-known in the stochastic optimization literature (Nemirovski et al., 2009; Ghadimi & Lan, 2013), but they often depend on problem-dependent quantities that are unavailable to the practitioner. Meta-descent methods such as Stochastic Meta-descent (Schraudolph, 1999) and TIDBD (Kearney et al., 2018) instead learn a step-size according to some meta-objective, thereby obviating the need to tune one. Yet these methods inevitably introduce new parameters to be tuned, such as a meta-step-size, a decay parameter, etc., and can be quite sensitive to these new parameters in practice (Jacobsen et al., 2019).…”
Section: Introduction
confidence: 99%