Proceedings of the 26th Annual International Conference on Machine Learning 2009
DOI: 10.1145/1553374.1553501
Fast gradient-descent methods for temporal-difference learning with linear function approximation

Abstract: Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce…
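For concreteness, the two gradient-TD algorithms introduced in this paper are commonly known as GTD2 and TDC (linear TD with gradient correction). The NumPy sketch below shows their on-policy per-step updates as commonly stated; function and variable names are my own, and the off-policy form additionally weights each update by an importance-sampling ratio, so treat this as a sketch rather than the paper's exact pseudocode.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One on-policy GTD2 update; every operation is O(n) in the feature size."""
    delta = reward + gamma * phi_next @ theta - phi @ theta   # TD error
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

def tdc_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One on-policy TDC update (TD with gradient correction); also O(n)."""
    delta = reward + gamma * phi_next @ theta - phi @ theta
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```

Here phi and phi_next are the feature vectors of successive states, theta is the primary weight vector, and w is a secondary weight vector that tracks an estimate of E[φφᵀ]⁻¹E[δφ]; maintaining w incrementally is what keeps the per-step cost linear.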

Cited by 621 publications (1,004 citation statements). References 17 publications.
“…Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan, 1992; Moore and Atkeson, 1993; Schwartz, 1993; Rummery and Niranjan, 1994; Singh, 1994; Baird, 1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy, 1996; Bradtke et al., 1996; Santamaría et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto, 1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010; van Hasselt, 2012). Most are formulated in a probabilistic framework, and evaluate pairs of input and output (action) events (instead of input events only).…”
Section: Deep FNNs for Traditional RL and Markov Decision Processes (MDPs) (mentioning)
confidence: 99%
“…6.3) into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009). … policy gradient methods (Williams, 1986, 1988, 1992a; Sutton et al., 1999a; Baxter and Bartlett, 2001; Aberdeen, 2003; Ghavamzadeh and Mahadevan, 2003; Kohl and Stone, 2004; Wierstra et al., 2008; Rückstieß et al., 2008; Peters and Schaal, 2008b,a; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Grondman et al., 2012; Heess et al., 2012). Gradients of the total reward with respect to policies (NN weights) are estimated (and then exploited) through repeated NN evaluations.…”
Section: Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs (mentioning)
confidence: 99%
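The excerpt above ends by noting that gradients of the total reward with respect to the policy weights are estimated through repeated evaluations of the policy. As a generic illustration of that idea (a minimal score-function / REINFORCE-style estimator on a made-up 3-armed bandit, not the specific deep HRL methods cited), the following NumPy sketch estimates the gradient purely from sampled rollouts; all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a 1-D array of action preferences."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def estimate_policy_gradient(theta, reward_fn, num_rollouts=100):
    """Score-function (REINFORCE-style) estimate of d E[reward] / d theta for a
    softmax policy over discrete actions, built only from repeated evaluations."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(num_rollouts):
        a = rng.choice(len(theta), p=probs)   # sample an action from the policy
        r = reward_fn(a)                      # evaluate it on the task
        grad_log = -probs.copy()              # d log pi(a) / d theta for softmax
        grad_log[a] += 1.0
        grad += r * grad_log
    return grad / num_rollouts

# Hypothetical 3-armed bandit: the third arm pays best on average.
theta = np.zeros(3)
reward_fn = lambda a: rng.normal((0.0, 0.5, 1.0)[a], 1.0)
for _ in range(200):
    theta += 0.1 * estimate_policy_gradient(theta, reward_fn)
```

With these (arbitrary) step sizes the softmax policy should gradually drift toward the best-paying arm, using only sampled evaluations of the policy rather than any gradient through the environment.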
“…which is modeled as a probability distribution in order to incorporate exploratory actions; for some special problems, the optimal solution to a control problem is actually a stochastic controller, see e.g., Sutton, McAllester, Singh, and Mansour (2000).…”
Section: General Assumptions and Problem Statement (mentioning)
confidence: 99%
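As one common concrete parameterization of such a stochastic policy (an illustrative choice, not taken from the cited work), a discrete-action policy can be written as a softmax over linear features:

```latex
\pi_\theta(a \mid s) \;=\; \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_{b}\exp\!\big(\theta^\top \phi(s,b)\big)}
```

Every action then retains nonzero probability, which supplies the exploratory actions the excerpt refers to.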
“…This allows two variations of the previous algorithm which are known as the policy gradient theorem (Sutton et al, 2000) …”
Section: Policy Gradient Theorem and G(PO)MDP (mentioning)
confidence: 99%
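For reference, the policy gradient theorem cited in this excerpt (Sutton et al., 2000) is usually stated as follows (standard notation, not copied from the citing paper):

```latex
\nabla_\theta J(\theta)
  \;=\; \sum_{s} d^{\pi_\theta}(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
  \;=\; \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right],
```

where d^{\pi_\theta} is the (discounted) state-visitation distribution under the current policy, so the gradient can be estimated from on-policy samples without differentiating the state distribution itself.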
“…The RL methods developed so far can be categorized into two types: Policy iteration where policies are learned based on value function approximation [21,12] and policy search where policies are learned directly to maximize expected future rewards [24,4,22,8,15,26]. …”
Section: Introduction (mentioning)
confidence: 99%