1997
DOI: 10.1109/9.580874

An analysis of temporal-difference learning with function approximation

Abstract: We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability one), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furt…
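The algorithm the abstract describes, TD(λ) with a linear function approximator updated online along a single trajectory, can be sketched in a few lines. The sketch below is illustrative only: the feature matrix, step size, and random chain are assumptions made for the example, and it uses a constant step size for simplicity even though the paper's analysis relies on diminishing step sizes.

```python
import numpy as np

def td_lambda(phi, P, cost, gamma, lam, alpha, n_steps, seed=0):
    """TD(lambda) with a linear approximator, run online along one trajectory.

    phi:  (n_states, k) feature matrix (illustrative choice)
    P:    (n_states, n_states) transition matrix of the Markov chain
    cost: (n_states,) per-stage cost
    """
    rng = np.random.default_rng(seed)
    n_states, k = phi.shape
    theta = np.zeros(k)            # parameters of the linear approximator
    z = np.zeros(k)                # eligibility trace
    s = rng.integers(n_states)     # start anywhere; a single endless trajectory
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        # temporal-difference error for the cost-to-go estimate
        delta = cost[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
        z = gamma * lam * z + phi[s]           # accumulate the trace
        theta = theta + alpha * delta * z      # online parameter update
        s = s_next
    return theta

# Tiny made-up two-state chain; constant step size for simplicity
# (the paper's convergence analysis uses diminishing step sizes).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
cost = np.array([1.0, 0.0])
phi = np.eye(2)                    # tabular features, just for illustration
theta = td_lambda(phi, P, cost, gamma=0.9, lam=0.7, alpha=0.01, n_steps=50_000)
print(phi @ theta)                 # approximate cost-to-go of each state
```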

Cited by 1,037 publications (1,003 citation statements)
References: 19 publications
“…One approach is to approximate the solution using an approximate value function with a compact representation. A common choice is the use of linear value functions as an approximation: value functions that are a linear combination of potentially non-linear basis functions (Bellman, Kalaba, & Kotkin, 1963; Sutton, 1988; Tsitsiklis & Van Roy, 1996b). Our work builds on the ideas of Koller and Parr (1999, 2000), by using factored (linear) value functions, where each basis function is restricted to some small subset of the domain variables.…”
Section: Factored MDPs (mentioning)
confidence: 99%
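To make the quoted notion of a linear value function concrete, here is a small illustrative sketch, with made-up basis functions and weights (not taken from the cited papers): the value is a linear combination of potentially non-linear basis functions, and in the factored case each basis function looks only at a small subset of the state variables.

```python
import numpy as np

# Illustrative only (names and numbers are made up, not from the cited papers):
# a linear value function V(x) = sum_i w_i * h_i(x), where each basis function
# may be non-linear and, in the factored case, looks at only a few variables.
basis = [
    lambda x: 1.0,                 # constant basis function
    lambda x: float(x[0]),         # depends only on state variable 0
    lambda x: float(x[1] * x[2]),  # depends only on state variables 1 and 2
]

def value(x, w):
    """Linear combination of (potentially non-linear) basis functions."""
    return sum(wi * h(x) for wi, h in zip(w, basis))

w = np.array([0.5, -1.0, 2.0])         # weights of the linear combination
print(value(np.array([1, 0, 1]), w))   # 0.5*1 - 1.0*1 + 2.0*0 = -0.5
```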
“…An approach along the lines described above has been used in various papers, with several recent theoretical and algorithmic results (Schweitzer & Seidmann, 1985; Tsitsiklis & Van Roy, 1996b; Van Roy, 1998; Koller & Parr, 1999, 2000). However, these approaches suffer from a problem that we might call "norm incompatibility."…”
Section: Max-norm Projection (mentioning)
confidence: 99%
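The excerpt names "norm incompatibility" without defining it; a common reading, stated here as an assumption rather than a quote from the citing paper, is that the Bellman operator T contracts in the max norm while the tractable projection onto span(Φ) is only optimal in a weighted Euclidean norm, so composing the two carries no immediate guarantee in either norm:

```latex
\|TV - TV'\|_{\infty} \le \gamma \,\|V - V'\|_{\infty},
\qquad
\Pi_{2,D}\, V = \arg\min_{\bar V \in \operatorname{span}(\Phi)} \|\bar V - V\|_{2,D}.
```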
“…When the behavior policy π_0 is different from the target policy π, a sufficient condition for contraction is that λ be close enough to 1; indeed, when λ tends to 1, the spectral radius of T^(λ) tends to zero and can potentially balance an expansion of the projection Π_0. In the off-policy case, when γ is sufficiently big, a small value of λ can make Π_0 T^(λ) expansive (see [20] for an example in the case λ = 0) and off-policy LSPE(λ) will then diverge. Eventually, Equations (5) and (8) show that when λ = 1, both LSTD(λ) and LSPE(λ) asymptotically coincide (as T^(1) V does not depend on V).…”
Section: Off-policy LSPE(λ) (mentioning)
confidence: 99%
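The contraction argument in the excerpt can be checked numerically. The linear part of T^(λ) is (1-λ)γP(I - λγP)^{-1}, and the deterministic iteration behind LSPE(λ) converges when the spectral radius of Π_0 times that matrix is below 1. The sketch below is a toy check with made-up numbers (it is not the example from [20]); for this particular off-policy weighting, small λ pushes the spectral radius above 1 while λ near 1 brings it back below 1.

```python
import numpy as np

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvals(A)))

def linear_part(P, gamma, lam):
    # Linear part of T^(lam): (1-lam) * gamma * P @ inv(I - lam*gamma*P)
    n = P.shape[0]
    return (1 - lam) * gamma * P @ np.linalg.inv(np.eye(n) - lam * gamma * P)

# Made-up two-state chain, one feature, and an off-policy weighting D0
# (none of this is the example from [20]).
P = np.array([[0.5, 0.5],
              [0.5, 0.5]])
gamma = 0.95
phi = np.array([[1.0],
                [2.0]])
D0 = np.diag([0.9, 0.1])                                   # behaviour-policy weights
Pi0 = phi @ np.linalg.inv(phi.T @ D0 @ phi) @ phi.T @ D0   # weighted projection

# The deterministic counterpart of LSPE(lam), V <- Pi0 T^(lam) V, converges
# when the spectral radius of Pi0 @ linear_part is below 1.
for lam in (0.0, 0.5, 0.9, 0.99):
    rho = spectral_radius(Pi0 @ linear_part(P, gamma, lam))
    print(f"lambda = {lam:4.2f}   spectral radius = {rho:.3f}")
```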
“…Eventually, we consider 2 experiments involving an MDP and a projection due to [20], in order to illustrate possible numerical issues when solving the projected fixed-point Eq. (5).…”
Section: Illustration of the Algorithms (mentioning)
confidence: 99%