Fast LSTD Using Stochastic Approximation: Finite Time Analysis and Application to Traffic Control

Prashanth, L A; Korda, Nathaniel; Munos, Rémi

doi:10.1007/978-3-662-44851-9_5

Cited by 18 publications

(15 citation statements)

References 12 publications


“…Analysis of related algorithms: A number of papers analyze algorithms related to and inspired by the classic TD algorithm. First, among others, Antos et al [2008], , , Pires and Szepesvári [2012], Prashanth et al [2013] and Tu and Recht [2018] analyze least-squares temporal difference learning (LSTD). Yu and Bertsekas [2009] study the related least-squares policy iteration algorithm.…”

Section: Asymptotic Analysis Of Stochastic Approximation

mentioning

confidence: 99%

A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation

Bhandari, Russo, Singal

2018

Preprint


Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Final sections of the paper show how all of our main results extend to the study of TD learning with eligibility traces, known as TD(λ), and to Q-learning applied in high-dimensional optimal stopping problems.
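For reference, the TD(0) update with linear function approximation that this abstract refers to is usually written as below. The notation is an assumption made here for illustration and is not taken from the abstract: $\phi(s)$ is the feature vector of state $s$, $\theta_t$ the weight vector, $\alpha_t$ the step size, $\gamma$ the discount factor, and $r_t$ the reward observed on the transition from $s_t$ to $s_{t+1}$.

```latex
% Linear value estimate and the TD(0) update (standard form; notation assumed)
V_{\theta}(s) = \phi(s)^{\top}\theta,
\qquad
\theta_{t+1} = \theta_t
  + \alpha_t \bigl( r_t + \gamma\,\phi(s_{t+1})^{\top}\theta_t
  - \phi(s_t)^{\top}\theta_t \bigr)\,\phi(s_t).
```

The term in parentheses is the temporal difference error. The update resembles a stochastic gradient step, which is consistent with the abstract's remark that the analysis mirrors that of stochastic gradient descent.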



“…This paper is an extended version of an earlier work (see Prashanth et al 2014). This work corrects the errors in the earlier work by using significant deviations in the proofs, and includes additional simulation experiments.…”

mentioning

confidence: 84%

“…This work corrects the errors in the earlier work by using significant deviations in the proofs, and includes additional simulation experiments. Finally, by Narayanan and Szepesvári (2017), the authors list a few problems with the results and proofs in the conference version (Prashanth et al 2014), and the corrections incorporated in this work address the comments by Narayanan and Szepesvári (2017).…”

mentioning

confidence: 99%

Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling

Prashanth, Korda, Munos

2021

Mach Learn

Self Cite


We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an O(d) improvement in complexity in comparison to LSTD, where d is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption can be easily satisfied for the pathwise LSTD variant proposed by Lazaric (J Mach Learn Res 13:3041-3074, 2012). Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. These rate results coupled with the low computational complexity of our method make it attractive for implementation in big data settings, where d is large. A similar low-complexity alternative for least squares regression is well-known as the stochastic gradient descent (SGD) algorithm. We provide finite-time bounds for SGD. We demonstrate the practicality of our method as an efficient alternative for pathwise LSTD empirically by combining it with the least squares policy iteration algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large scale news recommendation dataset from Yahoo.
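The idea in this abstract, running TD(0) on samples drawn uniformly from a fixed dataset instead of solving the LSTD linear system, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration rather than the authors' implementation: the helper names `make_cyclic_dataset`, `lstd_solution`, and `td0_uniform_sampling`, the toy dataset, and the choice of a constant step size with iterate averaging. The toy features follow a cyclic trajectory so that the LSTD matrix is positive definite, which is the condition the abstract assumes.

```python
import numpy as np


def make_cyclic_dataset(n, d, seed=0):
    """Hypothetical toy data: unit-norm features along a cyclic trajectory."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=(n, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    # The successor of sample i is sample i + 1, wrapping around at the end.
    phi_next = np.roll(phi, -1, axis=0)
    rewards = phi @ np.linspace(1.0, -1.0, d) + 0.1 * rng.normal(size=n)
    return phi, phi_next, rewards


def lstd_solution(phi, phi_next, rewards, gamma):
    """Batch LSTD: solve A theta = b, with A and b formed from all samples."""
    n = phi.shape[0]
    A = phi.T @ (phi - gamma * phi_next) / n
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)


def td0_uniform_sampling(phi, phi_next, rewards, gamma,
                         n_iters, step_size, seed=0):
    """TD(0) on uniformly sampled transitions; returns the averaged iterate.

    Each iteration touches one sample and costs O(d), whereas forming and
    solving the LSTD system costs at least O(d^2) per sample.
    The constant step size and the averaging are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for k in range(1, n_iters + 1):
        i = rng.integers(n)  # pick one stored transition uniformly at random
        td_error = rewards[i] + gamma * (phi_next[i] @ theta) - phi[i] @ theta
        theta = theta + step_size * td_error * phi[i]
        theta_avg += (theta - theta_avg) / k  # running mean of the iterates
    return theta_avg
```

As a usage example, calling `make_cyclic_dataset(n=50, d=3)`, then `lstd_solution(..., gamma=0.5)` and `td0_uniform_sampling(..., gamma=0.5, n_iters=20000, step_size=0.1)` on the same data should give two weight vectors that are close to each other, with the second obtained without ever building a d-by-d matrix.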


“…The analysis is a bit different than the one in Lazaric et al [2010b] and the bound is weaker in terms of d and ν. Another recent result is by Prashanth et al [2014] that use stochastic approximation to solve LSTD(0), where the resulting algorithm is exactly TD(0) with random sampling (samples are drawn i.i.d. and not from a trajectory), and report a Markov design bound (the bound is computed only at the states used by the algorithm) of O( d nν ) for LSTD(0).…”

Section: Related Work

mentioning

confidence: 99%

Finite-Sample Analysis of Proximal Gradient TD Algorithms

Liu, Ghavamzadeh et al.

2020

Preprint


In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. Two novel GTD algorithms are also proposed, namely projected GTD2 and GTD2-MP, which use proximal "mirror maps" to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.
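The kind of primal-dual reformulation this abstract describes can be sketched as follows. The symbols are assumptions introduced here and are not taken from the abstract: $A$ and $b$ denote the expected TD system matrix and vector, $M$ is a positive definite weighting matrix, and $y$ is the dual variable. The identity used is the standard convex conjugate of a weighted squared norm.

```latex
% A squared-norm objective rewritten as a saddle-point problem (notation assumed)
\tfrac{1}{2}\,\lVert b - A\theta \rVert_{M^{-1}}^{2}
  = \max_{y}\Bigl( \langle b - A\theta,\, y \rangle
  - \tfrac{1}{2}\,\lVert y \rVert_{M}^{2} \Bigr),
\qquad\text{so}\qquad
\min_{\theta}\ \tfrac{1}{2}\,\lVert b - A\theta \rVert_{M^{-1}}^{2}
  = \min_{\theta}\max_{y}\Bigl( \langle b - A\theta,\, y \rangle
  - \tfrac{1}{2}\,\lVert y \rVert_{M}^{2} \Bigr).
```

The right-hand side is linear in $A$ and $b$, so unbiased sample estimates of both give unbiased stochastic gradients in $\theta$ and $y$. This is one way to read the abstract's claim that these methods are true stochastic gradient algorithms with respect to a saddle-point objective.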

