Temporal difference learning (TD) is a simple iterative algorithm widely used for policy evaluation in Markov reward processes. Bhandari et al. prove finite time convergence rates for TD learning with linear function approximation. The analysis rests on a key insight that establishes rigorous connections between TD updates and those of online gradient descent. In a model where observations are corrupted by i.i.d. noise, convergence results for TD follow by essentially mirroring the analysis for online gradient descent. Using an information-theoretic technique, the authors also provide results for the case where TD is applied to a single Markovian data stream and the algorithm's updates can be severely biased. Their analysis extends seamlessly to the study of TD learning with eligibility traces and to Q-learning for high-dimensional optimal stopping problems.
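To make the parallel with online gradient descent concrete, the sketch below implements the basic TD(0) update with linear value-function approximation; the function and argument names are illustrative and not taken from the paper.

```python
import numpy as np

def td0_linear(transitions, dim, gamma=0.99, alpha=0.05):
    """TD(0) with linear value-function approximation, V(s) ~ phi(s)^T theta.

    `transitions` is an iterable of (phi, reward, phi_next) tuples, which could
    come from i.i.d. samples or a single Markovian trajectory. All names here
    are illustrative, not the paper's notation.
    """
    theta = np.zeros(dim)
    for phi, r, phi_next in transitions:
        # TD error: delta = r + gamma * V(s') - V(s)
        delta = r + gamma * phi_next @ theta - phi @ theta
        # Semi-gradient step, structurally the same as an online SGD update
        theta = theta + alpha * delta * phi
    return theta
```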
Policy gradient methods are perhaps the most widely used class of reinforcement learning algorithms. These methods apply to complex, poorly understood control problems by performing stochastic gradient descent over a parameterized class of policies. Unfortunately, even for simple control problems solvable by classical techniques, policy gradient algorithms face non-convex optimization problems and are widely understood to converge only to local minima. This work identifies structural properties, shared by finite MDPs and several classic control problems, which guarantee that the policy gradient objective function has no suboptimal local minima despite being non-convex. When these assumptions are relaxed, our work gives conditions under which any local minimum is near-optimal, where the error bound depends on a notion of the expressive capacity of the policy class.
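As a concrete instance of stochastic gradient descent over a parameterized policy class, here is a minimal REINFORCE-style update for a linear-softmax policy; this is a generic sketch rather than the specific policy classes analyzed in the work, and all names are illustrative.

```python
import numpy as np

def reinforce_step(theta, trajectory, gamma=0.99, stepsize=0.1):
    """One score-function (REINFORCE) policy-gradient step.

    theta: (num_actions, num_features) parameters of a linear-softmax policy.
    trajectory: list of (state_features, action, reward) tuples. Illustrative.
    """
    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(len(trajectory))):
        phi, a, r = trajectory[t]
        G = r + gamma * G                       # discounted return from step t
        logits = theta @ phi
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        # Gradient of log pi(a|s) for a linear-softmax policy
        score = -np.outer(pi, phi)
        score[a] += phi
        grad += (gamma ** t) * G * score
    return theta + stepsize * grad
```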
Temporal difference learning (TD) is a simple iterative algorithm used to estimate the value function corresponding to a given policy in a Markov decision process. Although TD is one of the most widely used algorithms in reinforcement learning, its theoretical analysis has proved challenging and few guarantees on its statistical efficiency are available. In this work, we provide a simple and explicit finite time analysis of temporal difference learning with linear function approximation. Except for a few key insights, our analysis mirrors standard techniques for analyzing stochastic gradient descent algorithms, and therefore inherits the simplicity and elegance of that literature. Final sections of the paper show how all of our main results extend to the study of TD learning with eligibility traces, known as TD(λ), and to Q-learning applied in high-dimensional optimal stopping problems.
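Since the abstract highlights the extension to eligibility traces, here is a minimal TD(λ) sketch with linear function approximation and accumulating traces; the names and default parameters are illustrative rather than the paper's.

```python
import numpy as np

def td_lambda_linear(transitions, dim, lam=0.9, gamma=0.99, alpha=0.05):
    """TD(lambda) with linear function approximation and accumulating
    eligibility traces. `transitions` yields (phi, reward, phi_next) tuples;
    all names are illustrative."""
    theta = np.zeros(dim)
    trace = np.zeros(dim)
    for phi, r, phi_next in transitions:
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        trace = gamma * lam * trace + phi                    # eligibility trace
        theta = theta + alpha * delta * trace
    return theta
```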
We revisit the finite time analysis of policy gradient methods in the simplest setting: finite state and action problems with a policy class consisting of all stochastic policies and with exact gradient evaluations. Some recent works have viewed these problems as instances of smooth nonlinear optimization problems, suggesting small stepsizes and showing sublinear convergence rates. This note instead takes a policy iteration perspective and highlights that many versions of policy gradient succeed with extremely large stepsizes and attain a linear rate of convergence.
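To illustrate the setting, the sketch below performs one exact policy-gradient step on a tabular MDP with a softmax policy; the softmax parameterization and uniform start distribution are assumptions for this example, not necessarily the exact formulation in the note.

```python
import numpy as np

def exact_policy_gradient_step(P, R, theta, gamma=0.9, stepsize=10.0):
    """One exact softmax policy-gradient step on a tabular MDP.

    P: transition tensor P[s, a, s'], R: reward matrix R[s, a], theta: logits
    of shape (num_states, num_actions). The large default stepsize reflects the
    note's point that policy gradient tolerates very aggressive steps here.
    """
    S, A = R.shape
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                          # Q[s, a]
    # Discounted state occupancy under a uniform start distribution
    occ = np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.full(S, 1.0 / S))
    advantage = Q - V[:, None]
    # Softmax policy gradient: occ(s) * pi(a|s) * advantage(s, a)
    grad = occ[:, None] * pi * advantage
    return theta + stepsize * grad
```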
We consider the problem of optimizing a linear rational function subject to totally unimodular (TU) constraints over {0, 1} variables. Such formulations arise in many applications including assortment optimization. We show that a natural extended LP relaxation of the problem is "tight". In other words, any extreme point corresponds to an integral solution. We also consider more general constraints that are not TU but obtained by adding an arbitrary constraint to the set of TU constraints. Using structural insights about extreme points, we present a polynomial time approximation scheme (PTAS) for the general problem.
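To illustrate the kind of linearization involved, the sketch below solves the LP relaxation of a 0/1 linear-fractional program via the standard Charnes-Cooper substitution; this generic construction may differ in details from the extended relaxation studied in the paper, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def fractional_lp_relaxation(c, c0, d, d0, A, b):
    """LP relaxation of  max (c^T x + c0) / (d^T x + d0)  s.t.  A x <= b,
    x in {0, 1}^n, via the substitution y = t * x with t = 1 / (d^T x + d0).
    Assumes the denominator is positive over the feasible region."""
    c, d, A, b = (np.asarray(M, dtype=float) for M in (c, d, A, b))
    n = len(c)
    # Variables z = [y_1, ..., y_n, t]; linprog minimizes, so negate.
    obj = np.concatenate([-c, [-float(c0)]])
    # A y <= b t  and  y_i <= t  (relaxation of x_i <= 1).
    A_ub = np.vstack([np.hstack([A, -b.reshape(-1, 1)]),
                      np.hstack([np.eye(n), -np.ones((n, 1))])])
    b_ub = np.zeros(A_ub.shape[0])
    # Normalization: d^T y + d0 * t = 1.
    A_eq = np.concatenate([d, [float(d0)]]).reshape(1, -1)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  method="highs")        # default bounds enforce y >= 0, t >= 0
    y, t = res.x[:n], res.x[n]
    return y / t, -res.fun               # recovered x and optimal ratio
```

The abstract's tightness result concerns its extended relaxation of this type of program: under TU constraints, every extreme point corresponds to an integral solution, so the recovered x is a 0/1 vector.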