In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation, and we establish a sample complexity of $O(\epsilon^{-3})$, improving upon all previously known convergence bounds for such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs the $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds for this critic under both constant and diminishing step sizes, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation with an improved convergence rate of $O(1/T)$ after $T$ iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the $O(\epsilon^{-3})$ sample complexity. We derive our sample complexity bounds solely under the assumption that the behavior policy sufficiently explores all states and actions, which is a much lighter assumption compared to the related literature.

This $O(\epsilon^{-3})$ sample complexity is the best known convergence bound in the literature for AC algorithms with function approximation.

Novelty in the Critic. Off-policy TD with function approximation is well known to diverge due to the deadly triad [65]. To overcome this difficulty, we employ $n$-step TD-learning and show that a proper choice of $n$ naturally achieves convergence; we present finite-sample bounds under both constant and diminishing step sizes. To the best of our knowledge, we are the first to design a single-time-scale off-policy TD algorithm with function approximation that enjoys provable finite-sample bounds.

Novelty in the Actor. NAC under function approximation was developed in [1] by projecting the Q-values (gradients) onto a lower-dimensional space, which involves the discounted state visitation distribution, a quantity that is hard to estimate. We develop a new NAC algorithm for the function approximation setting that is instead based on the solution of a projected Bellman equation [73], which our critic is designed to solve.

Exploration through Off-Policy Sampling. We establish our convergence bounds under a minimal set of assumptions, viz., ergodicity under the behavior policy, which ensures sufficient exploration and thus resolves the challenges faced in on-policy sampling. As a result, learning can be performed using a single trajectory of samples generated by the behavior policy, and we do not require the constant resets of the system that were introduced in on-policy AC algorithms [1, 75] to ensure exploration. A similar observation about employing off-policy sampling to ensure exploration was made in the tabular setting in [34].

1.2. Related Literature. The two main approaches for learning an optimal policy in an RL problem are value-space methods, such as Q-learning, and policy-space methods, such as AC. The Q-learning algorithm proposed in [77] is perhaps the most well-known value-space method. The asymptotic convergence of Q-learning was established in [1...
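To make the critic's $n$-step construction concrete, the following is a minimal sketch of an importance-sampling-corrected $n$-step TD update for a linear critic $Q_w(s,a) = w^\top \phi(s,a)$. The names `phi`, `rho`, `trajectory`, and `alpha`, as well as the per-decision weighting, are illustrative assumptions rather than the paper's exact algorithm; the proper choice of $n$ and of the step sizes is the subject of the paper's analysis.

```python
import numpy as np

def nstep_offpolicy_td_update(w, trajectory, phi, rho, n, gamma, alpha):
    """One importance-weighted n-step TD update for a linear critic
    Q_w(s, a) = w^T phi(s, a).

    Assumed (illustrative) interface:
      trajectory : list of (state, action, reward) tuples of length n + 1,
                   generated by the behavior policy.
      phi(s, a)  : feature map returning a NumPy vector.
      rho(s, a)  : importance ratio pi(a | s) / mu(a | s) between the
                   target policy pi and the behavior policy mu.
    """
    s0, a0, _ = trajectory[0]
    correction = 0.0   # accumulated importance-weighted n-step TD error
    weight = 1.0       # running product of per-decision importance ratios
    for i in range(n):
        s, a, r = trajectory[i]
        s_next, a_next, _ = trajectory[i + 1]
        # One-step TD error at time i; the bootstrap term is corrected by the
        # importance ratio of the next (off-policy) action.
        delta = (r + gamma * rho(s_next, a_next) * (w @ phi(s_next, a_next))
                 - w @ phi(s, a))
        correction += (gamma ** i) * weight * delta
        weight *= rho(s_next, a_next)
    # Semi-gradient stochastic-approximation step along the initial feature.
    return w + alpha * correction * phi(s0, a0)
```

Iterating such an update along a single trajectory generated by the behavior policy corresponds to a single-time-scale off-policy critic of the kind described above; the role of the paper's analysis is to prescribe how $n$ and the step sizes must be chosen so that the resulting iteration converges despite the deadly triad.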