Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Xu, Tengyu; Wang, Zhe; Liang, Y. T.

doi:10.48550/arxiv.2004.12956

Cited by 14 publications

(38 citation statements)

References 22 publications


“…An immediate future work is to employ our analysis for a sample-based implementation of NPG, also known as natural actor-critic. Recently, there has been a line of work on the analysis of actor-critic type algorithms [23,28,32,33,14,13], where [13] characterizes the best convergence result of O(1/k^{1/3}). By employing the improved convergence rate of NPG proposed in this paper, we believe that it is possible to improve the rate of the stochastic variant.…”

Section: Discussion (mentioning)

confidence: 99%

On the Linear convergence of Natural Policy Gradient Algorithm

Khodadadian, Jhunjhunwala, Varma, et al. 2021

Preprint


Markov Decision Processes are classically solved using Value Iteration and Policy Iteration algorithms. Recent interest in Reinforcement Learning has motivated the study of methods inspired by optimization, such as gradient ascent. Among these, a popular algorithm is the Natural Policy Gradient, which is a mirror descent variant for MDPs. This algorithm forms the basis of several popular Reinforcement Learning algorithms such as Natural actor-critic, TRPO, PPO, etc., and so is being studied with growing interest. It has been shown that Natural Policy Gradient with constant step size converges with a sublinear rate of O(1/k) to the global optimum. In this paper, we present improved finite time convergence bounds, and show that this algorithm has a geometric (also known as linear) asymptotic convergence rate. We further improve this convergence result by introducing a variant of Natural Policy Gradient with adaptive step sizes. Finally, we compare different variants of policy gradient methods experimentally.
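To make the mirror-descent view in the abstract above concrete, here is a minimal Python sketch of tabular Natural Policy Gradient with a softmax policy and a constant step size. It is not taken from the cited paper: it assumes the commonly used multiplicative-weights form of the update, pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s,a)), with any constants absorbed into `eta`, and it assumes the model (`P`, `R`) is known so that Q can be evaluated by repeated Bellman backups. The names `evaluate_q`, `npg_step`, and `run_npg` are hypothetical.

```python
import math

def evaluate_q(P, R, pi, gamma, sweeps=200):
    """Approximate Q^pi by repeated Bellman backups (assumes a known model).

    P[s][a][t] is the transition probability, R[s][a] the reward,
    pi[s][a] the policy. The error shrinks like gamma**sweeps.
    """
    S, A = len(R), len(R[0])
    v = [0.0] * S
    q = [[0.0] * A for _ in range(S)]
    for _ in range(sweeps):
        q = [[R[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(S))
              for a in range(A)] for s in range(S)]
        v = [sum(pi[s][a] * q[s][a] for a in range(A)) for s in range(S)]
    return q

def npg_step(log_pi, q, eta):
    """One update in log space: log pi_new = log pi + eta * Q - log-normalizer."""
    new_log_pi = []
    for lp_s, q_s in zip(log_pi, q):
        logits = [lp + eta * qa for lp, qa in zip(lp_s, q_s)]
        m = max(logits)  # subtract the max before exponentiating for stability
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        new_log_pi.append([x - lse for x in logits])
    return new_log_pi

def run_npg(P, R, gamma=0.9, eta=1.0, iters=300):
    """Run NPG from the uniform policy; returns the final policy as probabilities."""
    S, A = len(R), len(R[0])
    log_pi = [[-math.log(A)] * A for _ in range(S)]
    for _ in range(iters):
        pi = [[math.exp(x) for x in row] for row in log_pi]
        log_pi = npg_step(log_pi, evaluate_q(P, R, pi, gamma), eta)
    return [[math.exp(x) for x in row] for row in log_pi]
```

Keeping the policy in log space avoids taking the logarithm of probabilities that have underflowed to zero, which matters once the policy has nearly converged to a deterministic one.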


“…While the asymptotic convergence of actor-critic methods including natural actor-critic is well-understood by using the ODE approach [5,20], their finite-time convergence is largely unknown until recently [22,31,43,45]. The authors in [22,31] provide the rates of actor-critic where the parameter of the critic is updated by using a number of collected samples instead of only one single sample.…”

Section: Related Work (mentioning)

confidence: 99%

“…Such a setting, referred to as batch actor-critic, cannot be implemented in an online fashion since at any iteration the critic has to implement the current policy in a number of time steps to collect enough data. A similar batch approach was used in [45,46] to study natural actor-critic and in [36] the TRPO algorithm, which is another variant of mirror descent. A different approach was taken in [23,40] to obtain finite time bounds, where a setting of iid sampled data is considered.…”

Section: Related Work (mentioning)

confidence: 99%

“…The Markov chain is time varying because the policy is being updated. To the best of our knowledge, the only papers in the literature that consider such a setting are [43,45] which study the actor-critic algorithm under function approximation. Although their results are remarkable, they make several assumptions on the space of approximation functions.…”

Section: Related Work (mentioning)

confidence: 99%

“…Actor-critic algorithms can be classified into batch vs. online. The simplest method for the critic to evaluate the policy is by collecting many samples and then to perform a batch update [45]. This type of batch update, however, requires simulations that need to be restarted in specific states, making its implementation appropriate in artificial environments such as Atari games [35], but not in scenarios that require the agent to "learn as they go".…”

Section: Introduction (mentioning)

confidence: 99%


Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm

Khodadadian, Doan, Romberg, et al. 2021

Preprint


Actor-critic style two-time-scale algorithms are very popular in reinforcement learning, and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the global convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory. Our analysis applies to very general settings, as we only assume that the underlying Markov chain is ergodic under all policies (the so-called Recurrence assumption). We employ ε-greedy sampling in order to ensure enough exploration. For a fixed exploration parameter ε, we show that the natural actor-critic algorithm is O(1/T^{1/4} + ε) close to the global optimum after T iterations of the algorithm. By carefully diminishing the exploration parameter ε as the iterations proceed, we also show convergence to the global optimum at a rate of O(1/T^{1/6}).
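The exploration mechanism mentioned in the abstract above can be illustrated with a short Python sketch. It assumes one common reading of ε-greedy sampling in this setting, where the behavior policy mixes the actor's current action distribution with the uniform distribution; the cited paper's exact scheme and its schedule for diminishing ε are not reproduced here. The function names `epsilon_greedy_mix` and `sample_action` are hypothetical.

```python
import random

def epsilon_greedy_mix(pi_s, epsilon):
    """Mix one state's action distribution with the uniform distribution.

    Every action keeps probability at least epsilon / len(pi_s), so each
    action continues to be tried regardless of the actor's preferences.
    """
    n = len(pi_s)
    return [(1.0 - epsilon) * p + epsilon / n for p in pi_s]

def sample_action(pi_s, epsilon, rng=random):
    """Draw an action index from the mixed (exploratory) distribution."""
    probs = epsilon_greedy_mix(pi_s, epsilon)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A larger ε gives more exploration but keeps the behavior policy further from the actor's policy, which is consistent with the additive ε term in the fixed-ε bound quoted above.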


The Comprehensive Model of Using In-Depth Consolidated Multimodal Learning to Study Trading Strategies in the Securities Market

Boyko

2022

Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making

