2019
DOI: 10.48550/arxiv.1910.08412
Preprint
On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation

Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a ne…
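
The alternation described in the abstract, a critic phase that estimates the value function followed by actor policy-gradient steps, can be illustrated with a minimal self-contained sketch in Python. The chain MDP, tabular softmax policy, TD(0) critic, and step sizes below are illustrative assumptions, not the paper's actual algorithm.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.95
theta = np.zeros((n_states, n_actions))   # actor parameters (softmax policy)
w = np.zeros(n_states)                    # critic parameters (tabular values)

def step(s, a):
    # Chain MDP (assumed for illustration): action 0 moves left, action 1
    # moves right; reward 1 whenever the rightmost state is reached.
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

def pi(s):
    # Softmax policy over the two actions in state s.
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for _ in range(200):                      # outer (actor) iterations
    # Critic phase: TD(0) evaluation of the current policy.
    s = 0
    for _ in range(50):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r = step(s, a)
        w[s] += 0.1 * (r + gamma * w[s_next] - w[s])
        s = s_next

    # Actor phase: a few policy-gradient steps, using the TD error as advantage.
    s = 0
    for _ in range(10):
        a = rng.choice(n_actions, p=pi(s))
        s_next, r = step(s, a)
        adv = r + gamma * w[s_next] - w[s]
        grad_log = -pi(s)                 # gradient of log softmax policy at (s, a)
        grad_log[a] += 1.0
        theta[s] += 0.05 * adv * grad_log
        s = s_next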

Cited by 29 publications (54 citation statements)
References 21 publications

“…Based on recent progress in non-convex optimization, non-asymptotic analyses of policy-based methods were first established for convergence to a stationary point. For example, [125] provided a convergence-rate analysis for a nested-loop actor-critic algorithm by quantifying the smallest number of actor updates k required to attain inf_{0≤m≤k} ‖∇J(θ^(m))‖^2 < ε. We denote this smallest number by K. When the actor uses a policy gradient step, the method achieves K ≤ O(1/ε^4) when the critic employs TD(0), K ≤ O(1/ε^3) when it employs gradient temporal difference (GTD), and K ≤ O(1/ε^{5/2}) when it employs accelerated GTD, with continuous state and action spaces.…”
Section: Discussion
confidence: 99%
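
In display form, the criterion and bounds quoted above read as follows; this is a reconstruction of the quoted inline notation, with K, θ^(m), and ε as in the statement.

\[
  K \;=\; \min\Bigl\{\, k \;:\; \inf_{0 \le m \le k} \bigl\|\nabla J(\theta^{(m)})\bigr\|^{2} < \varepsilon \,\Bigr\},
\]
\[
  K \le \mathcal{O}(\varepsilon^{-4}) \ \text{with a TD(0) critic}, \qquad
  K \le \mathcal{O}(\varepsilon^{-3}) \ \text{with a GTD critic}, \qquad
  K \le \mathcal{O}(\varepsilon^{-5/2}) \ \text{with an accelerated GTD critic}.
\]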
“…In the nested-loop setting (see, e.g., [125, 228]), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time-scale setting (see, e.g.…”
Section: Actor-Critic Methods
confidence: 99%
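
The structural contrast drawn in this excerpt can be sketched as follows in Python; critic_update and actor_update are hypothetical callables standing in for a TD-style evaluation step and a policy-gradient step, and the step-size schedules are illustrative assumptions only.

def nested_loop_actor_critic(theta, w, critic_update, actor_update,
                             n_outer=100, n_inner=50):
    # Nested-loop setting: the critic runs many inner updates (policy
    # evaluation) before each single actor update in the outer loop.
    for _ in range(n_outer):
        for _ in range(n_inner):
            w = critic_update(w, theta)
        theta = actor_update(theta, w)
    return theta, w

def two_timescale_actor_critic(theta, w, critic_update, actor_update,
                               n_iters=5000, beta0=0.1, alpha0=0.01):
    # Two time-scale setting: actor and critic are updated at every iteration,
    # with the critic's step size decaying more slowly than the actor's, so the
    # critic tracks the value function of the slowly changing policy.
    for t in range(1, n_iters + 1):
        w = critic_update(w, theta, step_size=beta0 / t ** 0.6)
        theta = actor_update(theta, w, step_size=alpha0 / t)
    return theta, w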
“…Second, in the RL literature actor-critic algorithms also aim to solve a problem similar to (4), where θ and ω are referred to as the actor and critic parameters, respectively; see, for example, [17, 20, 26, 29-31]. Among these works, only [17, 29] consider an online setting similar to the one studied in this paper.…”
Section: Related Work
confidence: 99%
“…The finite-time error bounds for the gradient TD algorithms [Maei et al., 2010] were further developed recently in [Dalal et al., 2018b, Liu et al., 2015, Gupta et al., 2019, Xu et al., 2019, Dalal et al., 2020, Kaledin et al., 2020, Wang and Zou, 2020, Ma et al., 2021]. There are also finite-time error bounds on policy gradient methods and actor-critic methods, e.g., [Kumar et al., 2019, Qiu et al., 2019, Wu et al., 2020, Cen et al., 2020, Bhandari and Russo, 2019, Agarwal et al., 2019, Mei et al., 2020]. We note that these studies are for non-robust RL algorithms; in this paper, we design robust RL algorithms and characterize their finite-time error bounds.…”
Section: Related Work
confidence: 99%