2023
DOI: 10.1007/s10994-023-06303-2

On the sample complexity of actor-critic method for reinforcement learning with function approximation

Abstract: Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between value-function estimation steps and policy-gradient updates. Because the updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic is known, obtained by connecting the iterates to a dynamical system. This work puts forth a ne…
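
The alternation the abstract describes can be sketched as follows. This is a minimal, generic one-step actor-critic with linear function approximation; the environment interface (reset/step), feature map, and step sizes are illustrative assumptions, not the algorithm analyzed in the paper.

```python
import numpy as np

def actor_critic(env, featurize, n_actions, gamma=0.99,
                 alpha_critic=0.05, alpha_actor=0.01, episodes=500):
    """Sketch of one-step actor-critic with a linear critic and softmax actor.

    Assumes env.reset() -> state, env.step(a) -> (state, reward, done), and
    featurize(state) -> feature vector. Illustrative only.
    """
    d = featurize(env.reset()).shape[0]
    w = np.zeros(d)                   # critic: V(s) ~ w @ phi(s)
    theta = np.zeros((n_actions, d))  # actor: softmax policy parameters

    def policy(phi):
        logits = theta @ phi
        p = np.exp(logits - logits.max())
        return p / p.sum()

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            phi = featurize(s)
            p = policy(phi)
            a = np.random.choice(n_actions, p=p)
            s_next, r, done = env.step(a)
            phi_next = featurize(s_next)

            # critic step: TD(0) error and semi-gradient update of w
            td_error = r + (0.0 if done else gamma * w @ phi_next) - w @ phi
            w += alpha_critic * td_error * phi

            # actor step: policy-gradient update driven by the critic's TD error
            grad_log = -np.outer(p, phi)
            grad_log[a] += phi
            theta += alpha_actor * td_error * grad_log

            s = s_next
    return theta, w
```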

Cited by 33 publications (13 citation statements). References 55 publications.

“…For example, Kumar et al. (2019) provided a convergence rate analysis for a nested-loop Actor–Critic algorithm to a stationary point by quantifying the smallest number of actor updates k required to attain $\inf_{0\le m\le k}\Vert \nabla J(\theta^{(m)})\Vert^2 < \varepsilon$. We denote this smallest number as K.…”
Section: Discussion
confidence: 99%
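
Written out, the quantity denoted K in the excerpt is simply the first iteration index at which the best gradient norm so far drops below the tolerance (a restatement of the definition above, not a new result):

$$
K(\varepsilon) \;=\; \min\Bigl\{\, k \ge 0 \;:\; \inf_{0 \le m \le k} \bigl\Vert \nabla J(\theta^{(m)}) \bigr\Vert^{2} < \varepsilon \,\Bigr\}.
$$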
“…There are three main ways to execute the algorithm. In the nested‐loop setting (see, e.g., Kumar et al., 2019; Xu et al., 2020a), the actor updates the policy in the outer loop after the critic's repeated updates in the inner loop. The second way is the two time‐scale setting (see, e.g., Xu et al., 2020b), where the actor and the critic update their parameters simultaneously with different learning rates.…”
Section: The Basics Of Reinforcement Learning
confidence: 99%
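
The two execution schedules named in the excerpt differ only in when the actor is allowed to update. Below is a schematic contrast, assuming hypothetical `critic_update` and `actor_update` steps; the toy stand-ins exist only so the code runs and are not the actual gradient steps of any cited algorithm.

```python
# Toy stand-ins so the schedules below execute; in practice these would be a TD
# step for the critic and a policy-gradient step for the actor (assumption).
def critic_update(critic, actor, lr=1e-2):
    return critic + lr * (actor - critic)   # pull the value estimate toward a target

def actor_update(actor, critic, lr=1e-3):
    return actor + lr * critic               # move the policy using the critic's signal

def nested_loop(actor, critic, n_outer, n_inner):
    """Nested-loop: many critic (inner) updates per single actor (outer) update."""
    for _ in range(n_outer):
        for _ in range(n_inner):
            critic = critic_update(critic, actor)  # inner loop: policy evaluation
        actor = actor_update(actor, critic)        # outer loop: policy improvement
    return actor, critic

def two_time_scale(actor, critic, n_steps, lr_critic=1e-2, lr_actor=1e-3):
    """Two time-scale: both update every step, the critic on a faster learning rate."""
    for _ in range(n_steps):
        critic = critic_update(critic, actor, lr=lr_critic)  # fast time scale
        actor = actor_update(actor, critic, lr=lr_actor)     # slow time scale
    return actor, critic

# Example: nested_loop(0.0, 0.0, n_outer=10, n_inner=50)
```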
“…Actor-critic methods [188] combine policy gradient and value function estimation. The actor learns the policy, while the critic estimates the value function to evaluate the policy's performance.…”
Section: Q-learning [185]
confidence: 99%
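
In symbols, one standard instantiation of this division of labor is a TD-learning critic paired with a policy-gradient actor (a textbook form, not necessarily the exact variant used by the cited work): the critic's TD error both drives the value-parameter update and serves as the advantage signal for the policy update,

$$
\delta_t = r_{t+1} + \gamma \hat{V}_w(s_{t+1}) - \hat{V}_w(s_t), \qquad
w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w \hat{V}_w(s_t), \qquad
\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t).
$$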
“…Compared with an on-policy algorithm, an off-policy algorithm can explore the environment freely while collecting interaction data. It can therefore make more efficient use of the data without degrading the performance of the final policy [21,22].…”
Section: On-policy Algorithm and Off-policy Algorithm
confidence: 99%
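
The data-efficiency point in the excerpt is usually realized with a replay buffer: off-policy learners can resample old transitions many times, whereas on-policy learners discard their batch after each policy update. A minimal sketch, with arbitrary illustrative buffer and batch sizes:

```python
import random
from collections import deque

# Transitions gathered under any behavior policy are stored and reused repeatedly.
replay_buffer = deque(maxlen=100_000)

def store(transition):
    """Append a (s, a, r, s_next, done) tuple collected from the environment."""
    replay_buffer.append(transition)

def sample_batch(batch_size=64):
    """Draw a uniform minibatch of stored transitions for an off-policy update."""
    return random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
```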