2021
DOI: 10.48550/arxiv.2105.12540
Preprint

Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear Function Approximation

Abstract: In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation, and we establish a sample complexity of O(ε^{-3}), outperforming all previously known convergence bounds for such algorithms. To overcome the divergence caused by the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an n-step TD-learning algorithm with a properly chosen n. We present finite-sample convergence bounds on this critic …
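To make the critic's basic building block concrete, the sketch below shows a generic n-step semi-gradient TD update with linear function approximation. It is only an illustrative, simplified version under assumed names (`phi`, `alpha`, `gamma`, `n`): it omits the paper's off-policy corrections and its specific choice of n, and is not the authors' exact algorithm.

```python
import numpy as np

def n_step_td_update(w, trajectory, phi, n=5, gamma=0.95, alpha=0.05):
    """One sweep of n-step semi-gradient TD updates over a sampled trajectory.

    w          : weight vector; the value estimate is V(s) ~= phi(s) @ w
    trajectory : list of (state, reward) pairs generated by the behavior policy
    phi        : feature map, state -> np.ndarray of the same dimension as w
    n          : number of bootstrapping steps
    gamma      : discount factor
    alpha      : constant step size
    """
    T = len(trajectory)
    for t in range(T - n):
        # n-step return: discounted rewards over n steps plus a bootstrapped tail value
        G = sum(gamma ** k * trajectory[t + k][1] for k in range(n))
        s_tail = trajectory[t + n][0]
        G += gamma ** n * phi(s_tail) @ w
        # Semi-gradient update of the linear critic toward the n-step target
        s_t = trajectory[t][0]
        w = w + alpha * (G - phi(s_t) @ w) * phi(s_t)
    return w
```

A larger n places more weight on sampled rewards and less on the bootstrapped value, which is what lets a properly chosen n mitigate the divergence associated with the deadly triad in the off-policy setting.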

Cited by 5 publications (15 citation statements) | References 46 publications
“…In this work, we focus on the optimality of naive actor critic algorithms that do not use second order information. With the help of the Fisher information, the optimality of natural actor critic (Kakade, 2001; Peters and Schaal, 2008; Bhatnagar et al., 2009) is also established in both on-policy settings (Agarwal et al., 2020; Wang et al., 2019; Liu et al., 2020; Khodadadian et al., 2021b) and off-policy settings (Khodadadian et al., 2021a; Chen et al., 2021a). Moreover, Xu et al. (2021) establish the convergence to stationary points of an off-policy actor critic with density ratio correction and a fixed sampling distribution.…”
Section: Related Work (mentioning; confidence: 97%)
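For background on the Fisher-information preconditioning mentioned in the excerpt above, a standard form of the natural policy gradient step is shown below. This is the textbook update, given only as context; it is not necessarily the exact scheme analyzed in any of the cited works, and α, J, and F follow the usual conventions.

```latex
% Standard natural policy gradient step: precondition the vanilla gradient
% with the inverse Fisher information matrix of the current policy.
\theta_{k+1} = \theta_k + \alpha\, F(\theta_k)^{-1} \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}_{s,a \sim \pi_\theta}\!\left[
  \nabla_\theta \log \pi_\theta(a \mid s)\,
  \nabla_\theta \log \pi_\theta(a \mid s)^{\top}
\right].
```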
“…By contrast, we work on a general off-policy setting in that at any time step the behavior policy can always be arbitrarily different from the target policy. A weaker trackability of the critic can be obtained with the results from Chen et al. (2021b) directly without using our extension (i.e., Theorem 2), as done by Chen et al. (2021a); Khodadadian et al. (2021a) in their analysis of a natural actor critic. However, since Chen et al. (2021b) require both the dynamics of the Markov chain and the update operator to be fixed, Chen et al. (2021a); Khodadadian et al. (2021a) have to keep both the behavior policy and the target policy (actor) fixed when updating the critic.…”
Section: Convergence of the Critic (mentioning; confidence: 99%)