2021
DOI: 10.48550/arxiv.2102.11866
Preprint

Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

Abstract: Designing off-policy reinforcement learning algorithms is typically a very challenging task, because a desirable iteration update often involves an expectation over an on-policy distribution. Prior off-policy actor-critic (AC) algorithms have introduced a new critic that uses the density ratio for adjusting the distribution mismatch in order to stabilize the convergence, but at the cost of potentially introducing high biases due to the estimation errors of both the density ratio and value function. In this pap…

Cited by 10 publications (18 citation statements)
References 24 publications
“…[229] gave the first non-asymptotic convergence guarantee for two-time-scale natural actor-critic algorithms, with a mean-squared sample complexity of order O(1/((1−γ)^9 ε^4)). For single-scale actor-critic methods, global convergence with a sublinear rate was established in both [74] and [230]. The non-asymptotic convergence of policy-based algorithms has also been shown in other settings; see [239] for a regret analysis of the REINFORCE algorithm for discounted MDPs and [7, 146] for policy gradient methods in the setting of known model parameters.…”
Section: Discussion
confidence: 99%
“…In the third approach, the single-scale setting, the actor and the critic update their parameters simultaneously, but with a much larger learning rate for the actor than for the critic (see, e.g., [74, 230]).…”
Section: Actor-Critic Methods
confidence: 99%
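To make the single-scale setting concrete, here is a minimal tabular actor-critic sketch in which the actor and the critic are updated simultaneously at every step from the same sample, with the actor's step size a constant factor larger than the critic's. The random MDP, the step sizes, and the softmax/tabular parameterizations are illustrative assumptions, not the setup analysed in [74] or [230].

import numpy as np

# Single-scale actor-critic sketch on a small random tabular MDP (illustrative only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

# Random MDP: P[s, a] is a distribution over next states, R[s, a] a reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

theta = np.zeros((n_states, n_actions))   # softmax policy parameters (actor)
w = np.zeros(n_states)                    # tabular state-value estimates (critic)

alpha_actor, alpha_critic = 0.05, 0.01    # same time scale, larger actor step size

def policy(s):
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

s = rng.integers(n_states)
for t in range(20000):
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # TD error shared by both updates.
    delta = r + gamma * w[s_next] - w[s]

    # Simultaneous ("single-scale") updates: both parameters move on every transition.
    w[s] += alpha_critic * delta
    grad_log = -pi
    grad_log[a] += 1.0                    # grad of log pi(a|s) for a softmax policy
    theta[s] += alpha_actor * delta * grad_log

    s = s_next

The single-scale structure is visible in the loop body: critic and actor both move on every transition using the same TD error, rather than the critic being run to (near) convergence between actor updates as in two-time-scale schemes.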
“…With the help of the Fisher information, the optimality of natural actor-critic (Kakade, 2001; Peters and Schaal, 2008; Bhatnagar et al., 2009) is also established in both on-policy settings (Agarwal et al., 2020; Wang et al., 2019; Liu et al., 2020; Khodadadian et al., 2021b) and off-policy settings (Khodadadian et al., 2021a; Chen et al., 2021a). Moreover, Xu et al. (2021) establish the convergence to stationary points of an off-policy actor-critic with density ratio correction and a fixed sampling distribution. To study the optimality of the stationary points, Xu et al. (2021) also make some assumptions about the Fisher information.…”
Section: Related Work
confidence: 96%
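For intuition on how the Fisher information enters these natural actor-critic analyses, the sketch below computes one Fisher-preconditioned (natural) policy-gradient step for a softmax policy over actions in a single state. The random action-value estimates and the use of a pseudo-inverse in place of explicit damping are assumptions made only for illustration; this is not the estimator used in the cited works.

import numpy as np

# One natural policy-gradient step for a single-state softmax policy (illustrative only).
rng = np.random.default_rng(1)
n_actions = 4
theta = rng.normal(size=n_actions)        # policy parameters
q_hat = rng.normal(size=n_actions)        # assumed critic estimates of Q(a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)

# Vanilla policy gradient: sum_a pi(a) * grad log pi(a) * Q(a).
grads = np.eye(n_actions) - pi            # row a holds grad of log pi(a)
vanilla_grad = pi @ (grads * q_hat[:, None])

# Fisher information F = sum_a pi(a) * grad log pi(a) grad log pi(a)^T.
F = sum(p * np.outer(g, g) for p, g in zip(pi, grads))

# Natural gradient direction: F^{-1} times the vanilla gradient
# (pseudo-inverse handles the rank-deficient softmax Fisher matrix).
natural_grad = np.linalg.pinv(F) @ vanilla_grad

theta = theta + 0.1 * natural_grad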
“…the ratio between the state distribution of the target policy and that of the behavior policy (Hallak and Mannor, 2017; Gelada and Bellemare, 2019; Liu et al., 2018; Nachum et al., 2019; Zhang et al., 2020b), can be used to correct the state distribution mismatch between the behavior policy and the target policy. Consequently, convergence to stationary points of actor-critic methods in off-policy settings with the density ratio has also been established (Liu et al., 2019; Zhang et al., 2020c; Huang and Jiang, 2021; Xu et al., 2021).…”
Section: Introduction
confidence: 95%
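The sketch below shows how such a state density ratio can enter an off-policy policy-gradient estimate: each sample collected under the behavior policy is reweighted by a (here assumed, in practice estimated) ratio d_target(s)/d_behavior(s), alongside the usual per-action importance weight. The batch layout, the uniform behavior policy, the advantage estimates, and the ratio values are all hypothetical.

import numpy as np

# Density-ratio-corrected off-policy policy-gradient estimate (illustrative only).
rng = np.random.default_rng(2)
n_states, n_actions = 6, 3
theta = np.zeros((n_states, n_actions))   # softmax target-policy parameters

def target_policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

behavior_policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform behavior

# Transitions collected under the behavior policy: (state, action, advantage estimate).
batch = [(rng.integers(n_states), rng.integers(n_actions), rng.normal())
         for _ in range(256)]

# Assumed state density-ratio estimates d_target(s) / d_behavior(s).
rho = rng.uniform(0.5, 1.5, size=n_states)

grad = np.zeros_like(theta)
for s, a, adv in batch:
    pi = target_policy(s)
    iw = pi[a] / behavior_policy[s, a]    # per-action importance weight
    grad_log = -pi
    grad_log[a] += 1.0                    # grad of log pi(a|s) for a softmax policy
    # rho[s] corrects the state-distribution mismatch, iw the action distribution.
    grad[s] += rho[s] * iw * adv * grad_log
grad /= len(batch)

theta += 0.1 * grad                       # one corrected policy-gradient step

In the cited algorithms the ratio rho would itself be learned by an auxiliary critic; here it is supplied directly to keep the example self-contained.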