2021
DOI: 10.48550/arxiv.2112.10935
Preprint
Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Abstract: Policy optimization methods are among the most widely used classes of Reinforcement Learning (RL) algorithms, yet the theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art regret bound for a policy-based method, due to Shani et al. (2020), is only Õ(√(S²AH⁴K)), where S is the number of states, A is the number of actions, H is the horizon, and K is the number of episodes; this leaves a √(SH) gap compared with the information-theoretic lower bound.
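As a quick check of the stated gap: a lower bound of order Ω̃(√(SAH³K)) (the standard minimax rate for this setting, assumed here since the abstract is truncated before stating it) sits exactly a √(SH) factor below the upper bound:

```latex
\[
\frac{\sqrt{S^{2} A H^{4} K}}{\sqrt{S A H^{3} K}}
  = \sqrt{\frac{S^{2} A H^{4} K}{S A H^{3} K}}
  = \sqrt{S H}.
\]
```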

Cited by 2 publications (2 citation statements) | References 18 publications
“…This cost is often of order Õ(√K) and is one of the dominating terms in the regret bound; see e.g. (Shani et al., 2020; Wu et al., 2021) for the finite-horizon case. For SSP, this is undesirable because it also depends on T⋆ or even T_max.…”
Section: Analysis Highlights (mentioning)
confidence: 99%
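To make concrete why a √K term scaling with T_max is undesirable in SSP, consider a generic two-term regret decomposition (a sketch under assumed notation: B⋆ denotes the expected cost of the optimal policy, with B⋆ ≤ T⋆ ≤ T_max; this is not the cited papers' exact bound):

```latex
\[
R_K \;\le\;
\underbrace{\tilde{O}\!\bigl(B_\star \sqrt{S A K}\bigr)}_{\text{leading term}}
\;+\;
\underbrace{\tilde{O}\!\bigl(T_{\max}\sqrt{K}\bigr)}_{\text{exploration cost}}.
\]
```

Since T_max can be arbitrarily larger than B⋆, the second term can dominate the first for any realistic number of episodes K.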
“…Variance reduction via reference function originates from the optimization literature [Johnson and Zhang, 2013]. Recently, several RL works [Zhang et al., 2020, Wu et al., 2021, Xie et al., 2021a, Cui and Du, 2022] also adopt this technique to obtain sharper bounds. Nevertheless, these works only focus on the tabular setting, and the linear case is more complicated and thus requires refined analysis.…”
Section: Introduction (mentioning)
confidence: 99%
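For context on the cited technique, below is a minimal sketch of variance reduction via a reference point in the style of SVRG [Johnson and Zhang, 2013], applied to a toy least-squares problem; the objective, step size, and all variable names are illustrative assumptions, not taken from the cited RL papers:

```python
import numpy as np

# SVRG-style variance reduction (Johnson & Zhang, 2013) on a toy
# least-squares objective f(w) = (1/2n) * ||X w - y||^2.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_i(w, i):
    """Stochastic gradient of the i-th component f_i(w)."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    """Exact gradient, recomputed only at the reference point."""
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
lr = 0.02
for epoch in range(50):
    w_ref = w.copy()        # fix the reference point for this epoch
    mu = full_grad(w_ref)   # full gradient anchored at the reference
    for _ in range(n):
        i = rng.integers(n)
        # Variance-reduced estimator: still unbiased, but its variance
        # shrinks as w approaches w_ref, unlike plain SGD's estimator.
        g = grad_i(w, i) - grad_i(w_ref, i) + mu
        w -= lr * g

print("distance to w_true:", np.linalg.norm(w - w_true))
```

In the cited RL works, the analogous idea fixes a slowly updated reference value function and re-estimates only the residual between the current and reference values, so the variance of the correction term shrinks as the two converge.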