2019
DOI: 10.48550/arxiv.1912.05830
Preprint

Provably Efficient Exploration in Policy Optimization

Qi Cai,
Zhuoran Yang,
Chi Jin
et al.

Abstract: While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves…
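As a rough sketch of what an "optimistic version" of the policy gradient direction can look like, one can write the policy update as a KL-regularized (mirror-descent) step against an optimistically estimated action-value function. The specific form and symbols below are assumptions based on the abstract, not equations quoted from the paper:

\pi_h^{k+1}(\cdot \mid x) \;\propto\; \pi_h^{k}(\cdot \mid x)\, \exp\{\alpha\, \bar{Q}_h^{k}(x,\cdot)\}, \qquad \bar{Q}_h^{k}(x,a) \;=\; \hat{Q}_h^{k}(x,a) + b_h^{k}(x,a),

where \hat{Q}_h^k is an estimated action-value function, b_h^k \ge 0 is an exploration bonus that makes the estimate optimistic, and \alpha > 0 is a step size.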

Cited by 42 publications (91 citation statements)
References 33 publications
“…The other part is about transition kernels. As shown in Cai et al (2019); Ayoub et al (2020); Zhou et al (2020a), linear kernel MDPs as defined above cover several other MDPs studied in previous works, as special cases. For example, tabular MDPs with canonical basis (Cai et al, 2019; Ayoub et al, 2020; Zhou et al, 2020a), feature embedding of transition models (Yang and Wang, 2019a) and linear combination of base models (Modi et al, 2020)…”
Section: Model Assumptions
confidence: 87%
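For intuition on why tabular MDPs are covered as a special case, here is a standard worked instance with the canonical basis (the concrete feature map below is an illustration, not quoted from the cited works). A linear kernel MDP posits

P_h(s' \mid s, a) = \langle \phi(s, a, s'), \theta_h \rangle,

and choosing \phi(s, a, s') = e_{(s, a, s')} \in \mathbb{R}^{|S|^2 |A|} (the canonical basis vector indexed by the triple) together with (\theta_h)_{(s, a, s')} = P_h(s' \mid s, a) recovers any tabular MDP, with feature dimension d = |S|^2 |A|.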
“…Since our model is non-stationary, we cannot ensure that the estimated Q-function is "optimistic in the face of uncertainty", i.e. l_h^k ≤ 0, as in the previous work (Jin et al, 2019b; Cai et al, 2019) for the stationary case. Thanks to the sliding window method, the model prediction error here can be upper bounded by the slight changes of parameters in the sliding window.…”
Section: Model Prediction Error Term
confidence: 93%
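To make the sliding-window idea concrete, here is a minimal Python sketch of a sliding-window ridge-regression estimator for a drifting linear parameter; the window length, feature matrix, and regression targets are illustrative assumptions, not the cited paper's implementation.

import numpy as np

def sliding_window_estimate(features, targets, window, lam=1.0):
    # Ridge-regression estimate that uses only the most recent `window` samples,
    # so stale data from an earlier parameter regime is discarded.
    # features: (t, d) array of feature vectors; targets: (t,) array of targets.
    phi = features[-window:]
    y = targets[-window:]
    d = phi.shape[1]
    gram = lam * np.eye(d) + phi.T @ phi      # regularized Gram matrix
    return np.linalg.solve(gram, phi.T @ y)   # estimated parameter in R^d

Because older samples drop out of the window, the estimate tracks slow parameter drift, which is what allows the model prediction error to be bounded by the parameter variation inside the window.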
“…Our work is also closely related to a line of works that study RL algorithms with function approximations. There are many works (Wang, 2019, 2020; Cai et al, 2019; Zanette et al, 2020a; Jin et al, 2020b; Ayoub et al, 2020; Zhou et al, 2020; Kakade et al, 2020) studying different RL problems with the (generalized) linear function approximation. Furthermore, Wang et al (2020b) studies an optimistic LSVI algorithm for general function approximation.…”
Section: Related Work
confidence: 99%
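As a hedged illustration of optimism with linear function approximation, the sketch below computes an LSVI-style Q-value with an elliptical-norm exploration bonus, in the spirit of optimistic least-squares value iteration; the function name, the bonus multiplier beta, and the regression targets are placeholders, not the API of any cited work.

import numpy as np

def optimistic_q(phi_sa, features, targets, beta, lam=1.0):
    # phi_sa:   (d,) feature vector of the (state, action) pair being evaluated
    # features: (n, d) array of previously observed feature vectors
    # targets:  (n,) array of regression targets, e.g. reward + max_a Q_next
    d = features.shape[1]
    gram = lam * np.eye(d) + features.T @ features      # Lambda = lam*I + sum(phi phi^T)
    w = np.linalg.solve(gram, features.T @ targets)     # least-squares weights
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(gram, phi_sa))  # beta * ||phi||_{Lambda^{-1}}
    return phi_sa @ w + bonus                           # optimistic value estimate

Adding the bonus term inflates the estimate in directions of the feature space that have been explored little, which is what yields optimism in the face of uncertainty.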
“…An important issue is the choice of the norm. L∞ estimates have been popular in reinforcement learning for analyzing the tabular setting [32, 4, 17, 33], linear models [11, 57, 34] and kernel methods [60, 61] (see discussions below). However, in the case we are considering, L∞ estimates suffer from the curse of dimensionality with respect to the sample complexity, i.e.…”
Section: Introduction
confidence: 99%