2021
DOI: 10.48550/arxiv.2103.12923
Preprint

Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation

Abstract: Policy optimization methods are popular reinforcement learning algorithms because their incremental and on-policy nature makes them more stable than their value-based counterparts. However, the same properties also make them slow to converge and sample inefficient, as the on-policy requirement precludes data reuse and the incremental updates couple a large iteration complexity into the sample complexity. These characteristics have been observed in experiments as well as in theory in the recent work of Agarwal et …

Cited by 3 publications (10 citation statements) · References 22 publications

“…Our work contributes to the theoretical investigations of policy-based methods in RL (Cai et al., 2020; Shani et al., 2020; Lancewicki et al., 2020; Fei et al., 2020; He et al., 2021; Zhong et al., 2021; Luo et al., 2021; Zanette et al., 2021). The most related policy-based method is proposed by Shani et al. (2020), who also studies the episodic tabular MDPs with unknown transitions, stochastic losses, and bandit feedback.…”
Section: Related Work
confidence: 88%
“…To address this issue, we adopt the idea of policy cover, recently introduced in [Agarwal et al., 2020a, Zanette et al., 2021]. Specifically, we spend the first T_0 rounds to find an exploratory (mixture) policy π_cov (called policy cover) which tends to reach all possible directions of the feature space.…”
Section: The Linear MDP Case
confidence: 99%
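
For intuition about the policy-cover idea quoted above, here is a minimal sketch assuming a generic episodic environment interface; the names (PolicyCover, collect_cover_data, env_reset, env_step) are hypothetical, and the snippet illustrates only the general concept of a uniform mixture over previously found policies, not the specific construction of Agarwal et al. (2020a) or Zanette et al. (2021).

```python
import random

class PolicyCover:
    """Uniform mixture over previously learned policies (illustrative sketch).

    Executing the mixture -- draw one stored policy uniformly at random and
    follow it for a whole episode -- tends to reach every direction of the
    feature space that any of the stored policies can reach.
    """

    def __init__(self):
        self.policies = []

    def add(self, policy):
        # Each exploration phase contributes one more policy to the cover.
        self.policies.append(policy)

    def sample_policy(self):
        # One mixture draw per episode.
        return random.choice(self.policies)


def collect_cover_data(cover, env_reset, env_step, horizon, num_episodes):
    """Roll out the cover to gather exploratory transitions.

    env_reset() -> state and env_step(state, action) -> (next_state, reward)
    are stand-ins for whatever environment interface is in use.
    """
    data = []
    for _ in range(num_episodes):
        policy = cover.sample_policy()
        state = env_reset()
        for h in range(horizon):
            action = policy(state, h)
            next_state, reward = env_step(state, action)
            data.append((state, action, reward, next_state))
            state = next_state
    return data
```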
“…Motivated by this issue, a line of recent works [Cai et al., 2020, Shani et al., 2020, Agarwal et al., 2020a, Zanette et al., 2021] equip policy optimization with global exploration by adding exploration bonuses to the update, and prove favorable guarantees even without making extra exploratory assumptions. Moreover, they all demonstrate some robustness aspect of policy optimization (such as being able to handle adversarial losses or a certain degree of model misspecification).…”
Section: Introduction
confidence: 99%
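
As a rough, hypothetical illustration of the bonus-augmented updates described in the quote above (not the specific algorithm of any of the cited papers), the sketch below adds a generic elliptical bonus beta * sqrt(phi^T Sigma^{-1} phi) to estimated action values before a softmax (mirror-descent-style) policy step; all names and the particular choice of bonus are assumptions.

```python
import numpy as np

def elliptical_bonus(phi, Sigma_inv, beta):
    # Generic linear-bandit-style bonus: large when the feature direction phi
    # has been visited rarely (Sigma is a regularized feature covariance).
    return beta * np.sqrt(phi @ Sigma_inv @ phi)

def optimistic_softmax_step(logits, q_hat, phi_sa, Sigma_inv, beta, eta):
    """One multiplicative-weights policy update on optimistic values.

    logits : (A,) current per-action logits at a fixed state
    q_hat  : (A,) estimated action values at that state
    phi_sa : (A, d) feature vectors of the state-action pairs
    """
    bonuses = np.array([elliptical_bonus(phi, Sigma_inv, beta) for phi in phi_sa])
    new_logits = logits + eta * (q_hat + bonuses)      # optimism in the update
    weights = np.exp(new_logits - new_logits.max())    # numerically stable softmax
    return new_logits, weights / weights.sum()

# Tiny usage example with made-up numbers.
d, A = 4, 3
rng = np.random.default_rng(0)
logits = np.zeros(A)
q_hat = rng.standard_normal(A)
phi_sa = rng.standard_normal((A, d))
Sigma_inv = np.linalg.inv(np.eye(d))   # identity covariance: no directions explored yet
logits, pi = optimistic_softmax_step(logits, q_hat, phi_sa, Sigma_inv, beta=1.0, eta=0.1)
```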