2021
DOI: 10.48550/arxiv.2107.08346
Preprint

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Haipeng Luo,
Chen-Yu Wei,
Chung-Wei Lee

Abstract: Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need for such assumptions, in this work we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to…
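
To make the idea in the abstract concrete, here is a minimal, hypothetical sketch (not the paper's algorithm: the tabular setting, array shapes, and the exponential-weights form of the update are all assumptions) of a policy update that subtracts an exploration bonus from the estimated loss before the local update step:

```python
import numpy as np

def policy_update_with_bonus(pi, q_hat, bonus, eta):
    """One multiplicative-weights policy update per state (illustrative sketch).

    pi    : (S, A) current policy, rows sum to 1
    q_hat : (S, A) estimated Q-values of the loss for the current episode
    bonus : (S, A) exploration bonuses added to the update
    eta   : learning rate
    """
    # Subtracting the bonus from the loss estimate makes actions with
    # high exploration value more attractive to the local policy update.
    logits = np.log(pi + 1e-12) - eta * (q_hat - bonus)
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return new_pi / new_pi.sum(axis=1, keepdims=True)
```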

Cited by 2 publications (12 citation statements)
References 8 publications
“…Our work contributes to the theoretical investigations of policy-based methods in RL (Cai et al., 2020; Shani et al., 2020; Lancewicki et al., 2020; Fei et al., 2020; He et al., 2021; Zhong et al., 2021; Luo et al., 2021; Zanette et al., 2021). The most related policy-based method is proposed by Shani et al. (2020), who also studies the episodic tabular MDPs with unknown transitions, stochastic losses, and bandit feedback.…”
Section: Related Work
confidence: 88%
“…Lemma 2.5 (Lemma 3.1 by Luo et al. (2021a)). If $\{b_k\}_{k=1}^{K}$ are non-negative, $B_k(s, a)$ is defined as in Eq.…”
Section: Dilated Bonuses for Policy Optimization
confidence: 99%
“…Later, Luo et al. (2021b) study bandit-feedback linear-Q MDPs with a simulator, providing an $O(d^{2/3} H^{2} K^{2/3})$ regret bound. A refined version (Luo et al., 2021a) considers simulator-free bandit-feedback linear MDPs, giving an $O(d^{2} H^{4} K^{14/15})$ bound. Meanwhile, provided with a good exploratory policy (see Footnote 4), these bounds improve to $O(\mathrm{poly}(d, H)\sqrt{K/\lambda_0})$ and $O(\mathrm{poly}(d, H)\,\lambda_0^{-4/7} K^{6/7})$, respectively.…”
Section: Related Work
confidence: 99%