Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Luo, Haipeng; Wei, Chen-Yu; Lee, Chung-Wei

doi:10.48550/arxiv.2107.08346

Cited by 2 publications

(12 citation statements)

References 8 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Our work contributes to the theoretical investigations of policy-based methods in RL (Cai et al, 2020;Shani et al, 2020;Lancewicki et al, 2020;Fei et al, 2020;He et al, 2021;Zhong et al, 2021;Luo et al, 2021;Zanette et al, 2021). The most related policy-based method is proposed by Shani et al (2020), who also studies the episodic tabular MDPs with unknown transitions, stochastic losses, and bandit feedback.…”

Section: Related Workmentioning

confidence: 88%

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Wu¹,

Yang²,

Han³

et al. 2021

Preprint

View full text Add to dashboard Cite

Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. However, theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art theoretical result of policy-based method in Shani et al. ( 2020) is only O( √ S 2 AH 4 K) where S is the number of states, A is the number of actions, H is the horizon, and K is the number of episodes, and there is a √ SH gap compared with the information theoretic lower bound Ω( √ SAH 3 K) (Jin et al., 2018). To bridge such a gap, we propose a novel algorithm Reference-based Policy Optimization with Stable at Any Time guarantee (RPO-SAT), which features the property "Stable at Any Time". We prove that our algorithm achieves O( √ SAH 3 K + √ AH 4 K) regret. When S > H, our algorithm is minimax optimal when ignoring logarithmic factors. To our best knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.

show abstract

Section: Related Workmentioning

confidence: 88%

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Wu¹,

Yang²,

Han³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…Lemma 2.5 (Lemma 3.1 by Luo et al (2021a)). If {b k } K k=1 are non-negative, B k (s, a) is defined as in Eq.…”

Section: Dilated Bonuses For Policy Optimizationmentioning

confidence: 99%

“…Later, Luo et al (2021b) study bandit-feedback linear-Q MDPs with a simulator, providing an O(d 2/3 H 2 K 2/3 ) regret bound. A refined version (Luo et al, 2021a) considers simulator-free bandit-feedback linear MDPs, giving an O(d 2 H 4 K 14/15 ) bound. Meanwhile, provided with a good exploratory policy (see Footnote 4), these bounds improve to O(poly(d, H) K/λ 0 ) and O(poly(d, H)λ −4/7 0 K 6/7 ), respectively.…”

Section: Related Workmentioning

confidence: 99%

“…for s ∈ S h (which again uses the simulator; see Algorithm 4 as well as detailed discussions in (Luo et al, 2021a)).…”

Section: Dilated Bonuses For Policy Optimizationmentioning

confidence: 99%

“…At last, we also apply our method to simulator-free linear MDPs (formally defined in Theorem 2.4), yielding an efficient algorithm with O(K 8/9 ) regret and greatly outperforming the best existing bound O(K 14/15 ) (Luo et al, 2021a). 3 Remarkably, in this application, we not only use our refined analysis for FTRL with the log-barrier regularizer, but also develop a more sample-efficient alternative to the Matrix Geometric Resampling (MGR) method introduced by Neu and Olkhovskaya (2020) and later adopted by Neu and Olkhovskaya (2021) and Luo et al (2021a), which could also be of independent interest.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Refined Regret for Adversarial MDPs with Linear Function Approximation

Yan¹,

Luo²,

Wei³

et al. 2023

Preprint

View full text Add to dashboard Cite

We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over K episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021b) is of order O(K 2/3 ) (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to O( √ K) in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the logbarrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves O(K 8/9 ) regret and greatly improves over the best existing bound O(K 14/15 ). This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu and Olkhovskaya (2020), which could again be of independent interest.

show abstract

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Cited by 2 publications

References 8 publications

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Nearly Optimal Policy Optimization with Stable at Any Time Guarantee

Refined Regret for Adversarial MDPs with Linear Function Approximation

Contact Info

Product

Resources

About