Provably Efficient Exploration in Policy Optimization

Cai, Qi; Yang, Zhuoran; Jin, Chi; Wang, Zhaoran

doi:10.48550/arxiv.1912.05830

Cited by 42 publications

(91 citation statements)

References 33 publications


“…The other part is about transition kernels. As shown in Cai et al (2019); Ayoub et al (2020); Zhou et al (2020a), linear kernel MDPs as defined above cover several other MDPs studied in previous works as special cases. For example, tabular MDPs with canonical basis (Cai et al, 2019; Ayoub et al, 2020; Zhou et al, 2020a), feature embedding of transition models (Yang and Wang, 2019a) and linear combination of base models (Modi et al, 2020)…”

Section: Model Assumptions (mentioning)

confidence: 87%
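
The tabular special case named in the excerpt above can be written out explicitly. The following is a sketch under assumed notation: the symbols $\psi$, $\theta_h$, $d$ and the indexing by triples $(s, a, s')$ are illustrative choices, not taken from the excerpt.

```latex
% Assumed form of a linear kernel MDP: the transition kernel at step h
% is linear in a known feature map psi with unknown parameter theta_h.
\[
  P_h(s' \mid s, a) = \psi(s, a, s')^\top \theta_h,
  \qquad \theta_h \in \mathbb{R}^d .
\]
% Tabular case: set d = |S|^2 |A| and use the canonical basis,
% one coordinate per triple (s, a, s').
\[
  \psi(s, a, s') = e_{(s, a, s')} \in \mathbb{R}^{|\mathcal{S}|^2 |\mathcal{A}|},
  \qquad
  [\theta_h]_{(s, a, s')} = P_h(s' \mid s, a).
\]
% The inner product then selects a single entry of theta_h:
\[
  \psi(s, a, s')^\top \theta_h = P_h(s' \mid s, a).
\]
```

With this choice the linear model places no restriction on the transition kernel, which is why the tabular setting appears as a special case.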

“…Since our model is non-stationary, we cannot ensure that the estimated Q-function is "optimistic in the face of uncertainty" as $l_h^k \le 0$ like the previous work (Jin et al, 2019b; Cai et al, 2019) in the stationary case. Thanks to the sliding window method, the model prediction error here can be upper bounded by the slight changes of parameters in the sliding window.…”

Section: Model Prediction Error Term (mentioning)

confidence: 93%

“…For example, a line of research develops optimism-based value iteration algorithms that successfully handle (ii) and (iii), e.g., (Jiang et al, 2017; Jin et al, 2019b; Wang et al, 2019b; Zanette et al, 2020; Wang et al, 2020; Ayoub et al, 2020; Zhou et al, 2020a). Besides, Cai et al (2019); Agarwal et al (2020); Efroni et al (2020) address challenges (ii)-(iv) but fail to consider (i), and Zhou et al (2020b); Touati and Vincent (2020) tackle (i)-(iii) but leave (iv) open. More importantly, these four challenges are coupled together, which requires sophisticated algorithm design.…”

Section: Introduction (mentioning)

confidence: 99%


Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

Zhong, Yang, Wang et al. 2021

Preprint

Self Cite


We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the periodically restarted optimistic policy optimization algorithm (PROPO), which is an optimistic policy optimization algorithm with linear function approximation. PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement, which are tailored for policy optimization in a non-stationary environment. In addition, using only the sliding-window technique, we propose a value-iteration algorithm. We establish dynamic upper bounds for the proposed methods and a matching minimax lower bound which shows the (near-)optimality of the proposed methods. To the best of our knowledge, PROPO is the first provably efficient policy optimization algorithm that handles non-stationarity.
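
The sliding-window idea mentioned in the abstract above can be illustrated on a toy problem. The sketch below is not PROPO and is not taken from the paper: it is a generic scalar ridge regression restricted to the most recent samples, and the function name `sliding_window_ridge`, the window size, and the regularizer `lam` are all hypothetical choices made for illustration.

```python
from collections import deque


def sliding_window_ridge(xs, ys, window, lam=1.0):
    """Ridge estimates of a scalar parameter theta in y = theta * x,
    computed at each step from only the last `window` samples.

    Illustrative only: a one-dimensional stand-in for sliding-window
    regression, not the estimator used in any particular paper.
    """
    buf = deque(maxlen=window)  # old samples fall out automatically
    estimates = []
    for x, y in zip(xs, ys):
        buf.append((x, y))
        sxx = sum(xi * xi for xi, _ in buf)
        sxy = sum(xi * yi for xi, yi in buf)
        # Closed-form scalar ridge solution on the current window.
        estimates.append(sxy / (lam + sxx))
    return estimates


# Noise-free data whose true parameter jumps from 1.0 to 3.0 halfway through.
xs = [1.0] * 40
ys = [1.0 * x for x in xs[:20]] + [3.0 * x for x in xs[20:]]

windowed = sliding_window_ridge(xs, ys, window=5, lam=0.1)
full = sliding_window_ridge(xs, ys, window=len(xs), lam=0.1)

print(f"window=5 final estimate:  {windowed[-1]:.3f}")
print(f"full-history final estimate: {full[-1]:.3f}")
```

In this toy run the windowed estimate ends close to the new parameter value, while the full-history estimate stays near the average of the old and new values. That gap is the motivation for discarding stale data when the environment drifts.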


“…Our work is also closely related to a line of works that study RL algorithms with function approximations. There are many works Wang, 2019, 2020; Cai et al, 2019; Zanette et al, 2020a; Jin et al, 2020b; Ayoub et al, 2020; Zhou et al, 2020; Kakade et al, 2020) studying different RL problems with the (generalized) linear function approximation. Furthermore, Wang et al (2020b) studies an optimistic LSVI algorithm for general function approximation.…”

Section: Related Work (mentioning)

confidence: 99%

On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

Qiu, Wang et al. 2021

Preprint

Self Cite


Achieving sample efficiency in reinforcement learning (RL) requires efficiently exploring the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem in the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $O(1/\varepsilon^2)$ sample complexity for generating an $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.


“…An important issue is the choice of the norm. $L^\infty$ estimates have been popular in reinforcement learning for analyzing the tabular setting [32,4,17,33], linear models [11,57,34], and kernel methods [60,61] (see discussions below). However, in the case we are considering, $L^\infty$ estimates suffer from the curse of dimensionality with respect to the sample complexity, i.e.…”

Section: Introduction (mentioning)

confidence: 99%

An $L^2$ Analysis of Reinforcement Learning in High Dimensions with Kernel and Neural Network Approximation

Long, Han, Weinan 2021

Preprint


Reinforcement learning (RL) algorithms based on high-dimensional function approximation have achieved tremendous empirical success in large-scale problems with an enormous number of states. However, most analyses of such algorithms give rise to error bounds that involve either the number of states or the number of features. This paper considers the situation where the function approximation is made either using the kernel method or the two-layer neural network model, in the context of a fitted Q-iteration algorithm with explicit regularization. We establish an $\tilde{O}(H^3 |A|^{1/4} n^{-1/4})$ bound for the optimal policy with $Hn$ samples, where $H$ is the length of each episode and $|A|$ is the size of action space. Our analysis hinges on analyzing the $L^2$ error of the approximated Q-function using $n$ data points. Even though this result still requires a finite-sized action space, the error bound is independent of the dimensionality of the state space. … that we can obtain an $\tilde{O}\big((1-\gamma)^{-2}(n^{-\frac{1}{2(1+\alpha)}} + \gamma^K)\big)$ bound for the optimal policy with $Kn$ samples, where $0 < \gamma < 1$ is the discount factor and $K$ is the number of iterations. This result builds on the assumption …
