2021
DOI: 10.48550/arxiv.2103.06671
Preprint

Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks

Abstract: This paper studies the statistical theory of offline reinforcement learning with deep ReLU networks. We consider the off-policy evaluation (OPE) problem where the goal is to estimate the expected discounted reward of a target policy given the logged data generated by unknown behaviour policies. We study a regression-based fitted Q evaluation (FQE) method using deep ReLU networks and characterize a finite-sample bound on the estimation error of this method under mild assumptions. The prior works in OPE with eit…
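
For readers unfamiliar with FQE, below is a minimal sketch of the regression-based fitted Q evaluation procedure described in the abstract: each iteration regresses a deep ReLU network onto Bellman targets built from the logged transitions and the target policy. The names and shapes (state_dim, num_actions, target_policy, the dataset layout) and all hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
# Minimal FQE sketch with a deep ReLU network (illustrative, not the paper's exact method).
import torch
import torch.nn as nn

gamma = 0.99                     # discount factor (assumed)
state_dim, num_actions = 8, 4    # hypothetical problem sizes


def make_q_net():
    # Deep ReLU network mapping a state to a vector of action values Q(s, .).
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, num_actions),
    )


def fqe(dataset, target_policy, num_iters=50, epochs=20, lr=1e-3):
    """dataset: dict of tensors s, a, r, s_next logged by unknown behaviour policies.
    target_policy: maps a batch of states to action probabilities under the target policy."""
    s, a, r, s_next = dataset["s"], dataset["a"], dataset["r"], dataset["s_next"]
    q = make_q_net()
    for _ in range(num_iters):
        # Freeze the previous iterate and build regression targets
        # y = r + gamma * E_{a' ~ pi(.|s')} Q_k(s', a').
        with torch.no_grad():
            q_next = q(s_next)                          # (N, A)
            pi_next = target_policy(s_next)             # (N, A) action probabilities
            y = r + gamma * (pi_next * q_next).sum(-1)  # (N,)
        # Fit Q_{k+1} by least-squares regression onto the targets.
        q_new = make_q_net()
        opt = torch.optim.Adam(q_new.parameters(), lr=lr)
        for _ in range(epochs):
            pred = q_new(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(pred, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        q = q_new
    return q  # estimate of Q^pi for the target policy


# The OPE estimate of the policy value is then obtained by averaging the learned Q,
# weighted by the target policy, over a sample of initial states.
```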

Cited by 3 publications (3 citation statements)
References 19 publications
“…Our ongoing work Li et al (2022) suggests that a new variant of pessimistic model-based algorithm is sample-optimal for a broader range of ε, which in turn motivates further investigation into whether model-free algorithms can accommodate a broader ε-range too without compromising sample efficiency. Moving beyond the tabular setting, it would be of great importance to extend the algorithmic and theoretical framework to accommodate low-complexity function approximation (Nguyen-Tang et al, 2021).…”
Section: Discussion
confidence: 99%
“…A more recent line of work has studied variants of fitted Q-iteration (FQI) using neural network approximation, and provided statistical guarantees under different notions of smoothness. For example, Fan et al [10] exploited the Hölder smoothness of the range of Bellman operator to derive bounds on estimation error; Nguyen-Tang et al [28] approximated deep ReLU networks using Besov classes; and Long et al [22] analyzed two-layer neural networks based on neural tangent kernels or Barron spaces. All these works contribute to the understanding of empirical success of deep reinforcement learning.…”
Section: Related Work
confidence: 99%
“…Parallel to its practical significance, recently there is a surge of theoretical investigations towards offline RL via two threads: offline policy evaluation (OPE), where the goal is to estimate the value of a target (fixed) policy V π (Li et al, 2015; Jiang & Li, 2016; Wang et al, 2017; Liu et al, 2018; Kallus & Uehara, 2020; Uehara & Jiang, 2019; Feng et al, 2019; Nachum et al, 2019; Xie et al, 2019; Yin & Wang, 2020; Kato et al, 2020; Duan et al, 2020; Feng et al, 2020; Zhang et al, 2020b; Kuzborskij et al, 2020; Wang et al, 2020b; Zhang et al, 2021; Uehara et al, 2021; Nguyen-Tang et al, 2021; Hao et al, 2021; Xiao et al, 2021) and offline (policy) learning which intends to output a near-optimal policy (Antos et al, 2008a,b; Chen & Jiang, 2019; Le et al, 2019; Xie & Jiang, 2020a,b; Liu et al, 2020b; Hao et al, 2020; Zanette, 2020; Jin et al, 2020c; Hu et al, 2021; Yin et al, 2021a; Ren et al, 2021; Rashidinejad et al, 2021). Yin et al (2021b) initiates the studies for offline RL from the new perspective of uniform convergence in OPE (uniform OPE for short) which unifies OPE and offline learning tasks.…”
Section: Introduction
confidence: 99%