2021
DOI: 10.48550/arxiv.2102.01748
Preprint
Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Abstract: We consider the problem of offline reinforcement learning (RL), a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as tabular Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows th…

Cited by 10 publications (15 citation statements)
References 15 publications
“…Empirically, it can work on simulation control tasks (Kidambi et al., 2020; Yu et al., 2020; Kumar et al., 2020; Liu et al., 2020; Chang et al., 2021). On the theoretical side, pessimism allows us to obtain the PAC guarantee on various models when a comparator policy is covered by offline data in some forms (Jin et al., 2020b; Rashidinejad et al., 2021; Yin et al., 2021; Zanette et al., 2021b; Zhang et al., 2021b; Chang et al., 2021). However, these algorithms and their analyses rely on a known representation.…”
Section: Related Work (mentioning)
confidence: 99%
“…Offline RL. Offline/batch RL studies the case where the agent only has access to an offline dataset obtained by executing a behavior policy in the environment. Sample-efficient learning results in offline RL typically work by assuming either sup-concentrability assumptions [39, 48, 4, 40, 15, 50, 10, 55] or lower-bounded exploration constants [57, 58] to ensure sufficient coverage of the offline data over all (relevant) states and actions. However, such strong coverage assumptions can often fail to hold in practice [16].…”
Section: Related Work (mentioning)
confidence: 99%
“…However, such strong coverage assumptions can often fail to hold in practice [16]. More recent works address this by using either policy constraint/regularization [16, 35, 29, 54] or the pessimism principle to optimize conservatively on the offline data [30, 59, 28, 25, 58, 42]. The policy-constraint/regularization-based approaches prevent the policy from visiting states and actions that have no or low coverage in the offline data.…”
Section: Related Work (mentioning)
confidence: 99%
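To make the pessimism principle mentioned above concrete, the sketch below runs tabular value iteration on the empirical MDP while subtracting a count-based confidence penalty, so state-action pairs with little offline coverage get conservative value estimates. This is a generic, minimal illustration; the penalty shape, the constant c_bonus, and the clipping to [0, ∞) are assumptions for the sketch, not the construction used in the paper above or in any specific cited work.

```python
import numpy as np

def pessimistic_q_iteration(counts, rewards, transitions, gamma=0.95,
                            n_iters=200, c_bonus=1.0):
    """Illustrative value iteration with a count-based pessimism penalty.

    counts[s, a]       -- number of offline samples observed for (s, a)
    rewards[s, a]      -- empirical mean reward for (s, a), assumed in [0, 1]
    transitions        -- empirical next-state distribution, shape (S, A, S)

    The penalty c_bonus / sqrt(max(counts, 1)) is a generic stand-in for a
    confidence width; real analyses derive it from concentration bounds.
    """
    S, A, _ = transitions.shape
    q = np.zeros((S, A))
    penalty = c_bonus / np.sqrt(np.maximum(counts, 1))
    for _ in range(n_iters):
        v = q.max(axis=1)                  # greedy state values
        # Bellman backup on the empirical model, minus the pessimism penalty;
        # poorly covered (s, a) pairs are heavily discounted.
        q = rewards - penalty + gamma * transitions @ v
        q = np.maximum(q, 0.0)             # clip to the valid value range
    greedy_policy = q.argmax(axis=1)
    return q, greedy_policy
```

The effect is that the returned greedy policy avoids actions the behavior policy rarely took, which is the conservative behavior the quoted passage attributes to pessimism-based methods.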
“…The seminal idea of variance reduction was originally proposed to accelerate finite-sum stochastic optimization, e.g., Gower et al. (2020); Johnson and Zhang (2013); Nguyen et al. (2017). Thereafter, the variance reduction strategy has been imported into RL, where it helps improve the sample efficiency of RL algorithms in multiple contexts, including but not limited to policy evaluation (Du et al., 2017; Khamaru et al., 2020; Wai et al., 2019; Xu et al., 2019), RL with a generative model (Sidford et al., 2018a,b; Wainwright, 2019b), asynchronous Q-learning (Li et al., 2020b), and offline RL (Yin et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%
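As a reference point for the variance reduction idea credited to Johnson and Zhang (2013), the sketch below implements a generic SVRG-style loop for finite-sum optimization: each epoch computes a full gradient at a snapshot point and uses it as a control variate for the per-sample stochastic steps. The function name, step size, and epoch length are illustrative assumptions; this is the classical optimization primitive, not the OPDVR procedure proposed in the paper above.

```python
import numpy as np

def svrg(grad_i, x0, n, step=0.1, n_epochs=20, m=None, rng=None):
    """Minimal SVRG sketch for minimizing f(x) = (1/n) * sum_i f_i(x).

    grad_i(x, i) returns the gradient of the i-th component at x.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = n if m is None else m
    x = np.array(x0, dtype=float)
    for _ in range(n_epochs):
        snapshot = x.copy()
        # Full ("anchor") gradient at the snapshot, recomputed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # Control variate: the two per-sample gradients share the index i,
            # so the update stays unbiased but has far lower variance than SGD.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= step * g
    return x

# Hypothetical usage: least squares on synthetic data.
A = np.random.randn(100, 5)
b = np.random.randn(100)
grad = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = svrg(grad, np.zeros(5), n=100, step=0.01)
```

The same recentering trick is what the RL variants cited above adapt: an anchor estimate (of a value function or Q-function) is refreshed periodically and used to cancel noise in the incremental updates.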