Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning

Yin, Ming; Bai, Yu; Wang, Yuxiang

doi:10.48550/arxiv.2007.03760

Cited by 13 publications

(16 citation statements)

References 13 publications


“…Theoretical analysis of offline RL can be traced back to Szepesvári and Munos [2005], under the uniform concentration assumption (analogue to Assumption 2.3). This assumption has been extensively investigated [Xie et al, 2020b, Yin et al, 2020, Ren et al, 2021]. Recently, a line of works showed that the pessimism principle allows offline policy optimization under a much weaker assumption, single policy concentration, both in tabular case and with function approximation [Rashidinejad et al, 2021, Jin et al, 2021b, Zanette et al, 2021].…”

Section: Related Work (mentioning)

confidence: 99%

“…Assumption 2.1 is the weakest assumption and is the most straightforward extension of the single policy concentration in single-agent RL [Rashidinejad et al, 2021]. Assumption 2.3 generalizes the uniform policy concentration in single-agent RL [Yin et al, 2020]. Assumption 2.2 is sandwiched by Assumption 2.1 and Assumption 2.3 as Assumption 2.2 implies Assumption 2.1 and Assumption 2.3 implies Assumption 2.2.…”

Section: Offline Two-player Zero-sum Game (mentioning)

confidence: 99%


When is Offline Two-Player Zero-Sum Markov Game Solvable?

Cui, Du. 2022. Preprint.

We study what dataset assumption permits solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: datasets satisfying the uniform concentration assumption and turn-based Markov games. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning.

* While we assume deterministic rewards for simplicity, our results can be straightforwardly generalized to unknown stochastic rewards, as the major difficulty is in learning the transitions rather than learning the rewards.

† A stochastic initial state is equivalent to an MDP with a deterministic initial state: create a dummy initial state that transitions to the next state following that initial state distribution.


“…Offline RL. Offline/batch RL studies the case where the agent only has access to an offline dataset obtained by executing a behavior policy in the environment. Sample-efficient learning results in offline RL typically work by assuming either sup-concentrability assumptions [39,48,4,40,15,50,10,55] or lower bounded exploration constants [57,58] to ensure the sufficient coverage of offline data over all (relevant) states and actions. However, such strong coverage assumptions can often fail to hold in practice [16].…”

Section: Related Work (mentioning)

confidence: 99%

“…by using optimism to encourage visitation to unseen states and actions [9,27,19,41,21,5,22,12,23,52]. In contrast, offline RL does not allow interactive exploration, and sample-efficient policy optimization algorithms typically focus on optimizing an unbiased (or downward biased) estimator of the value function [39,48,4,40,10,55,35,57,25,42]. It is therefore of interest to ask whether these two types of algorithms and theories can be connected in any way.…”

Section: Introduction (mentioning)

confidence: 99%

Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Xie, Jiang, Wang et al. 2021. Preprint. Self cite.

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi^\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm, which simply executes $\mu$ and runs offline policy optimization on the collected dataset, that finds an $\varepsilon$ near-optimal policy within $O(H^3 S C^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi^\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3 S \min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that, perhaps surprisingly, the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step. Overall, our results offer a quantitative understanding on the benefit of a good reference policy, and make a step towards bridging offline and online RL.
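The abstract above refers to a single-policy concentrability coefficient between the reference policy and the optimal policy without defining it. As a rough illustration only, the sketch below assumes the commonly used definition (the largest ratio of the target policy's state-action occupancy to the behavior policy's occupancy), which is not stated on this page. The function names and the toy occupancy tables are hypothetical. It also contrasts this with a uniform variant that takes the worst case over a set of policies, in the spirit of the uniform concentration assumptions mentioned in the citation statements.

```python
import numpy as np

def single_policy_concentrability(d_target, d_behavior):
    """Largest ratio d_target / d_behavior over entries the target policy visits.

    Assumed definition (not taken from the page): both inputs are occupancy
    measures over (state, action) pairs with matching shapes.
    """
    d_target = np.asarray(d_target, dtype=float)
    d_behavior = np.asarray(d_behavior, dtype=float)
    visited = d_target > 0
    if not np.any(visited):
        return 0.0
    # If the behavior policy never reaches a pair the target visits,
    # the ratio is unbounded.
    if np.any(d_behavior[visited] == 0):
        return float("inf")
    return float(np.max(d_target[visited] / d_behavior[visited]))

def uniform_concentrability(d_targets, d_behavior):
    """Worst-case single-policy coefficient over a collection of target policies."""
    return max(single_policy_concentrability(d, d_behavior) for d in d_targets)

# Toy occupancies over 2 states x 2 actions (hypothetical numbers).
d_mu = np.array([[0.5, 0.125], [0.25, 0.125]])   # behavior / reference policy
d_pi_star = np.array([[0.5, 0.0], [0.5, 0.0]])   # policy that always takes action 0
d_other = np.array([[0.0, 0.5], [0.0, 0.5]])     # policy that always takes action 1

c_star = single_policy_concentrability(d_pi_star, d_mu)
c_uniform = uniform_concentrability([d_pi_star, d_other], d_mu)
print(c_star, c_uniform)
```

In this toy case the uniform coefficient is larger than the single-policy one because the behavior policy rarely takes action 1. That is the intuition behind calling single-policy coverage the weaker requirement on the dataset.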


“…Exciting advances have been made in designing stable and high-performing empirical offline RL algorithms (Fujimoto et al, 2019; Laroche et al, 2019; Wu et al, 2019; Kumar et al, 2019, 2020; Agarwal et al, 2020; Kidambi et al, 2020; Siegel et al, 2020; Liu et al, 2020; Yang and Nachum, 2021; Yu et al, 2021). On the theoretical front, recent works have proposed efficient algorithms with theoretical guarantees, based on the principle of pessimism in face of uncertainty (Liu et al, 2020; Buckman et al, 2020; Yu et al, 2020; Rashidinejad et al, 2021), or variance reduction (Yin et al, 2020, 2021). Interesting readers are encouraged to check out these works and the references therein.…”

Section: Introduction (mentioning)

confidence: 99%

Corruption-Robust Offline Reinforcement Learning

Zhang, Chen, Zhu et al. 2021. Preprint.

We study the adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed to arbitrarily modify an $\epsilon$ fraction of the tuples. From the corrupted dataset the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in linear MDP of dimension $d$, even if the adversary only corrupts the reward element in a tuple. This contrasts with dimension-free results in robust supervised learning and best-known lower-bound in the online RL setting with corruption. Next, we propose robust variants of the Least-Square Value Iteration (LSVI) algorithm utilizing robust supervised learning oracles, which achieve near-matching performances in cases both with and without full data coverage. The algorithm requires the knowledge of $\epsilon$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, the knowledge of $\epsilon$ is necessary, as we show that being adaptive to unknown $\epsilon$ is impossible. This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.
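To make the corruption model in this abstract concrete, here is a minimal sketch of a dataset of (s, a, r, s') tuples in which a fixed fraction of entries has its reward overwritten. The function name, the `eps` parameter name, and the random choice of which tuples to alter are all assumptions for illustration. The adversary described in the abstract may choose tuples and replacement values arbitrarily, so a random choice is only one special case.

```python
import random

def corrupt_rewards(dataset, eps, bad_reward=0.0, seed=0):
    """Return a copy of `dataset` with the reward overwritten in an eps fraction of tuples.

    `dataset` is a list of (state, action, reward, next_state) tuples.
    Which tuples are altered is picked at random here; a real adversary
    would pick them (and the replacement values) adversarially.
    """
    rng = random.Random(seed)
    n_corrupt = int(eps * len(dataset))
    corrupt_idx = set(rng.sample(range(len(dataset)), n_corrupt))
    corrupted = []
    for i, (s, a, r, s_next) in enumerate(dataset):
        if i in corrupt_idx:
            # Only the reward element changes; transitions stay intact.
            corrupted.append((s, a, bad_reward, s_next))
        else:
            corrupted.append((s, a, r, s_next))
    return corrupted

# Toy dataset: 100 tuples, all with reward 1.0 (hypothetical values).
clean = [(i % 5, i % 2, 1.0, (i + 1) % 5) for i in range(100)]
dirty = corrupt_rewards(clean, eps=0.25)
n_changed = sum(1 for c, d in zip(clean, dirty) if c != d)
print(n_changed)
```

Only the reward field is altered, mirroring the abstract's remark that the lower bound holds even when the adversary corrupts rewards alone.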
