2020
DOI: 10.48550/arxiv.2007.03760
Preprint

Near-Optimal Provable Uniform Convergence in Offline Policy Evaluation for Reinforcement Learning

Abstract: Off-policy evaluation (OPE) aims to estimate the performance of a target policy π using offline data collected by a logging policy µ. The problem has been studied intensively, and the recent marginalized importance sampling (MIS) approach achieves sample efficiency for OPE. However, little is known about whether uniform convergence guarantees for OPE can be obtained efficiently. In this paper, we consider this new question and, for the first time, reveal the comprehensive relationship between OPE and offline learning. For the …
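To make the setting above concrete, the snippet below is a minimal sketch of tabular, finite-horizon OPE via an estimated model of the logged data. In the tabular case this plug-in view is closely related to the marginalized importance sampling (MIS) idea the abstract mentions, but it is not the paper's estimator; the data layout, the function name `ope_plug_in`, and the fixed initial state are illustrative assumptions.

```python
# Minimal sketch of tabular off-policy evaluation (OPE) from logged episodes.
# All names and the data format are illustrative assumptions, not the paper's.
import numpy as np

def ope_plug_in(episodes, pi, S, A, H):
    """Estimate the value of target policy `pi` from logged episodes.

    episodes: list of trajectories, each a list of (s, a, r, s_next) per step
    pi:       target policy, shape (H, S, A), pi[t, s] a distribution over actions
    Returns an estimate of the expected total reward of `pi` over horizon H.
    """
    # Empirical transition and reward models fit on the logged data.
    P_hat = np.zeros((H, S, A, S))
    R_hat = np.zeros((H, S, A))
    count = np.zeros((H, S, A))
    for traj in episodes:
        for t, (s, a, r, s_next) in enumerate(traj):
            count[t, s, a] += 1
            P_hat[t, s, a, s_next] += 1
            R_hat[t, s, a] += r
    visited = count > 0
    P_hat[visited] /= count[visited][:, None]
    R_hat[visited] /= count[visited]

    # Forward recursion: marginal state distribution of the target policy
    # under the estimated model, accumulating expected rewards along the way.
    # (Assumes the logged data covers the (s, a) pairs that pi actually visits.)
    d = np.zeros(S)
    d[0] = 1.0  # assume a fixed initial state 0 for simplicity
    value = 0.0
    for t in range(H):
        d_sa = d[:, None] * pi[t]                   # marginal over (s, a) at step t
        value += np.sum(d_sa * R_hat[t])
        d = np.einsum("sa,sap->p", d_sa, P_hat[t])  # push forward through P_hat
    return value
```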

Cited by 13 publications (16 citation statements) | References 13 publications

“…Theoretical analysis of offline RL can be traced back to Szepesvári and Munos [2005], under the uniform concentration assumption (analogous to Assumption 2.3). This assumption has been extensively investigated [Xie et al., 2020b, Yin et al., 2020, Ren et al., 2021]. Recently, a line of works showed that the pessimism principle allows offline policy optimization under a much weaker assumption, single-policy concentration, both in the tabular case and with function approximation [Rashidinejad et al., 2021, Jin et al., 2021b, Zanette et al., 2021].…”
Section: Related Work (mentioning)
confidence: 99%
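For context, the two coverage conditions contrasted in this statement are commonly written as follows in the single-agent tabular setting; this is a standard formulation, not taken verbatim from the cited works. Here d^π denotes the occupancy measure of policy π and µ the logging (data) distribution.

```latex
% Uniform concentration: the data distribution \mu covers every candidate policy.
\max_{\pi \in \Pi}\;\sup_{s,a}\;\frac{d^{\pi}(s,a)}{\mu(s,a)} \;\le\; C \;<\; \infty

% Single-policy concentration: coverage is required only for one comparator,
% e.g. an optimal policy \pi^{\star}.
\sup_{s,a}\;\frac{d^{\pi^{\star}}(s,a)}{\mu(s,a)} \;\le\; C^{\star} \;<\; \infty
```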
“…Assumption 2.1 is the weakest assumption and is the most straightforward extension of the single-policy concentration in single-agent RL [Rashidinejad et al., 2021]. Assumption 2.3 generalizes the uniform policy concentration in single-agent RL [Yin et al., 2020]. Assumption 2.2 is sandwiched between Assumption 2.1 and Assumption 2.3, as Assumption 2.2 implies Assumption 2.1 and Assumption 2.3 implies Assumption 2.2.…”
Section: Offline Two-Player Zero-Sum Game (mentioning)
confidence: 99%
“…Offline RL. Offline/batch RL studies the case where the agent only has access to an offline dataset obtained by executing a behavior policy in the environment. Sample-efficient learning results in offline RL typically work by assuming either sup-concentrability assumptions [39, 48, 4, 40, 15, 50, 10, 55] or lower-bounded exploration constants [57, 58] to ensure sufficient coverage of the offline data over all (relevant) states and actions. However, such strong coverage assumptions can often fail to hold in practice [16].…”
Section: Related Work (mentioning)
confidence: 99%
“…by using optimism to encourage visitation to unseen states and actions [9, 27, 19, 41, 21, 5, 22, 12, 23, 52]. In contrast, offline RL does not allow interactive exploration, and sample-efficient policy optimization algorithms typically focus on optimizing an unbiased (or downward-biased) estimator of the value function [39, 48, 4, 40, 10, 55, 35, 57, 25, 42]. It is therefore of interest to ask whether these two types of algorithms and theories can be connected in any way.…”
Section: Introduction (mentioning)
confidence: 99%
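As a deliberately simplified illustration of the "downward-biased estimator" idea contrasted with optimism in the statement above, the sketch below applies a count-based pessimism penalty to an empirical Q table. The function name and the penalty form are assumptions for illustration only, not the algorithm of any cited paper.

```python
# Illustrative sketch (not from the cited works): turn an empirical Q estimate
# into a downward-biased one via a count-based uncertainty penalty.
import numpy as np

def pessimistic_q(q_hat, counts, scale=1.0):
    """Subtract a penalty ~ 1/sqrt(n(s, a)) from each empirical Q-value.

    q_hat:  empirical Q-values from the offline data, shape (S, A)
    counts: visitation counts n(s, a) in the dataset, shape (S, A)
    scale:  penalty coefficient (in theory, chosen from a confidence bound)
    """
    # Clamp counts at 1 so unseen pairs get a finite (maximal) penalty.
    penalty = scale / np.sqrt(np.maximum(counts, 1.0))
    return q_hat - penalty

# Acting greedily on the penalized values steers the learned policy away from
# state-action pairs the logging policy rarely covered, e.g.:
# greedy = np.argmax(pessimistic_q(q_hat, counts), axis=1)
```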
“…Exciting advances have been made in designing stable and high-performing empirical offline RL algorithms (Fujimoto et al., 2019; Laroche et al., 2019; Wu et al., 2019; Kumar et al., 2019, 2020; Agarwal et al., 2020; Kidambi et al., 2020; Siegel et al., 2020; Liu et al., 2020; Yang and Nachum, 2021; Yu et al., 2021). On the theoretical front, recent works have proposed efficient algorithms with theoretical guarantees, based on the principle of pessimism in the face of uncertainty (Liu et al., 2020; Buckman et al., 2020; Yu et al., 2020; Rashidinejad et al., 2021) or variance reduction (Yin et al., 2020, 2021). Interested readers are encouraged to check out these works and the references therein.…”
Section: Introduction (mentioning)
confidence: 99%