2021
DOI: 10.48550/arxiv.2102.01748
Preprint
Near-Optimal Offline Reinforcement Learning via Double Variance Reduction

Abstract: We consider the problem of offline reinforcement learning (RL), a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as tabular Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows th…

Cited by 10 publications (15 citation statements)
References 15 publications
“…Empirically, it can work on simulation control tasks (Kidambi et al., 2020; Yu et al., 2020; Kumar et al., 2020; Liu et al., 2020; Chang et al., 2021). On the theoretical side, pessimism allows us to obtain the PAC guarantee on various models when a comparator policy is covered by offline data in some forms (Jin et al., 2020b; Rashidinejad et al., 2021; Yin et al., 2021; Zanette et al., 2021b; Zhang et al., 2021b; Chang et al., 2021). However, these algorithms and their analyses rely on a known representation.…”
Section: Related Work (mentioning)
confidence: 99%
“…Offline RL. Offline/batch RL studies the case where the agent only has access to an offline dataset obtained by executing a behavior policy in the environment. Sample-efficient learning results in offline RL typically work by assuming either sup-concentrability assumptions [39, 48, 4, 40, 15, 50, 10, 55] or lower-bounded exploration constants [57, 58] to ensure sufficient coverage of the offline data over all (relevant) states and actions. However, such strong coverage assumptions can often fail to hold in practice [16].…”
Section: Related Work (mentioning)
confidence: 99%
“…However, such strong coverage assumptions can often fail to hold in practice [16]. More recent works address this by using either policy constraint/regularization [16, 35, 29, 54] or the pessimism principle to optimize conservatively on the offline data [30, 59, 28, 25, 58, 42]. The policy-constraint/regularization-based approaches prevent the policy from visiting states and actions that have no or low coverage in the offline data.…”
Section: Related Work (mentioning)
confidence: 99%
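To make the pessimism principle mentioned above concrete, the sketch below runs tabular value iteration on the empirical MDP while subtracting a count-based confidence penalty, so state-action pairs with little offline coverage get conservative value estimates. This is a generic, minimal illustration; the penalty shape, the constant c_bonus, and the clipping to [0, ∞) are assumptions for the sketch, not the construction used in the paper above or in any specific cited work.

```python
import numpy as np

def pessimistic_q_iteration(counts, rewards, transitions, gamma=0.95,
                            n_iters=200, c_bonus=1.0):
    """Illustrative value iteration with a count-based pessimism penalty.

    counts[s, a]       -- number of offline samples observed for (s, a)
    rewards[s, a]      -- empirical mean reward for (s, a), assumed in [0, 1]
    transitions        -- empirical next-state distribution, shape (S, A, S)

    The penalty c_bonus / sqrt(max(counts, 1)) is a generic stand-in for a
    confidence width; real analyses derive it from concentration bounds.
    """
    S, A, _ = transitions.shape
    q = np.zeros((S, A))
    penalty = c_bonus / np.sqrt(np.maximum(counts, 1))
    for _ in range(n_iters):
        v = q.max(axis=1)                  # greedy state values
        # Bellman backup on the empirical model, minus the pessimism penalty;
        # poorly covered (s, a) pairs are heavily discounted.
        q = rewards - penalty + gamma * transitions @ v
        q = np.maximum(q, 0.0)             # clip to the valid value range
    greedy_policy = q.argmax(axis=1)
    return q, greedy_policy
```

The effect is that the returned greedy policy avoids actions the behavior policy rarely took, which is the conservative behavior the quoted passage attributes to pessimism-based methods.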
“…The seminal idea of variance reduction was originally proposed to accelerate finite-sum stochastic optimization, e.g., Gower et al. (2020); Johnson and Zhang (2013); Nguyen et al. (2017). Thereafter, the variance reduction strategy has been imported into RL, where it helps improve the sample efficiency of RL algorithms in multiple contexts, including but not limited to policy evaluation (Du et al., 2017; Khamaru et al., 2020; Wai et al., 2019; Xu et al., 2019), RL with a generative model (Sidford et al., 2018a,b; Wainwright, 2019b), asynchronous Q-learning (Li et al., 2020b), and offline RL (Yin et al., 2021).…”
Section: Related Work (mentioning)
confidence: 99%
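As a reference point for the variance reduction idea credited to Johnson and Zhang (2013), the sketch below implements a generic SVRG-style loop for finite-sum optimization: each epoch computes a full gradient at a snapshot point and uses it as a control variate for the per-sample stochastic steps. The function name, step size, and epoch length are illustrative assumptions; this is the classical optimization primitive, not the OPDVR procedure proposed in the paper above.

```python
import numpy as np

def svrg(grad_i, x0, n, step=0.1, n_epochs=20, m=None, rng=None):
    """Minimal SVRG sketch for minimizing f(x) = (1/n) * sum_i f_i(x).

    grad_i(x, i) returns the gradient of the i-th component at x.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = n if m is None else m
    x = np.array(x0, dtype=float)
    for _ in range(n_epochs):
        snapshot = x.copy()
        # Full ("anchor") gradient at the snapshot, recomputed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # Control variate: the two per-sample gradients share the index i,
            # so the update stays unbiased but has far lower variance than SGD.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= step * g
    return x

# Hypothetical usage: least squares on synthetic data.
A = np.random.randn(100, 5)
b = np.random.randn(100)
grad = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = svrg(grad, np.zeros(5), n=100, step=0.01)
```

The same recentering trick is what the RL variants cited above adapt: an anchor estimate (of a value function or Q-function) is refreshed periodically and used to cancel noise in the incremental updates.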