2019
DOI: 10.48550/arxiv.1909.02506
Preprint

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Abstract: In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman ranks, we develop an online learning algorithm that learns the optimal value function while at the same time achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed in [1], known as …
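The abstract only names the elimination idea behind AVE, so the following minimal sketch illustrates a generic value-function-elimination loop in that spirit: keep a set of candidate value functions, act greedily with respect to an optimistic survivor, and discard candidates with large estimated average Bellman error. The environment interface, the candidate representation, and the elimination threshold are assumptions made for illustration; this is not the paper's AVE procedure or its analysis.

```python
# Illustrative sketch only (not the paper's AVE algorithm): a generic
# value-function-elimination loop. The environment interface (env.reset,
# env.step, env.actions), the candidate representation, and the elimination
# threshold are all hypothetical choices made for this example.
import numpy as np


def value_function_elimination(env, candidates, horizon, num_episodes,
                               elim_threshold=0.1):
    """Each candidate is a callable q(state, action, step) -> float."""
    surviving = list(candidates)
    for _ in range(num_episodes):
        if not surviving:
            break

        # Optimistic selection: follow the surviving candidate that promises
        # the largest value at the initial state.
        s0 = env.reset()
        q_sel = max(surviving,
                    key=lambda q: max(q(s0, a, 0) for a in env.actions))

        # Roll out the greedy policy of the selected candidate and record the
        # trajectory for Bellman-error estimation.
        traj, s = [], s0
        for h in range(horizon):
            a = max(env.actions, key=lambda act: q_sel(s, act, h))
            s_next, r = env.step(a)
            traj.append((s, a, r, s_next, h))
            s = s_next

        # Eliminate every candidate whose average Bellman error on this
        # trajectory exceeds the threshold. A single trajectory is used only
        # to keep the sketch short; a real algorithm would average over many
        # rollouts and use data-dependent confidence widths.
        def avg_bellman_error(q):
            errors = []
            for (s_h, a_h, r_h, s_next_h, h) in traj:
                v_next = 0.0 if h + 1 == horizon else max(
                    q(s_next_h, a2, h + 1) for a2 in env.actions)
                errors.append(q(s_h, a_h, h) - (r_h + v_next))
            return abs(float(np.mean(errors)))

        surviving = [q for q in surviving
                     if avg_bellman_error(q) <= elim_threshold]
    return surviving
```

The design point of such a loop, under a low-Bellman-rank style assumption, is that the data collected by one optimistic policy can expose Bellman-equation violations for many candidates at once, which is what makes elimination-based exploration plausible; the precise thresholds and guarantees are the subject of the paper itself.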

Cited by 4 publications (7 citation statements). References 30 publications.
“…Most of the existing work focuses on the tabular setting; see e.g., Strehl et al (2006); Jaksch et al (2010); Osband et al (2014); Osband and Van Roy (2016); Azar et al (2017); Dann et al (2017); Agrawal and Jia (2017); Jin et al (2018); Russo (2019); Rosenberg and Mansour (2019a,b); Jin and Luo (2019); Zanette and Brunskill (2019); Simchowitz and Jamieson (2019); Dong et al (2019b) and the references therein. Under the function approximation setting, sample-efficient algorithms have been proposed using linear function approximators (Abbasi-Yadkori et al, 2019a,b;Yang and Wang, 2019a;Du et al, 2019b;Cai et al, 2019;Wang et al, 2019), as well as nonlinear ones (Wen and Van Roy, 2017;Jiang et al, 2017;Dann et al, 2018;Du et al, 2019b;Dong et al, 2019a;Du et al, 2019a). Among these results, our work is most related to ; ; Cai et al (2019), which consider linear MDP models and propose optimistic and randomized variants of least-squares value iteration (LSVI) (Bradtke and Barto, 1996;Osband et al, 2014) as well as optimistic variants of proximal policy optimization (Schulman et al, 2017).…”
Section: Related Work
Mentioning, confidence: 99%
“…Broadly speaking, our work is related to a vast body of work on value-based reinforcement learning in tabular (Jaksch et al, 2010;Osband et al, 2014;Osband and Van Roy, 2016;Azar et al, 2017;Dann et al, 2017;Strehl et al, 2006;Jin et al, 2018) and linear settings (Yang and Wang, 2019a,b;Jin et al, 2019), as well as nonlinear settings involving general function approximators (Wen and Van Roy, 2017;Jiang et al, 2017;Du et al, 2019b;Dong et al, 2019). In particular, our setting is the same as the linear setting studied by Jin et al (2019), which generalizes the one proposed by Yang and Wang (2019a,b).…”
Section: Related Work
Mentioning, confidence: 99%
“…In particular, our setting is the same as the linear setting studied by Jin et al (2019), which generalizes the one proposed by Yang and Wang (2019a,b). Also, our setting is a special case of the low-Bellman-rank setting studied by Jiang et al (2017); Dong et al (2019) with Bellman-rank at most d. In comparison, we focus on policy-based reinforcement learning, which is significantly less studied in theory. In particular, compared with optimistic LSVI (Jin et al, 2019), OPPO attains the same regret even in the presence of adversarially chosen reward functions.…”
Section: Related Work
Mentioning, confidence: 99%
“…There are methods for more general, non-linear, function approximation, but these works either (a) require strong environment assumptions such as determinism (Wen and Van Roy, 2013;, (b) require strong function class assumptions such as bounded Eluder dimension (Russo and Van Roy, 2013;Osband and Van Roy, 2014), (c) have sample complexity scaling linearly with the function class size (Lattimore et al, 2013;Ortner et al, 2014) or (d) are computationally intractable (Jiang et al, 2017;Sun et al, 2019;Dong et al, 2019). Note that Ortner et al (2014); Jiang et al (2015) consider a form of representation learning, abstraction selection, but the former scales linearly with the number of candidate abstractions, while the latter does not address exploration.…”
Section: Related Work
Mentioning, confidence: 99%