2020
DOI: 10.48550/arxiv.2003.00153
Preprint

Learning Near Optimal Policies with Low Inherent Bellman Error

Andrea Zanette,
Alessandro Lazaric,
Mykel Kochenderfer
et al.

Abstract: We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound Õ(∑_{t=1}^{H} d_t √K + ∑_{t=1}^{H} √d_t · 𝓘 K), where H is the horizon, K is the number of episodes …
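For readers unfamiliar with the condition named in the title, the inherent Bellman error is usually defined as the worst-case gap between the Bellman backup of the value-function class and the class itself. The following is a hedged sketch of that standard definition; the notation (Q_t for the function class at timestep t, T_t for the Bellman backup) is assumed here, not quoted from the paper:

\[
\mathcal{I} \;=\; \max_{t\in[H]} \;\sup_{Q_{t+1}\in\mathcal{Q}_{t+1}} \;\inf_{Q_t\in\mathcal{Q}_t} \;\bigl\|\,Q_t - \mathcal{T}_t Q_{t+1}\,\bigr\|_{\infty},
\qquad
(\mathcal{T}_t Q_{t+1})(s,a) \;=\; r_t(s,a) + \mathbb{E}_{s'\sim p_t(\cdot\mid s,a)}\Bigl[\max_{a'} Q_{t+1}(s',a')\Bigr].
\]

When 𝓘 = 0 the class is closed under the Bellman backup, which is the classical condition used to show convergence of approximate value iteration referenced in the abstract.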

Cited by 14 publications (27 citation statements) | References 15 publications
“…Since the assumption of exact realizability does not typically hold in practice, a more recent line of work has begun to investigate algorithms for misspecified models. In particular, Crammer and Gentile (2013); Ghosh et al (2017); Lattimore et al (2020); Foster and Rakhlin (2020); Zanette et al (2020) consider a uniform ε-misspecified setting in which…”
Section: Misspecification (mentioning)
confidence: 99%
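The "uniform ε-misspecified setting" referenced in this excerpt is commonly formalized as below. This is a sketch under assumed notation (φ a known d-dimensional feature map, f the true mean reward), not a quotation from any of the cited works:

\[
\exists\,\theta^\star \in \mathbb{R}^d:\quad \bigl|\,f(x,a) - \langle \phi(x,a), \theta^\star\rangle\,\bigr| \;\le\; \varepsilon \qquad \text{for every context--action pair } (x,a),
\]

that is, the linear model may be wrong everywhere, but its error is bounded uniformly by ε.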
“…The issue of adapting to unknown misspecification has not been addressed even for the stronger uniform notion (1). Indeed, previous efforts typically use prior knowledge of ε to encourage conservative exploration when misspecification is large; see Lattimore et al (2020, Appendix E), Foster and Rakhlin (2020, Section 5.1), Crammer and Gentile (2013, Section 4.2), and Zanette et al (2020) for examples. Naively adapting such schemes using, e.g., doubling tricks, presents difficulties because the quantities in Eq.…”
Section: Misspecification (mentioning)
confidence: 99%
“…Related work on misspecified linear bandits. Recently, works on reinforcement learning with misspecified linear features (e.g., [12,45,17]) have renewed interest in the related misspecified linear bandits (e.g., [48], [24], [30], [15]) first introduced in [16]. In [16], the authors show that standard algorithms must suffer Ω(εT) regret under an additive ε-perturbation of the linear model.…”
Section: Introduction (mentioning)
confidence: 99%
“…In [16], the authors show that standard algorithms must suffer Ω(εT) regret under an additive ε-perturbation of the linear model. Recently, [48] propose a robust variant of OFUL [1] that requires knowing the misspecification parameter ε. In particular, their algorithm obtains a high-probability Õ(d√T + ε√d·T) regret bound.…”
Section: Introduction (mentioning)
confidence: 99%
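To illustrate how prior knowledge of ε is typically folded into an optimistic algorithm, here is a minimal Python sketch of a LinUCB/OFUL-style selection rule whose confidence width is inflated by an ε-dependent term. The function name, the exact confidence radius, and the eps·√(d·t) inflation are choices made for this illustration; this is not the algorithm of [48].

import numpy as np

def misspecified_linucb_action(features, A, b, t, eps, lam=1.0, delta=0.05):
    """Hypothetical sketch of an OFUL/LinUCB-style arm selection rule whose
    exploration width is inflated to account for uniform eps-misspecification.

    features : (n_arms, d) array of arm feature vectors for the current round
    A        : (d, d) regularized design matrix  lam * I + sum_s x_s x_s^T
    b        : (d,) response vector  sum_s r_s x_s
    t        : current round index (t >= 1)
    eps      : assumed uniform misspecification level (eps = 0 recovers the plain rule)
    """
    d = A.shape[0]
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b  # ridge estimate of the linear parameter
    # Standard self-normalized confidence radius ...
    beta = np.sqrt(lam) + np.sqrt(2.0 * np.log(1.0 / delta) + d * np.log(1.0 + t / (lam * d)))
    # ... plus an extra eps-dependent inflation (one common choice; the exact
    # form differs across papers and is an assumption here, not taken from [48]).
    beta += eps * np.sqrt(d * t)
    widths = np.sqrt(np.einsum("id,dk,ik->i", features, A_inv, features))  # x^T A^{-1} x per arm
    ucb = features @ theta_hat + beta * widths
    return int(np.argmax(ucb))

In use, A and b would be updated after each round (A += np.outer(x, x), b += r * x); with eps = 0 this reduces to a standard optimistic linear bandit rule.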
“…To achieve this goal, function approximation, which uses a class of predefined functions to approximate either the value function or transition dynamic, has been widely studied in recent years. Specifically, a series of recent works (Jiang et al, 2017; Jin et al, 2019; Modi et al, 2020; Zanette et al, 2020; Ayoub et al, 2020; Zhou et al, 2020) have studied RL with linear function approximation with provable guarantees. They show that with linear function approximation, one can either obtain a sublinear regret bound against the optimal value function (Jin et al, 2019; Zanette et al, 2020; Ayoub et al, 2020; Zhou et al, 2020) or a polynomial sample complexity bound (Kakade et al, 2003) (Probably Approximately Correct (PAC) bound for short) in finding a near-optimal policy (Jiang et al, 2017; Modi et al, 2020).…”
Section: Introduction (mentioning)
confidence: 99%
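To make the two flavors of linear function approximation mentioned in this excerpt concrete, the standard forms are sketched below (notation assumed here, not quoted from the cited works): value-based methods posit a linear action-value function, while linear (low-rank) MDP models posit linear structure in the transition dynamics and rewards themselves:

\[
Q_h(s,a) \;\approx\; \langle \phi(s,a), \theta_h\rangle,
\qquad
\mathbb{P}_h(s'\mid s,a) \;=\; \langle \phi(s,a), \mu_h(s')\rangle,
\quad
r_h(s,a) \;=\; \langle \phi(s,a), \eta_h\rangle.
\]

The abstract above notes that the low inherent Bellman error condition is strictly more general than this low-rank/linear MDP assumption.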