2021 | Preprint | DOI: 10.48550/arxiv.2102.04168

Provable Model-based Nonlinear Bandit and Reinforcement Learning: Shelve Optimism, Embrace Virtual Curvature

Abstract: This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima, because we show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward. For both nonlinear bandits and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOlin), which provably converges to a local maximum with sample complexity that only depend…
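To make the convergence target concrete, here is a minimal sketch of the standard notion of an approximate local maximum for a smooth reward; this is an assumed illustration, and the paper's precise definition may differ in norms and constants. For a reward r(θ) whose Hessian is ρ-Lipschitz, a point θ is an ε-approximate local maximum (a second-order stationary point for maximization) if

\[ \|\nabla r(\theta)\| \le \epsilon \quad\text{and}\quad \lambda_{\max}\!\bigl(\nabla^2 r(\theta)\bigr) \le \sqrt{\rho\,\epsilon}. \]

On this reading, the "virtual" in Virtual Ascent plausibly refers to taking ascent steps on the learned model's predicted reward rather than on the true reward, which is what makes the method model-based.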

Cited by 7 publications (13 citation statements: 0 supporting, 13 mentioning, 0 contrasting) | References 12 publications
“…Going beyond linear function approximation, systematic exploration strategies have been developed based on structural assumptions on the MDP, such as low Bellman rank (Jiang et al., 2017) and block MDPs (Du et al., 2019). These methods are either computationally intractable (Jiang et al., 2017; Sun et al., 2019; Ayoub et al., 2020; Zanette et al., 2020; Dong et al., 2021) or only oracle-efficient (Feng et al., 2020; Agarwal et al., 2020b). The recent work of Feng et al. (2021) provides a sample-efficient approach with non-linear policies; however, the algorithm requires maintaining the functional form of all prior policies.…”
Section: Related Work | mentioning | confidence: 99%
“…They also showed that the eluder dimension is small in several settings, including generalized linear models and LQR. However, as shown in [Dong et al., 2021], the eluder dimension can be exponentially large even with a single ReLU neuron, which suggests it faces difficulty in handling neural network cases. The eluder dimension is only known to give non-trivial bounds for linear function classes and monotone functions of linear function classes.…”
Section: Related Work | mentioning | confidence: 99%
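For context, the eluder dimension referenced in this excerpt is the complexity measure of Russo and Van Roy (2013). A sketch of the standard definition, paraphrased here rather than quoted from either paper: x is ε-dependent on x_1, …, x_k with respect to a class F if every pair f, f̃ ∈ F that nearly agrees on the prefix also nearly agrees on x,

\[ \sqrt{\sum_{i=1}^{k} \bigl(f(x_i) - \tilde{f}(x_i)\bigr)^2} \le \epsilon \;\Longrightarrow\; \bigl|f(x) - \tilde{f}(x)\bigr| \le \epsilon, \]

and ε-independent otherwise; the ε-eluder dimension of F is the length of the longest sequence in which every element is ε-independent of its predecessors. The quoted claim is that this quantity can be exponentially large even when F consists of single-ReLU-neuron functions, so regret bounds parameterized by the eluder dimension become vacuous for neural networks.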
“…These results rely on finite state spaces or exact linear approximations. Recently, sample-efficient algorithms under non-linear function approximation have been proposed [Wen and Van Roy, 2017; Dann et al., 2018; Du et al., 2019b; Dong et al., 2020; Wang et al., 2020a; Dong et al., 2021]. Those algorithms are based on Bellman rank [Jiang et al., 2017], eluder dimension [Russo and Van Roy, 2013b], neural tangent kernels [Jacot et al., 2018; Du et al., 2019a], or sequential Rademacher complexity [Rakhlin et al., 2015a,b].…”
Section: Introduction | mentioning | confidence: 99%
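The passage above lists sequential Rademacher complexity among the measures these algorithms are based on; a sketch of that quantity may help, using the standard definition of Rakhlin et al. (2015) rather than text from the citing paper. For a class F and a Z-valued tree z of depth n, where z_t maps a sign sequence ε_{1:t-1} ∈ {±1}^{t-1} to a point in Z,

\[ \mathfrak{R}^{\mathrm{seq}}_{n}(F) = \sup_{z}\, \mathbb{E}_{\epsilon}\!\left[ \sup_{f \in F} \frac{1}{n} \sum_{t=1}^{n} \epsilon_t\, f\bigl(z_t(\epsilon_{1:t-1})\bigr) \right]. \]

This is the online analogue of the classical Rademacher complexity: the supremum over trees captures data chosen adaptively by an adversary, which is the regime relevant to bandits and RL.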
“…Our lower bound construction is inspired by the similar construction in Theorem 5.1 of Dong et al. (2021).…”
Section: Diversity of Non-linear Function Classes | mentioning | confidence: 99%