This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximation. We propose to study convergence to approximate local maxima, because we show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward. For both nonlinear bandit and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL), which provably converges to a local maximum with sample complexity that depends only on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds for several concrete settings, such as linear bandit with a finite or sparse model class and two-layer neural net bandit. A key algorithmic insight is that optimism may lead to over-exploration even for a two-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return if the model can also reasonably predict the size of the gradient and Hessian of the real return.
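To make the "virtual ascent" idea concrete, the following is a minimal sketch, not the paper's ViOL algorithm: it ascends a model-predicted (virtual) return and stops once an approximate local maximum is reached, judged by the model's gradient and Hessian. The quadratic surrogate model, step size, and tolerances are all hypothetical choices for illustration.

```python
import numpy as np

def virtual_return(theta, A, b):
    """Model-predicted (virtual) return; here a toy quadratic surrogate."""
    return b @ theta - 0.5 * theta @ A @ theta

def grad(theta, A, b):
    """Gradient of the virtual return under the quadratic surrogate."""
    return b - A @ theta

def hessian(theta, A, b):
    """Hessian of the virtual return under the quadratic surrogate."""
    return -A

def ascend_to_local_max(A, b, theta0, lr=0.1, eps=1e-3, max_iters=1000):
    """Gradient ascent on the virtual return until an approximate local maximum:
    small gradient norm and an (approximately) negative-semidefinite Hessian."""
    theta = theta0.copy()
    for _ in range(max_iters):
        g = grad(theta, A, b)
        H = hessian(theta, A, b)
        if np.linalg.norm(g) <= eps and np.max(np.linalg.eigvalsh(H)) <= eps:
            break  # first- and second-order conditions approximately hold
        theta = theta + lr * g
    return theta

# Toy usage with a concave virtual return (A positive definite).
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)
b = rng.standard_normal(3)
theta_hat = ascend_to_local_max(A, b, np.zeros(3))
print(theta_hat, np.linalg.solve(A, b))  # the two should roughly agree
```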
In this paper, we consider the problem of online learning of Markov decision processes (MDPs) with very large state spaces. Under the assumptions of realizable function approximation and low Bellman rank, we develop an online learning algorithm that learns the optimal value function while at the same time achieving very low cumulative regret during the learning process. Our learning algorithm, Adaptive Value-function Elimination (AVE), is inspired by the policy elimination algorithm proposed in [1], known as OLIVE. One of our key technical contributions in AVE is to formulate the elimination steps in OLIVE as contextual bandit problems. This technique enables us to apply the active elimination and expert weighting methods from [2], instead of the random action exploration scheme used in the original OLIVE algorithm, for more efficient exploration and better control of the regret incurred in each policy elimination step. To the best of our knowledge, this is the first √n-regret result for reinforcement learning in stochastic MDPs with general value function approximation.
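As a rough illustration of the elimination step this abstract describes (and not the AVE or OLIVE algorithm itself), the sketch below keeps a set of candidate value functions and discards those whose average Bellman error on freshly collected transitions exceeds a threshold. The data format, candidate class, discount factor, and tolerance are all hypothetical.

```python
import numpy as np

def avg_bellman_error(f, transitions, gamma=0.99):
    """Average Bellman residual of candidate value function f on (s, a, r, s') data."""
    errs = [f(s) - (r + gamma * f(s_next)) for (s, _a, r, s_next) in transitions]
    return abs(np.mean(errs))

def eliminate(candidates, transitions, tol=0.1):
    """Keep only candidates that are (approximately) Bellman-consistent with the data."""
    return [f for f in candidates if avg_bellman_error(f, transitions) <= tol]

# Toy usage: linear value functions on a 1-D state, with made-up transitions.
candidates = [lambda s, w=w: w * s for w in np.linspace(0.0, 2.0, 11)]
transitions = [(s, 0, 0.5 * s, 0.5 * s) for s in np.linspace(0.1, 1.0, 20)]
survivors = eliminate(candidates, transitions)
print(len(candidates), "->", len(survivors))  # most inconsistent candidates are removed
```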