2019
DOI: 10.48550/arxiv.1901.00210
Preprint

Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

Abstract: Strong worst-case performance bounds for episodic reinforcement learning exist, but fortunately in practice RL algorithms perform much better than such bounds would predict. Algorithms and theory that provide strong problem-dependent bounds could help illuminate the key features of what makes an RL problem hard and reduce the barrier to using RL algorithms in practice. As a step towards this, we derive an algorithm and analysis for finite horizon discrete MDPs with state-of-the-art worst-case regret bounds and sub…
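The abstract refers to an optimism-based algorithm for finite-horizon discrete MDPs. As a rough illustration of that algorithm family only, and not of the paper's actual method or bonus construction, here is a minimal sketch of model-based value iteration with a generic Hoeffding-style exploration bonus; the environment interface (env.reset(), env.step()), the bonus constant, and all names are assumptions made for this example.

import numpy as np

def optimistic_episodic_vi(env, S, A, H, K, delta=0.05):
    # Sketch only: estimate an empirical model from counts, plan with an
    # optimism bonus, then act greedily w.r.t. the optimistic Q-values.
    n = np.zeros((S, A), dtype=int)          # visit counts per (s, a)
    n_sas = np.zeros((S, A, S), dtype=int)   # transition counts per (s, a, s')
    r_sum = np.zeros((S, A))                 # accumulated rewards per (s, a)

    for _ in range(K):
        # Empirical model; unvisited pairs fall back to a uniform guess.
        p_hat = np.where(n[..., None] > 0,
                         n_sas / np.maximum(n[..., None], 1),
                         1.0 / S)
        r_hat = r_sum / np.maximum(n, 1)

        # Generic Hoeffding-style bonus; the paper's bonus is tighter and
        # problem-dependent, this is only the optimistic template.
        bonus = H * np.sqrt(np.log(2 * S * A * H * K / delta) / np.maximum(n, 1))

        # Backward induction with optimistic Q-values clipped to [0, H].
        V = np.zeros((H + 1, S))
        Q = np.zeros((H, S, A))
        for h in reversed(range(H)):
            Q[h] = np.minimum(H, r_hat + bonus + p_hat @ V[h + 1])
            V[h] = Q[h].max(axis=1)

        # Roll out the greedy policy for one episode and update the counts.
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r, _ = env.step(a)   # assumed interface: (next state, reward, done)
            n[s, a] += 1
            n_sas[s, a, s_next] += 1
            r_sum[s, a] += r
            s = s_next

    return Q  # optimistic Q-values from the final planning step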

Cited by 15 publications (39 citation statements). References 3 publications.
“…Though there has been a fair amount of research on preference-based bandits (no state information), only very few works consider incorporating preference feedback into the reinforcement learning (RL) framework that optimizes the accumulated long-term reward of a suitably chosen reward function over a Markov decision process (Singh et al., 2002; Ng et al., 2006; Talebi and Maillard, 2018; Ortner, 2020; Zhang and Ji, 2019; Zanette and Brunskill, 2019). However, the classical RL setup assumes access to reward feedback for each state-action pair, which might be impractical in many real-world scenarios.…”
Section: Related Work (mentioning)
confidence: 99%
“…There has been a long line of research studying the Markov Decision Process (MDP), which can be viewed as a single-agent version of a Markov Game. Tabular MDPs have been studied thoroughly in recent years [7, 22, 14, 2, 60, 25, 64]. In particular, in the episodic setting, the minimax regret or sample complexity is achieved by both model-based [2] and model-free [64] methods, up to logarithmic factors.…”
Section: Introduction (mentioning)
confidence: 99%
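The quoted passage above notes that near-minimax performance in episodic tabular MDPs is achieved by both model-based and model-free methods. To complement the model-based sketch given after the abstract, the following is a minimal sketch of the model-free flavor, an optimistic Q-learning loop in the spirit of Jin et al. (2018); the step-size schedule, the bonus constant c, and the environment interface are assumptions for illustration, not the cited algorithms' exact specification.

import numpy as np

def optimistic_q_learning(env, S, A, H, K, c=1.0, delta=0.05):
    # Sketch only: incremental Q-updates with a count-based optimism bonus;
    # no transition model is ever estimated.
    Q = np.full((H, S, A), float(H))      # optimistic initialization
    V = np.zeros((H + 1, S))              # V[H] stays 0 (terminal step)
    n = np.zeros((H, S, A), dtype=int)    # visit counts per (step, s, a)
    iota = np.log(S * A * H * K / delta)  # logarithmic confidence term

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))
            s_next, r, _ = env.step(a)    # assumed interface: (next state, reward, done)

            n[h, s, a] += 1
            t = n[h, s, a]
            alpha = (H + 1) / (H + t)               # horizon-aware step size
            bonus = c * np.sqrt(H ** 3 * iota / t)  # Hoeffding-style optimism

            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(H, Q[h, s].max())
            s = s_next

    return Q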
“…Therefore, a natural extension is to devise an algorithm that can adapt to an unknown sparsity level. Moreover, it is well known that structural assumptions can lead to improved regret bounds, and previous works proposed algorithms whose regret depends on nontrivial structural properties of the problem (Maillard et al., 2014; Zanette & Brunskill, 2019; Foster et al., 2019). Thus, it is interesting to understand what structural properties (beyond sparsity) affect the budgeted performance and how to design algorithms that adapt to such properties.…”
Section: Summary and Discussion (mentioning)
confidence: 99%
“…While CBM-UCBVI clearly demonstrates the analysis techniques and insights from applying the CBM principle to RL, it is of interest to combine it with an algorithm with order-optimal regret bounds of $\sqrt{SAH^3 T}$ (e.g., Jin et al., 2018) when $B(t) = Ht$, that is, in the standard RL setting (notice that $T$ is the number of episodes and not the total number of time steps). We achieve this goal by performing a more refined analysis that uses tighter concentration results based on (Azar et al., 2017; Dann et al., 2019; Zanette & Brunskill, 2019). Indeed, doing so leads to tighter regret bounds by a $\sqrt{H}$ factor in the leading term (full details on the algorithm and proofs can be found in Appendix F).…”
Section: Reinforcement Learning (mentioning)
confidence: 99%
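The $\sqrt{H}$ improvement described in the quote above comes from replacing range-based (Hoeffding-style) confidence intervals with variance-based (Bernstein-style) ones, as in Azar et al. (2017), Dann et al. (2019), and Zanette & Brunskill (2019). The display below is only a schematic contrast, not the exact bonuses of the cited papers; $\iota$ stands for a logarithmic confidence term and $n_h(s,a)$ for the visit count of $(s,a)$ at step $h$.

\[
  b^{\mathrm{Hoeff}}_h(s,a) \;\approx\; H \sqrt{\frac{\iota}{n_h(s,a)}},
  \qquad
  b^{\mathrm{Bern}}_h(s,a) \;\approx\;
  \sqrt{\frac{\operatorname{Var}_{s' \sim \widehat{P}_h(\cdot \mid s,a)}\bigl[\widehat{V}_{h+1}(s')\bigr]\,\iota}{n_h(s,a)}}
  \;+\; \text{lower-order terms}.
\]

Roughly speaking, with values bounded in $[0, H]$, the law of total variance caps the per-episode sum of next-state value variances at $O(H^2)$ rather than the worst-case $H \cdot H^2$ suggested by the range alone, and this is where the $\sqrt{H}$ saving in the leading regret term comes from.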