2016
DOI: 10.48550/arxiv.1603.08661
Preprint

Regret Analysis of the Anytime Optimally Confident UCB Algorithm

Tor Lattimore

Abstract: I introduce and analyse an anytime version of the Optimally Confident UCB (OCUCB) algorithm designed for minimising the cumulative regret in finite-armed stochastic bandits with subgaussian noise. The new algorithm is simple, intuitive (in hindsight) and comes with the strongest finite-time regret guarantees for a horizon-free algorithm so far. I also show a finite-time lower bound that nearly matches the upper bound.
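
The setting in the abstract is a finite-armed stochastic bandit with subgaussian noise, scored by cumulative regret, where the algorithm is not told the horizon in advance. A generic anytime index policy can illustrate it. The Python sketch below is a minimal simulation under stated assumptions. It is not the OCUCB algorithm analysed in the paper. It uses a standard UCB-style index, mean_i + sqrt(2 log(t) / N_i(t)), and unit-variance Gaussian rewards, which are 1-subgaussian. The function names `ucb_index` and `run_anytime_ucb` are hypothetical.

```python
import math
import random


def ucb_index(mean: float, pulls: int, t: int) -> float:
    """Generic UCB-style index for 1-subgaussian rewards (not the OCUCB index)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def run_anytime_ucb(arm_means, steps, seed=0):
    """Simulate an index policy on a Gaussian bandit without using the horizon.

    The index depends only on the current round t and past observations,
    never on `steps`, so the run can be stopped at any round.
    Returns (pulls per arm, cumulative pseudo-regret after each round).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    best = max(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    regret = 0.0
    regret_curve = []
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1  # play each arm once so every estimate is defined
        else:
            arm = max(range(k), key=lambda i: ucb_index(sums[i] / pulls[i], pulls[i], t))
        # Unit-variance Gaussian noise (an assumption) is 1-subgaussian.
        reward = rng.gauss(arm_means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]  # pseudo-regret: add the gap of the chosen arm
        regret_curve.append(regret)
    return pulls, regret_curve


if __name__ == "__main__":
    pulls, curve = run_anytime_ucb([0.5, 0.4, 0.0], steps=2000)
    print("pulls per arm:", pulls)
    print("pseudo-regret after 2000 rounds:", round(curve[-1], 2))
```

The index never reads `steps`. As a result, a run truncated at round 500 follows the same trajectory as a 500-round run with the same seed. That is the sense of "anytime" (horizon-free) in this sketch.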

Cited by 2 publications (2 citation statements); references 4 publications.
“…Theoretically, the epistemic uncertainty enables to converge to zero in tabular [193] and linear MDPs [194], [195] according to the theoretical results. In general MDPs, as the agent learns more about the environment, the uncertainty that encourages exploration gradually decreases to zero, then the confidence set of the MDP posterior will contain the true MDP with a high probability [196], [197].…”
Section: B Open Problems (mentioning; confidence: 99%)
“…MOSS (Audibert and Bubeck, 2010) makes the confidence bound depend on the number of plays for each bandit by replacing log(t) with log(t/N_i(t)) in Eq. 4, and policies similar to MOSS include OCUCB (Lattimore, 2016) and UCB* (Garivier et al, 2016). UCB† (Lattimore, 2018) improves upon the previous ones significantly by designing a more advanced log function component.…”
Section: Related Work (mentioning; confidence: 99%)
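
The quoted passage describes replacing log(t) with log(t/N_i(t)) in an exploration bonus. The following minimal Python sketch compares the two terms. It assumes the base bonus (Eq. 4 of the citing paper, not reproduced here) has the common form sqrt(2 log(t) / N_i(t)) for 1-subgaussian rewards. It also clips the logarithm at zero so the bonus stays real; that clipping is our assumption. The function names are hypothetical. The published MOSS index differs in its constants and normalisation, so treat this only as an illustration of the log(t/N_i(t)) idea.

```python
import math


def ucb_bonus(t: int, pulls: int) -> float:
    """Exploration bonus using log(t): sqrt(2 log t / N_i(t))."""
    return math.sqrt(2.0 * math.log(t) / pulls)


def moss_style_bonus(t: int, pulls: int) -> float:
    """Same bonus with log(t) replaced by log(t / N_i(t)), clipped at zero (assumption)."""
    return math.sqrt(2.0 * max(math.log(t / pulls), 0.0) / pulls)


if __name__ == "__main__":
    t = 10_000
    for pulls in (1, 10, 100, 1_000, 10_000):
        print(f"N_i={pulls:>6}  log(t): {ucb_bonus(t, pulls):.4f}  "
              f"log(t/N_i): {moss_style_bonus(t, pulls):.4f}")
```

The two bonuses agree for an arm played once. The log(t/N_i(t)) version shrinks faster as an arm accumulates plays and vanishes once N_i(t) reaches t. Heavily sampled arms therefore receive less extra exploration than under a plain log(t) term.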