2013
DOI: 10.1214/13-aos1119

Kullback–Leibler upper confidence bounds for optimal sequential allocation

Abstract: We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins [J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1979) 148-177], based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empiric…
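To make the index described in the abstract concrete, here is a minimal sketch of the kl-UCB index for Bernoulli rewards (a one-parameter exponential family). The exploration function log(t) + c log log(t), the constant c and the bisection tolerance are illustrative assumptions for this sketch, not necessarily the exact choices analyzed in the paper.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, n_pulls, t, c=3.0, tol=1e-6):
    """Largest q >= mean such that n_pulls * d(mean, q) <= log(t) + c*log(log(t)),
    found by bisection, since d(mean, .) is increasing on [mean, 1]."""
    threshold = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n_pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```

At every round (after pulling each arm once), the policy plays the arm whose index is largest.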

Cited by 256 publications (316 citation statements)
References 30 publications
“…Notice that [81] and [37] also provided an algorithm with asymptotic guarantees (under more restrictive conditions). It is only in [54,85,39] that a finite-time analysis was derived for KL-based UCB algorithms, KL-UCB and K_inf-UCB, that achieve the asymptotic lower bounds of [81] and [37] respectively. Those algorithms make use of KL divergences in the definition of the UCBs and use the full empirical reward distribution (and not only the two first moments).…”
Section: Recent Improvements (mentioning)
confidence: 99%
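The statement above refers to index policies that use the whole empirical reward distribution rather than a mean/variance summary. As an illustration only, the sketch below computes K_inf(F_hat, mu), the smallest KL divergence from the empirical distribution F_hat to any distribution on [0, 1] with mean above mu, via its known dual representation (Honda and Takemura); the grid size and clipping constant are assumptions made for this sketch.

```python
import numpy as np

def k_inf(samples, mu):
    """K_inf(F_hat, mu) for rewards in [0, 1], computed through the dual form:
    max over lambda in [0, 1/(1-mu)] of E_{F_hat}[log(1 - lambda * (X - mu))]."""
    x = np.asarray(samples, dtype=float)
    if x.mean() >= mu:
        return 0.0  # F_hat already satisfies the mean constraint
    # Evaluate the concave dual objective on a grid of feasible lambda values.
    lambdas = np.linspace(0.0, (1.0 - 1e-9) / (1.0 - mu), 200)
    values = [np.mean(np.log1p(-lam * (x - mu))) for lam in lambdas]
    return max(values)
```

The corresponding upper confidence bound would then be the largest mu for which n * K_inf(F_hat, mu) stays below the exploration threshold, which can be found by bisection over mu since K_inf(F_hat, .) is nondecreasing above the empirical mean.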
“…In particular, the work of (May and Leslie 2011) and the work of (Granmo 2010) prove asymptotic convergence of Thompson sampling. The performance of bandit algorithms has also been studied in terms of the rate of growth of the regret (Lai and Robbins 1995), and recent bandit algorithms have been shown to match this lower bound (Cappé et al 2013), including Thompson sampling algorithms for Bernoulli bandits (Kaufmann et al 2012). Also, the work of (Chapelle and Li 2011) presents empirical results that show Thompson sampling is highly competitive, matching or outperforming popular methods such as UCB (Lai and Robbins 1995; Auer et al 2002).…”
Section: Discussion (mentioning)
confidence: 99%
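Since this statement discusses Thompson sampling for Bernoulli bandits, here is a minimal Beta-Bernoulli Thompson sampling sketch with a uniform Beta(1, 1) prior on each arm; the class and method names are illustrative and not taken from any of the cited implementations.

```python
import random

class BernoulliThompson:
    """Thompson sampling with a Beta(1, 1) prior on each arm's success probability."""
    def __init__(self, n_arms):
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def select_arm(self):
        # Draw one sample from each arm's posterior and play the arm with the largest draw.
        draws = [random.betavariate(1 + s, 1 + f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=lambda a: draws[a])

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```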
“…A good framework we use is called pymaBandits [15]. It already includes several learning policies like the Gittins index, the classical UCB policy and some variations of it, the MOSS policy and some others.…”
Section: Methods (mentioning)
confidence: 99%
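For reference, the "classical UCB policy" and the MOSS variant mentioned above compute indices of the following form. This is a generic sketch rather than the pymaBandits [15] implementation; the horizon and arm-count arguments for MOSS follow the standard formulation of Audibert and Bubeck (2009).

```python
import math

def ucb1_index(mean, n_pulls, t):
    """UCB1 index (Auer et al. 2002): empirical mean plus sqrt(2 log t / n_pulls)."""
    return mean + math.sqrt(2.0 * math.log(t) / n_pulls)

def moss_index(mean, n_pulls, horizon, n_arms):
    """MOSS index: the log term is log(horizon / (n_arms * n_pulls)), truncated at 0."""
    bonus = math.sqrt(max(math.log(horizon / (n_arms * n_pulls)), 0.0) / n_pulls)
    return mean + bonus
```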
“…The extension to multiple Bernoulli distributed experiments (MAB problem) corresponds to a situation where each move can be successful, but only one move has the highest success probability to win. An efficient implementation of the Bernoulli multi-armed bandit problem and several learning algorithms was done by Olivier Cappé et al [15]. We use this implementation to conduct our experiments generating the decision policies and reward processes.…”
Section: Multi-armed Bandits (mentioning)
confidence: 99%
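To make the experimental setup described above concrete, the following is a hypothetical simulation loop for a Bernoulli multi-armed bandit. It is not the implementation of [15]; the policy interface (select_arm/update) is an assumption chosen to match the Thompson sampling sketch earlier on this page.

```python
import random

def run_bernoulli_bandit(policy, probs, horizon, seed=0):
    """Simulate one run: `probs` are the arms' success probabilities,
    `policy` must expose select_arm() and update(arm, reward).
    Returns the cumulative pseudo-regret against the best arm."""
    rng = random.Random(seed)
    best = max(probs)
    regret = 0.0
    for _ in range(horizon):
        arm = policy.select_arm()
        reward = 1 if rng.random() < probs[arm] else 0
        policy.update(arm, reward)
        regret += best - probs[arm]
    return regret
```

For example, `run_bernoulli_bandit(BernoulliThompson(3), [0.1, 0.5, 0.7], 10000)` pairs the driver with the Thompson sampling sketch above.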