Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

Garivier, Aurélien; Ménard, Pierre; Stoltz, Gilles

doi:10.48550/arxiv.1602.07182

Cited by 9 publications

(19 citation statements)

References 0 publications


“…Perhaps the most important remaining problem for the subgaussian noise model is the question of lower bounds. Besides the asymptotic results by Lai and Robbins [1985] and Burnetas and Katehakis [1997] there has been some recent progress on finite-time lower bounds, both in the OCUCB paper and a recent article by Garivier et al [2016]. Some further progress is made in Appendix A, but still there are regimes where the bounds are not very precise.…”

Section: Discussion (mentioning)

confidence: 99%

Regret Analysis of the Anytime Optimally Confident UCB Algorithm

Lattimore

2016

Preprint


I introduce and analyse an anytime version of the Optimally Confident UCB (OCUCB) algorithm designed for minimising the cumulative regret in finite-armed stochastic bandits with subgaussian noise. The new algorithm is simple, intuitive (in hindsight) and comes with the strongest finite-time regret guarantees for a horizon-free algorithm so far. I also show a finite-time lower bound that nearly matches the upper bound.


“…using the fact that the binary KL-divergence satisfies kl(x, y) = kl(1 − x, 1 − y) as well as the inequality kl(x, y) ≥ x log(1/y) − log(2), proved by Garivier et al (2016). Now, using Markov inequality yields…”

Section: A3 Proof Of Lemma (mentioning)

confidence: 98%
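
The two properties of the binary KL-divergence quoted above can be checked numerically. The following is a minimal sketch, not code from either paper: the function names `kl_bernoulli` and `lower_bound` and the evaluation grid are assumptions made for illustration, and natural logarithms are assumed throughout.

```python
import math


def kl_bernoulli(x: float, y: float) -> float:
    """Binary KL divergence kl(x, y) between Bernoulli(x) and Bernoulli(y)."""

    def term(a: float, b: float) -> float:
        # Usual convention: 0 * log(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)

    return term(x, y) + term(1.0 - x, 1.0 - y)


def lower_bound(x: float, y: float) -> float:
    """Right-hand side of the quoted inequality: x log(1/y) - log(2)."""
    return x * math.log(1.0 / y) - math.log(2.0)


# Hypothetical grid of interior points 0.05, 0.10, ..., 0.95.
grid = [i / 20 for i in range(1, 20)]

# Symmetry: kl(x, y) = kl(1 - x, 1 - y), up to floating-point error.
symmetry_ok = all(
    math.isclose(kl_bernoulli(x, y), kl_bernoulli(1 - x, 1 - y), abs_tol=1e-12)
    for x in grid
    for y in grid
)

# Lower bound: kl(x, y) >= x log(1/y) - log(2).
bound_ok = all(
    kl_bernoulli(x, y) >= lower_bound(x, y) - 1e-12 for x in grid for y in grid
)

print(symmetry_ok, bound_ok)
```

A grid check of this kind is only a sanity test; the inequality itself follows because the binary entropy is at most log(2).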

“…λ), given a fixed algorithm. Using the exact same technique as Garivier et al (2016) (the contraction of entropy principle), one can establish that for any event A that is $\sigma(I_t)$-measurable, $\mathrm{KL}(\mathbb{P}_\mu^{I_t}, \mathbb{P}_\lambda^{I_t}) \ge \mathrm{kl}(\mathbb{P}_\mu(A), \mathbb{P}_\lambda(A))$.…”

Section: B2 Proof Of Lemma 12 (mentioning)

confidence: 99%
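
The contraction of entropy principle mentioned in this statement can be illustrated on a toy example. The sketch below is not the bandit-specific argument of either paper: it only checks, for two hypothetical distributions `p` and `q` on a three-point space, that the full KL-divergence dominates the binary KL-divergence between the probabilities of any event A. All names and numbers are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def kl_discrete(p, q):
    """KL divergence between two distributions on the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_bernoulli(x, y):
    """Binary KL divergence kl(x, y), for x and y strictly inside (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


# Hypothetical distributions standing in for the laws under two instances.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

full_kl = kl_discrete(p, q)

# Enumerate every proper, non-empty event A and compare both sides.
outcomes = range(len(p))
contraction_holds = True
for size in range(1, len(p)):
    for event in combinations(outcomes, size):
        p_a = sum(p[i] for i in event)
        q_a = sum(q[i] for i in event)
        if full_kl < kl_bernoulli(p_a, q_a) - 1e-12:
            contraction_holds = False

print(contraction_holds)
```

Restricting to proper, non-empty events keeps both probabilities strictly between 0 and 1, so no special handling of the 0 log 0 convention is needed in `kl_bernoulli`.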

“…The next step is to relate the complicated KL-divergence $\mathrm{KL}(\mathbb{P}_\mu^{I_t}, \mathbb{P}_\lambda^{I_t})$ to the number of arm selections. Proceeding similarly as Garivier et al (2016), one can write, using the chain rule for KL-divergence, that…”

Section: B2 Proof Of Lemma 12 (mentioning)

confidence: 99%

“…Now observe that conditionally to $I_{t-1}$, $U_t$, $Y_t$ and $C_t$ are independent, as once the selected arm is known, the value of the sensing $Y_t$ does not influence the other players selecting that arm, and $U_t$ is some exogenous randomness. Using further that the distribution of $U_t$ is the same under µ and λ, one obtains […] The first term in (25) can be rewritten using the same argument as Garivier et al (2016), that relies on the fact that conditionally to $I_{t-1}$, $Y_t$ is a Bernoulli distribution with mean $\mu_{A^j(t)}$ under the instance µ and $\lambda_{A^j(t)}$ under the instance λ:…”

Section: B2 Proof Of Lemma 12 (mentioning)

confidence: 99%


Multi-Player Bandits Revisited

Besson, Kaufmann

2017

Preprint


Multi-player Multi-Armed Bandits (MAB) have been extensively studied in the literature, motivated by applications to Cognitive Radio systems. Driven by such applications as well, we motivate the introduction of several levels of feedback for multi-player MAB algorithms. Most existing work assume that sensing information is available to the algorithm. Under this assumption, we improve the state-of-the-art lower bound for the regret of any decentralized algorithms and introduce two algorithms, RandTopM and MCTopM, that are shown to empirically outperform existing algorithms. Moreover, we provide strong theoretical guarantees for these algorithms, including a notion of asymptotic optimality in terms of the number of selections of bad arms. We then introduce a promising heuristic, called Selfish, that can operate without sensing information, which is crucial for emerging applications to Internet of Things networks. We investigate the empirical performance of this algorithm and provide some first theoretical elements for the understanding of its behavior.


Asymptotically optimal algorithms for budgeted multiple play bandits

2019


We study a generalization of the multi-armed bandit problem with multiple plays where there is a cost associated with pulling each arm and the agent has a budget at each time that dictates how much she can expect to spend. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our setting. We then study a variant of Thompson sampling for Bernoulli rewards and a variant of KL-UCB for both single-parameter exponential families and bounded, finitely supported rewards. We show these algorithms are asymptotically optimal, both in rate and leading problem-dependent constants, including in the thick margin setting where multiple arms fall on the decision boundary.

