2010
DOI: 10.1007/s10998-010-3055-6

UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem

Abstract: In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · (K log T)/∆, where ∆ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · (K log(T∆²))/∆.
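For context, the baseline referred to in the abstract is the UCB1 index policy of Auer et al. [4], which plays the arm maximizing the empirical mean plus an exploration bonus of sqrt(2 ln t / n_i). Below is a minimal illustrative sketch of that baseline on Bernoulli arms; the function names, the Bernoulli reward model, and the horizon are assumptions made for the example, and the modified algorithm analysed in this paper (which tightens the confidence bounds) is not reproduced here.

```python
import math
import random

def ucb1_select(counts, means, t):
    """Pick the arm with the highest UCB1 index at trial t.

    counts[i] -- number of times arm i has been pulled so far
    means[i]  -- empirical mean reward of arm i
    """
    # Play each arm once before using the index.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # UCB1 index: empirical mean plus an exploration bonus.
    return max(
        range(len(counts)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )

def run_ucb1(arm_probs, horizon=10_000, seed=0):
    """Run UCB1 on Bernoulli arms with success probabilities arm_probs."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts, means, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        i = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
        total += reward
    return total, counts

if __name__ == "__main__":
    reward, pulls = run_ucb1([0.5, 0.55, 0.6])
    print(reward, pulls)
```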

Cited by 238 publications (222 citation statements) · References 9 publications
“…Given K ≥ 2 routes and sequences r_{i,1}, r_{i,2}, … of unknown rewards associated with each route i = 1, …, K, at each trial t = 1, …, n, players select a route I_t and receive the associated reward r_{I_t,t}. Let r*_{i,t} be the best reward possible from route i on trial t (Auer & Ortner, 2010). The regret after n plays I_1, …, I_n is defined by…”
Section: Methods 1: Absolute Regret
confidence: 99%
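The excerpt above stops short of the formula. A standard way to complete it in this notation (this is the usual definition of absolute regret against the best single route in hindsight, not necessarily the citing paper's exact wording) is:

```latex
R_n \;=\; \max_{i \in \{1,\dots,K\}} \sum_{t=1}^{n} r_{i,t} \;-\; \sum_{t=1}^{n} r_{I_t,t}
```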
“…The player is simultaneously attempting to improve their estimate of the options and to exploit this knowledge to maximize their score. Understanding the theory and leveraging published work on N-arm bandits gives insight into available methods for understanding military decision-making (Audibert et al., 2007; Auer et al., 2002; Auer & Ortner, 2010).…”
Section: N-Arm Bandit Problem
confidence: 99%
“…All these models propose selection policies to minimize the agent's regret: the difference between the reward it obtained and how much it could have won if it had always pulled the best arm. All these policies, such as UCB, Poker, and ε-greedy [19,20], strike a compromise between pulling the arm with the best expected reward and pulling another arm in order to increase the agent's knowledge of the reward distributions (known as the exploration-exploitation compromise). In this paper, we propose to draw an analogy between both problems: the selection of agents evaluated by a reputation value and the selection of arms evaluated by an estimated reward function.…”
Section: Related Work
confidence: 99%
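To make the exploration-exploitation compromise described in this excerpt concrete, here is a minimal illustrative sketch of ε-greedy selection, one of the policies it names. The function names, the fixed ε parameter, and the Bernoulli reward model are assumptions made for the example; UCB-style index policies replace the random exploration step with a confidence bonus, as in the sketch after the abstract above.

```python
import random

def epsilon_greedy_select(counts, means, epsilon, rng):
    """With probability epsilon explore a uniformly random arm;
    otherwise exploit the arm with the highest empirical mean."""
    if rng.random() < epsilon or not any(counts):
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda i: means[i])

def run_epsilon_greedy(arm_probs, horizon=10_000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms with success probabilities arm_probs."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts, means, total = [0] * k, [0.0] * k, 0.0
    for _ in range(horizon):
        i = epsilon_greedy_select(counts, means, epsilon, rng)
        reward = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
        total += reward
    return total, counts

if __name__ == "__main__":
    reward, pulls = run_epsilon_greedy([0.5, 0.55, 0.6])
    print(reward, pulls)
```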
“…However, the learning setting is beyond the scope of this dissertation. Please refer to (Auer and Ortner, 2010; Kuleshov and Precup, 2014) for an overview of BP learning algorithms, and to (Drugan and Nowé, 2013; Yahyaa et al., 2014) for MOBP algorithms.…”
confidence: 99%