2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2011.5946273
The non-Bayesian restless multi-armed bandit: A case of near-logarithmic regret

Abstract: In the classic Bayesian restless multi-armed bandit (RMAB) problem, there are N arms, with rewards on all arms evolving at each time as Markov chains with known parameters. A player seeks to activate K ≥ 1 arms at each time in order to maximize the expected total reward obtained over multiple plays. RMAB is a challenging problem that is known to be PSPACE-hard in general. We consider in this work the even harder non-Bayesian RMAB, in which the parameters of the Markov chain are assumed to be unknown a priori. …
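As a point of reference for the setting described in the abstract, below is a minimal sketch of a restless-bandit environment: every arm's Markov chain evolves at every step whether or not the arm is played, and the player activates a set of K arms and collects their rewards. The class name, reward layout, and initial states are illustrative assumptions, not code or notation from the paper.

```python
import numpy as np

class RestlessBandit:
    """Toy restless multi-armed bandit: all arms' states evolve each step."""

    def __init__(self, transition_matrices, rewards, rng=None):
        self.P = transition_matrices          # one row-stochastic matrix per arm
        self.rewards = rewards                # rewards[arm][state]
        self.rng = rng or np.random.default_rng()
        self.states = [0] * len(self.P)       # arbitrary initial states (assumption)

    def step(self, activated):
        """Activate a set of arms, collect their rewards, then let every arm evolve."""
        payoff = sum(self.rewards[a][self.states[a]] for a in activated)
        for a, P in enumerate(self.P):        # restless: arms transition even when idle
            self.states[a] = self.rng.choice(len(P), p=P[self.states[a]])
        return payoff
```

A player policy would call `step` with its chosen set of K arms at each time and only observe the rewards (and states) of the activated arms.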

Cited by 58 publications (65 citation statements)
References 31 publications (45 reference statements)
“…The policy proposed in [10] also uses the index form of UCB-1 given in [5], but the structure is different from RUCB proposed in this paper. In [11], a stronger definition of regret is adopted, where regret is defined as reward loss with respect to the optimal performance in the ideal scenario of known reward model. However, the problem can only be solved for a special class of RMAB.…”
Section: Related Work (mentioning)
Confidence: 99%
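The "index form of UCB-1" referenced in the excerpt above is the standard sample-mean-plus-exploration-bonus statistic. A minimal sketch follows; the constant 2 is the usual UCB-1 choice, and the function and variable names are illustrative rather than taken from [5] or [10].

```python
import math

def ucb1_index(mean_reward, n_plays, t):
    """UCB-1 index: empirical mean plus a bonus that grows with total time t
    and shrinks with the number of plays of this arm."""
    return mean_reward + math.sqrt(2.0 * math.log(t) / n_plays)

# At each time t, the arm with the largest index is played, e.g.:
# best_arm = max(range(N), key=lambda a: ucb1_index(means[a], counts[a], t))
```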
“…Specifically, when arms are governed by stochastically identical two-state Markov chains, a policy was constructed in [11] to achieve a regret with an order arbitrarily close to logarithmic.…”
Section: Related Work (mentioning)
Confidence: 99%
“…In [4], it has been shown that this structure can be exploited to obtain an efficient online learning algorithm for the non-Bayesian version of the problem (where the underlying transition matrix is completely unknown). In particular, Dai et al [4] show that near logarithmic regret (defined as the difference between cumulative rewards obtained by a model-aware optimal-policy-implementing genie and that obtained by their policy) with respect to time can be achieved by mapping two particular policies to arms in a different multi-armed bandit. For the more general case of non-identical arms, there have been some recent results that show near-logarithmic weak regret (measured with respect to the best possible single-channel-selection policy, which need not be optimal) [5], [6], [9].…”
Section: Introduction (mentioning)
Confidence: 99%
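The (strong) regret notion in the excerpt above is simply the gap between the cumulative reward of a model-aware genie running the optimal policy and that of the learning policy. A small sketch, assuming per-step reward sequences are available for both; the function name is a placeholder.

```python
def strong_regret(genie_rewards, policy_rewards):
    """Regret up to each time t: cumulative reward of a model-aware genie
    playing the optimal policy minus the learning policy's cumulative reward."""
    regret, g_total, p_total = [], 0.0, 0.0
    for g, p in zip(genie_rewards, policy_rewards):
        g_total += g
        p_total += p
        regret.append(g_total - p_total)
    return regret
```

Weak regret, by contrast, would replace the genie's reward sequence with that of the best single-arm (single-channel) policy.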
“…The problem of dynamic channel selection has recently been formulated and studied by many researchers [1], [2], [3], [4], [5], [6] under the framework of multi-armed bandits (MAB) [7]. In these papers, the channels are typically modelled as independent Gilbert-Elliott channels (i.e., described by two-state Markov chains, with a bad state "0" and a good state "1").…”
Section: Introduction (mentioning)
Confidence: 99%
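For concreteness, here is a minimal sketch of the Gilbert-Elliott channel model mentioned above: a two-state Markov chain with a bad state 0 (no reward) and a good state 1 (unit reward). The parameter names p01 and p10 are illustrative assumptions.

```python
import numpy as np

def simulate_gilbert_elliott(p01, p10, horizon, rng=None, state=0):
    """Simulate one Gilbert-Elliott channel.
    p01 = P(bad -> good), p10 = P(good -> bad); returns the state sequence."""
    rng = rng or np.random.default_rng()
    states = []
    for _ in range(horizon):
        if state == 0:
            state = 1 if rng.random() < p01 else 0
        else:
            state = 0 if rng.random() < p10 else 1
        states.append(state)
    return states
```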
“…[10] proposes a policy based on a deterministic sequence of exploration and exploitation and achieves the same bounds for weak regret. In [11], the authors consider the notion of strong regret and propose a policy which achieves near-log T (strong) regret for some special cases of the restless model.…”
Section: Introduction (mentioning)
Confidence: 99%
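A deterministic exploration/exploitation sequence of the kind referenced above pre-schedules exploration slots from the time index alone, rather than randomizing. The sketch below is only an illustration of that idea under a logarithmic exploration budget; it is not the exact construction in [10], and the constant c is an assumed tuning parameter.

```python
import math

def is_exploration_slot(t, explored_so_far, c=1.0):
    """Deterministic schedule (illustrative): explore whenever the number of
    exploration slots used so far falls below c * log(t); otherwise exploit."""
    return explored_so_far < c * math.log(t + 1)

# In an exploration slot the policy samples arms round-robin to refine its
# estimates; in an exploitation slot it plays the arm(s) that look best so far.
```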