2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
DOI: 10.1109/globalsip.2017.8308614
Combinatorial multi-armed bandit problem with probabilistically triggered arms: A case with bounded regret

Abstract: In this paper, we study the combinatorial multi-armed bandit problem (CMAB) with probabilistically triggered arms (PTAs). Under the assumption that the arm triggering probabilities (ATPs) are positive for all arms, we prove that a class of upper confidence bound (UCB) policies, named Combinatorial UCB with exploration rate κ (CUCB-κ), and Combinatorial Thompson Sampling (CTS), which estimates the expected states of the arms via Thompson sampling, achieve bounded regret. In addition, we prove that CUCB-0 and CT…
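To make the two policies named in the abstract concrete, here is a minimal sketch for Bernoulli base arms. The `oracle` (a combinatorial optimizer that maps per-arm estimates to a super-arm) and `play` (an environment returning the arms actually triggered in a round, with their observed states) are hypothetical placeholders, and the standard CUCB-style confidence term is an assumption; this is an illustration under those assumptions, not the paper's implementation.

```python
import math
import random

def cucb_kappa(n_arms, oracle, play, horizon, kappa=1.0):
    """Sketch of CUCB-kappa; kappa = 0 recovers CUCB-0 (greedy on empirical means)."""
    counts = [0] * n_arms    # observations per base arm
    means = [0.0] * n_arms   # empirical means of the arm states
    for t in range(1, horizon + 1):
        ucbs = []
        for i in range(n_arms):
            if counts[i] == 0:
                ucbs.append(1.0)  # optimistic value for a never-observed arm
            else:
                bonus = kappa * math.sqrt(1.5 * math.log(t) / counts[i])
                ucbs.append(min(1.0, means[i] + bonus))
        super_arm = oracle(ucbs)           # combinatorial optimization step
        for i, state in play(super_arm):   # arms triggered in this round
            counts[i] += 1
            means[i] += (state - means[i]) / counts[i]

def cts(n_arms, oracle, play, horizon):
    """Sketch of CTS: replace the UCB index with a Beta posterior sample."""
    a = [1] * n_arms  # Beta prior successes
    b = [1] * n_arms  # Beta prior failures
    for _ in range(horizon):
        theta = [random.betavariate(a[i], b[i]) for i in range(n_arms)]
        super_arm = oracle(theta)
        for i, state in play(super_arm):
            a[i] += state        # arm states assumed Bernoulli in {0, 1}
            b[i] += 1 - state
```

With positive triggering probabilities, every base arm is observed infinitely often regardless of which super-arm is chosen, which is the mechanism behind the bounded-regret results quoted in the citation statements below.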

Cited by 5 publications (6 citation statements) · References 28 publications
“…As a result of this, the upper bound for the expected regret becomes independent of the time horizon T. We compare the result of Theorem 4 with [30], which shows a similar bound for CTS in the exact same setting. While the bound in [30] is of order O((1/p*)^4) with respect to p*, the bound in Theorem 4 is of order…”
Section: Theorem 4: Under Assumptions 1, 2 and 3, For All…
confidence: 88%
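For context, the "bounded regret" claim being compared here says the expected regret admits a horizon-independent constant bound. A minimal statement in LaTeX, where R(T) and C are generic notation introduced for illustration and the (1/p*)^4 scaling is taken from the quoted comparison:

```latex
\mathbb{E}\!\left[R(T)\right] \le C \quad \text{for all } T \ge 1,
\qquad \text{with } C = O\!\left((1/p^*)^4\right) \text{ for the bound in [30],}
```

where p* denotes the smallest arm triggering probability.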
“…It is also shown in this work that the dependence on 1/p* is unavoidable for the general case. In another work [30], CMAB-PTA is considered for the case when the arm triggering probabilities are all positive, and it is shown that both CUCB and CTS achieve bounded regret. However, their O((1/p*)^4) bound has a much worse dependence on…”
Section: Related Work
confidence: 99%
“…The cascade observation feedback resembles the independent cascade model in the context of influence maximization studies (Kempe, Kleinberg, and Tardos 2003; Chen, Lakshmanan, and Castillo 2013), but the goal is different: influence maximization aims at finding a set of k seeds that generates the largest expected cascade size, while our goal is to find the best action (arm) utilizing the cascade feedback. Influence maximization has been combined with online learning in several studies (Vaswani et al. 2015; Chen et al. 2016; Wen et al. 2017; Wang and Chen 2017; Saritaç and Tekin 2017), but again their goal is to maximize influence cascade size while using online learning to gradually learn edge probabilities.…”
Section: Related Work
confidence: 99%
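The independent cascade model referenced in the statement above can be sketched in a few lines. The graph encoding (a dict of directed-edge activation probabilities) is an assumed convenience for illustration, not any cited paper's API:

```python
import random

def independent_cascade(edge_prob, seeds):
    """Simulate one cascade. edge_prob: {(u, v): p} directed activation
    probabilities; seeds: initially active nodes. Returns the activated set."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # each newly activated node gets one chance per outgoing edge
            for (src, v), p in edge_prob.items():
                if src == u and v not in active and random.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active
```

Influence maximization seeks the k seeds maximizing the expected size of `active` (estimated by averaging repeated simulations); in the bandit formulation discussed here, each edge plays the role of a probabilistically triggered arm whose activation probability is learned online.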
“…Since there is no exploration-exploitation tradeoff in our problem, we are able to achieve bounded regret. Apart from our work, there are numerous other settings in which bounded regret is achieved: (i) the multi-armed bandit problem where the expected rewards of the arms are related to each other through a global parameter [31], [32], (ii) a specific class of MDPs in which each admissible policy selects every action with a positive probability [33], (iii) combinatorial multi-armed bandits with probabilistically triggered arms, where arm triggering probabilities are strictly positive [34]. A comparison of our work with the related works is given in Table I.…”
Section: B. Reinforcement Learning
confidence: 99%