2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
DOI: 10.1109/globalsip.2017.8308614
Combinatorial multi-armed bandit problem with probabilistically triggered arms: A case with bounded regret

Abstract: In this paper, we study the combinatorial multi-armed bandit problem (CMAB) with probabilistically triggered arms (PTAs). Under the assumption that the arm triggering probabilities (ATPs) are positive for all arms, we prove that a class of upper confidence bound (UCB) policies, named Combinatorial UCB with exploration rate κ (CUCB-κ), and Combinatorial Thompson Sampling (CTS), which estimates the expected states of the arms via Thompson sampling, achieve bounded regret. In addition, we prove that CUCB-0 and CT…
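To make the two policies named in the abstract concrete, here is a minimal sketch for Bernoulli base arms. The `oracle` (a combinatorial optimizer that maps per-arm estimates to a super-arm) and `play` (an environment returning the arms actually triggered in a round, with their observed states) are hypothetical placeholders, and the standard CUCB-style confidence term is an assumption; this is an illustration under those assumptions, not the paper's implementation.

```python
import math
import random

def cucb_kappa(n_arms, oracle, play, horizon, kappa=1.0):
    """Sketch of CUCB-kappa; kappa = 0 recovers CUCB-0 (greedy on empirical means)."""
    counts = [0] * n_arms    # observations per base arm
    means = [0.0] * n_arms   # empirical means of the arm states
    for t in range(1, horizon + 1):
        ucbs = []
        for i in range(n_arms):
            if counts[i] == 0:
                ucbs.append(1.0)  # optimistic value for a never-observed arm
            else:
                bonus = kappa * math.sqrt(1.5 * math.log(t) / counts[i])
                ucbs.append(min(1.0, means[i] + bonus))
        super_arm = oracle(ucbs)           # combinatorial optimization step
        for i, state in play(super_arm):   # arms triggered in this round
            counts[i] += 1
            means[i] += (state - means[i]) / counts[i]

def cts(n_arms, oracle, play, horizon):
    """Sketch of CTS: replace the UCB index with a Beta posterior sample."""
    a = [1] * n_arms  # Beta prior successes
    b = [1] * n_arms  # Beta prior failures
    for _ in range(horizon):
        theta = [random.betavariate(a[i], b[i]) for i in range(n_arms)]
        super_arm = oracle(theta)
        for i, state in play(super_arm):
            a[i] += state        # arm states assumed Bernoulli in {0, 1}
            b[i] += 1 - state
```

With positive triggering probabilities, every base arm is observed infinitely often regardless of which super-arm is chosen, which is the mechanism behind the bounded-regret results quoted in the citation statements below.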

Cited by 5 publications (6 citation statements) · References 28 publications
“…As a result of this, the upper bound for the expected regret becomes independent of the time horizon T. We compare the result of Theorem 4 with [30], which shows a similar bound for CTS in the exact same setting. While the bound in [30] is of order O((1/p*)^4) with respect to p*, the bound in Theorem 4 is of order…”
Section: Theorem 4: Under Assumptions 1, 2 and 3, For All…
confidence: 88%
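For context, the "bounded regret" claim being compared here says the expected regret admits a horizon-independent constant bound. A minimal statement in LaTeX, where R(T) and C are generic notation introduced for illustration and the (1/p*)^4 scaling is taken from the quoted comparison:

```latex
\mathbb{E}\!\left[R(T)\right] \le C \quad \text{for all } T \ge 1,
\qquad \text{with } C = O\!\left((1/p^*)^4\right) \text{ for the bound in [30],}
```

where p* denotes the smallest arm triggering probability.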
“…It is also shown in this work that the dependence on 1/p* is unavoidable for the general case. In another work [30], CMAB-PTA is considered for the case when the arm triggering probabilities are all positive, and it is shown that both CUCB and CTS achieve bounded regret. However, their O((1/p*)^4) bound has a much worse dependence on…”
Section: Related Work
confidence: 99%
“…The cascade observation feedback resembles the independent cascade model in the context of influence maximization studies (Kempe, Kleinberg, and Tardos 2003; Chen, Lakshmanan, and Castillo 2013), but the goal is different: influence maximization aims at finding a set of k seeds that generates the largest expected cascade size, while our goal is to find the best action (arm) utilizing the cascade feedback. Influence maximization has been combined with online learning in several studies (Vaswani et al. 2015; Chen et al. 2016; Wen et al. 2017; Wang and Chen 2017; Saritaç and Tekin 2017), but again their goal is to maximize influence cascade size while using online learning to gradually learn edge probabilities.…”
Section: Related Work
confidence: 99%
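The independent cascade model referenced in the statement above can be sketched in a few lines. The graph encoding (a dict of directed-edge activation probabilities) is an assumed convenience for illustration, not any cited paper's API:

```python
import random

def independent_cascade(edge_prob, seeds):
    """Simulate one cascade. edge_prob: {(u, v): p} directed activation
    probabilities; seeds: initially active nodes. Returns the activated set."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            # each newly activated node gets one chance per outgoing edge
            for (src, v), p in edge_prob.items():
                if src == u and v not in active and random.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active
```

Influence maximization seeks the k seeds maximizing the expected size of `active` (estimated by averaging repeated simulations); in the bandit formulation discussed here, each edge plays the role of a probabilistically triggered arm whose activation probability is learned online.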
“…Since there is no exploration-exploitation tradeoff in our problem, we are able to achieve bounded regret. Apart from our work, there are numerous other settings in which bounded regret is achieved: (i) the multi-armed bandit problem where the expected rewards of the arms are related to each other through a global parameter [31], [32], (ii) a specific class of MDPs in which each admissible policy selects every action with a positive probability [33], (iii) combinatorial multi-armed bandits with probabilistically triggered arms, where arm triggering probabilities are strictly positive [34]. A comparison of our work with the related works is given in Table I.…”
Section: B. Reinforcement Learning
confidence: 99%