In this paper, we study the combinatorial multi-armed bandit problem (CMAB) with probabilistically triggered arms (PTAs). Under the assumption that the arm triggering probabilities (ATPs) are positive for all arms, we prove that a class of upper confidence bound (UCB) policies, named Combinatorial UCB with exploration rate κ (CUCB-κ), and Combinatorial Thompson Sampling (CTS), which estimates the expected states of the arms via Thompson sampling, achieve bounded regret. In addition, we prove that CUCB-0 and CTS incur O(√T) gap-independent regret. These results improve upon previous works, which show O(log T) gap-dependent and O(√(T log T)) gap-independent regret bounds, respectively, under no assumptions on the ATPs. We then numerically evaluate the performance of CUCB-κ and CTS on a real-world movie recommendation problem, where the actions correspond to recommending a set of movies, the arms correspond to the edges between the movies and the users, and the goal is to maximize the total number of users that are attracted by at least one movie. Our numerical results complement our theoretical findings on bounded regret. Beyond this problem, our results also directly apply to the online influence maximization (OIM) problem studied in numerous prior works.
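To give a rough sense of how the two policy families described above operate, the Python sketch below forms per-arm estimates each round and hands them to an offline oracle. It is a minimal sketch, assuming Bernoulli arm states, a generic `oracle` function, the standard CUCB-style index μ̂ + κ·√(3 ln t / (2N)), and Beta posteriors for Thompson sampling; all function names are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def cucb_kappa_indices(counts, means, t, kappa):
    """CUCB-kappa style indices: empirical mean plus a kappa-scaled
    confidence term (kappa = 0 reduces to the plain empirical mean)."""
    return [
        m + kappa * math.sqrt(3.0 * math.log(t) / (2.0 * max(n, 1)))
        for n, m in zip(counts, means)
    ]

def cts_samples(successes, failures):
    """Combinatorial Thompson sampling: draw one posterior sample per arm
    from a Beta posterior over the arm's expected (Bernoulli) state."""
    return [random.betavariate(a + 1, b + 1) for a, b in zip(successes, failures)]

def play_round(oracle, counts, means, successes, failures, t, kappa=None):
    """One round: build per-arm estimates, then let the offline oracle pick
    an action (e.g., a set of movies whose edges are the triggered arms).
    Feedback collection and estimate updates happen outside this sketch."""
    if kappa is None:   # Thompson-sampling variant (CTS)
        estimates = cts_samples(successes, failures)
    else:               # UCB variant (CUCB-kappa)
        estimates = cucb_kappa_indices(counts, means, t, kappa)
    return oracle(estimates)
```

Separating estimate construction from the oracle call mirrors the usual CMAB template: the learner only supplies optimistic or sampled arm estimates, and any offline approximation oracle can consume them unchanged.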
In the Online Contextual Influence Maximization Problem with Costly Observations, the learner faces a series of epochs, in each of which a different influence spread process takes place over a network. At the beginning of each epoch, the learner exogenously influences (activates) a set of seed nodes in the network. Then, the influence spread process takes place over the network, through which other nodes get influenced. The learner has the option to observe the spread of influence by paying an observation cost. The goal of the learner is to maximize its cumulative reward, which is defined as the expected total number of influenced nodes over all epochs minus the observation costs. We depart from the prior work in three aspects: (i) the learner does not know how the influence spreads over the network, i.e., it is unaware of the influence probabilities; (ii) influence probabilities depend on the context; (iii) observing influence is costly. We consider two different influence observation settings: costly edge-level feedback, in which the learner freely observes the set of influenced nodes but pays to observe the influence outcomes on the edges of the network; and costly node-level feedback, in which the learner pays to observe whether a node is influenced or not. Since the offline influence maximization problem itself is NP-hard, for these settings we develop online learning algorithms that use an approximation algorithm as a subroutine to obtain the set of seed nodes in each epoch. When the influence probabilities are Hölder continuous functions of the context, we prove that these algorithms achieve sublinear regret (for any sequence of contexts) with respect to an approximation oracle that knows the influence probabilities for all contexts. Our numerical results on several networks illustrate that the proposed algorithms perform on par with the state-of-the-art methods even when the observations are cost-free. A preliminary version of this work appeared in Allerton 2016 [1].
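To make the epoch structure concrete, the following is a minimal Python sketch of one epoch under costly edge-level feedback, using a simple greedy seed-selection oracle and an independent-cascade style spread. Contexts and node-level feedback are omitted, and the names (`greedy_seeds`, `cascade`, `run_epoch`), the one-hop greedy rule, and the running-mean update are illustrative assumptions rather than the algorithms analyzed in the paper.

```python
import random

def greedy_seeds(neighbors, p_hat, k):
    """Illustrative greedy oracle: repeatedly add the node whose estimated
    one-hop spread over not-yet-covered nodes is largest."""
    seeds, covered = set(), set()
    for _ in range(k):
        candidates = [u for u in neighbors if u not in seeds]
        if not candidates:
            break
        best = max(candidates, key=lambda u: sum(
            p_hat.get((u, v), 0.0) for v in neighbors[u] if v not in covered))
        seeds.add(best)
        covered |= set(neighbors[best])
    return seeds

def cascade(neighbors, p_true, seeds):
    """Independent-cascade style spread; returns the influenced nodes and
    the outcome of every attempted edge (the costly edge-level feedback)."""
    influenced, frontier, outcomes = set(seeds), list(seeds), {}
    while frontier:
        u = frontier.pop()
        for v in neighbors[u]:
            if v in influenced or (u, v) in outcomes:
                continue
            outcomes[(u, v)] = random.random() < p_true.get((u, v), 0.0)
            if outcomes[(u, v)]:
                influenced.add(v)
                frontier.append(v)
    return influenced, outcomes

def run_epoch(neighbors, p_true, p_hat, counts, k, observe, cost):
    """One epoch: seed via the oracle on current estimates, let the spread
    run, and optionally pay `cost` to observe edge outcomes and refine
    the influence-probability estimates."""
    seeds = greedy_seeds(neighbors, p_hat, k)
    influenced, outcomes = cascade(neighbors, p_true, seeds)
    reward = len(influenced)
    if observe:
        reward -= cost                              # pay once per observation
        for e, success in outcomes.items():
            counts[e] = counts.get(e, 0) + 1        # running mean per edge
            p_hat[e] = p_hat.get(e, 0.0) + (success - p_hat.get(e, 0.0)) / counts[e]
    return reward
```

The sketch highlights the key trade-off in the abstract: skipping the observation saves the cost in the current epoch but leaves the estimates, and hence future seed choices, unchanged.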