2019
DOI: 10.48550/arxiv.1905.03125
Preprint

Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem

Abstract: We consider the combinatorial multi-armed bandit (CMAB) problem, where the reward function is nonlinear. In this setting, the agent chooses a batch of arms on each round and receives feedback from each arm of the batch. The reward that the agent aims to maximize is a function of the selected arms and their expectations. In many applications, the reward function is highly nonlinear, and the performance of existing algorithms relies on a global Lipschitz constant to encapsulate the function's nonlinearity. This …
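As a rough illustration of this setting (a sketch, not the paper's algorithm), the snippet below plays one CMAB round: the agent picks a batch of arms, observes per-arm semi-bandit feedback, and the reward is a nonlinear function of the selected arms' means. The batch size, arm means, and coverage-style reward function are all hypothetical choices for concreteness.

```python
import numpy as np

# Illustrative sketch of one CMAB round with semi-bandit feedback.
# All concrete values below are hypothetical, not the paper's construction.

rng = np.random.default_rng(0)
K, m = 10, 3                       # K base arms, batches of size m
mu = rng.uniform(0.1, 0.9, K)      # unknown Bernoulli arm means

def reward(p):
    """Example nonlinear reward: probability that at least one
    selected arm fires (a coverage-style objective)."""
    return 1.0 - np.prod(1.0 - p)

batch = rng.choice(K, size=m, replace=False)  # agent picks a batch of arms
feedback = rng.random(m) < mu[batch]          # semi-bandit: one sample per chosen arm
print(feedback, reward(mu[batch]))            # expected reward depends on the means
```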

Cited by 3 publications (4 citation statements) | References 13 publications
“…Recently, there have been interesting results under Gini-weighted smoothness assumptions (Merlis & Mannor, 2019; 2020). Compared with the general Lipschitz smoothness considered in this work, this is a more refined smoothness assumption, which leads to near-optimal regret bounds with less dependence on the dimension K. Directly applying our algorithms to this setting would lead to an additional dependence on K. How to remove this additional price for privacy preservation, and how to prove the corresponding lower bounds, are interesting problems for future work.…”
Section: Discussion
confidence: 94%
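For intuition, and only as a paraphrase from memory rather than the exact definition in Merlis & Mannor (2019), Gini-weighted smoothness can be thought of as bounding a gradient norm weighted by the Bernoulli variances p_k(1−p_k), in place of a single global Lipschitz constant M:

\[
\|\nabla f(p)\|_\infty \le M \quad \text{(global Lipschitz)}
\qquad \text{vs.} \qquad
\sum_{k=1}^{K} p_k\,(1-p_k)\left(\frac{\partial f(p)}{\partial p_k}\right)^2 \le \gamma_g^2 \quad \text{(Gini-weighted, paraphrased)}
\]

Under this reading, arms whose means sit near 0 or 1 have small Bernoulli variance and contribute little to the weighted norm, which is what permits regret bounds with less dependence on K than a worst-case Lipschitz analysis.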
“…We will now review the combinatorial action space literature in multi-armed bandit (MAB) problems. Most of the work in this space deals with semi-bandit feedback (Chen et al., 2016; Combes et al., 2015; Kveton et al., 2015; Merlis and Mannor, 2019). This is also our feedback model, but we work in a contextual setting.…”
Section: Related Work
confidence: 99%
“…There is also work in the full-bandit feedback setting, where one gets to observe only one representative reward for the whole set of arms chosen. This body of literature can be divided into the adversarial setting (Merlis and Mannor, 2019; Cesa-Bianchi and Lugosi, 2012) and the stochastic setting (Agarwal and Aggarwal, 2018; Lin et al., 2014; Rejwan and Mansour, 2020).…”
Section: Related Work
confidence: 99%
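To make the feedback-model distinction concrete (a hypothetical sketch, not any cited paper's protocol): under semi-bandit feedback the learner observes one outcome per chosen arm, while under full-bandit feedback it observes only a single aggregate reward for the whole batch.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.2, 0.5, 0.8, 0.4])   # hypothetical Bernoulli arm means
batch = np.array([0, 2, 3])           # arms chosen this round

outcomes = rng.random(batch.size) < mu[batch]

semi_bandit_feedback = outcomes        # one observation per chosen arm
full_bandit_feedback = outcomes.max()  # only one aggregate reward (e.g., "did any arm fire")
```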
“…The combinatorial bandit is a classical problem in machine learning and has been extensively studied under the settings of stochastic semi-bandits (Chen et al., 2013; Combes et al., 2015; Kveton et al., 2015; Chen et al., 2016a; Merlis & Mannor, 2019), stochastic bandits (Agarwal & Aggarwal, 2018; Rejwan & Mansour, 2020; Kuroki et al., 2020), and adversarial linear bandits (Cesa-Bianchi & Lugosi, 2012; Bubeck et al., 2012; Audibert et al., 2014; Combes et al., 2015). In the above-mentioned works, either the reward link function g(·) is linear, or the model is stochastic (stationary).…”
Section: Other Related Work
confidence: 99%