2017
DOI: 10.1609/aaai.v31i1.10939

Efficient Ordered Combinatorial Semi-Bandits for Whole-Page Recommendation

Abstract: The Multi-Armed Bandit (MAB) framework has been successfully applied in many web applications. However, many complex real-world applications that involve multiple content recommendations cannot fit into the traditional MAB setting. To address this issue, we consider an ordered combinatorial semi-bandit problem where the learner recommends S actions from a base set of K actions and displays the results in S (out of M) different positions. The aim is to maximize the cumulative reward with respect to the best possib…
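
As a rough illustration of the setting described in the abstract, the sketch below simulates one round of an ordered combinatorial semi-bandit with Thompson Sampling: Beta posteriors over per-action click rates, a position-based discount, and a separate click observation for every displayed slot. The Bernoulli click model, the position weights, and the greedy top-S assignment are illustrative assumptions, not the paper's actual algorithm.

import numpy as np

rng = np.random.default_rng(0)

K, M, S = 20, 5, 5                      # base actions, positions, slots to fill (S <= M, S <= K)
pos_weight = np.linspace(1.0, 0.2, M)   # assumed position-based click discount

# Beta posteriors over per-action click probabilities (Bernoulli rewards assumed)
alpha = np.ones(K)
beta = np.ones(K)

def recommend():
    """Sample click probabilities and assign the S best actions to the S most prominent positions."""
    theta = rng.beta(alpha, beta)        # Thompson sample per action
    order = np.argsort(-theta)[:S]       # top-S sampled actions
    slots = np.argsort(-pos_weight)[:S]  # S most prominent positions
    return list(zip(order, slots))       # (action, position) pairs

def update(placements, clicks):
    """Semi-bandit feedback: one click observation per displayed action."""
    for (a, _), c in zip(placements, clicks):
        alpha[a] += c
        beta[a] += 1 - c

# one simulated round against unknown true click rates
true_p = rng.uniform(0.05, 0.6, K)
placements = recommend()
clicks = [rng.random() < true_p[a] * pos_weight[m] for a, m in placements]
update(placements, clicks)

Because the feedback is semi-bandit, every displayed action contributes its own observation to the posterior, rather than a single scalar reward for the whole page.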

Cited by 9 publications (6 citation statements)
References 13 publications

“…These algorithms maintain and iteratively update a distribution for the rewards from each arm, adjusting based on the observed outcomes in that arm. In some cases, especially when we know the exact distributions of rewards, TS-type algorithms tend to alleviate the influence of delayed feedback by randomizing over actions, and thus will have relatively better performance than other types of algorithms [13,49]. Therefore, we also consider TS-type algorithms in our study.…”
Section: Thompson Sampling Approach
confidence: 99%
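
To make the delayed-feedback point concrete, here is a minimal Beta-Bernoulli Thompson Sampling loop in which every reward arrives only after a fixed delay; the arm count, horizon, and delay are made-up values for illustration. Because each action is drawn from a posterior sample rather than a point estimate, the learner keeps spreading its pulls even while the posterior is stale.

import numpy as np

rng = np.random.default_rng(1)
K, T, delay = 10, 2000, 50           # arms, rounds, feedback delay (assumed values)
true_p = rng.uniform(0.1, 0.9, K)

alpha, beta = np.ones(K), np.ones(K)
pending = []                          # (arrival_round, arm, reward) not yet observed

for t in range(T):
    # Thompson Sampling randomizes over actions via posterior sampling,
    # so repeated pulls of one arm under delayed feedback are less likely.
    theta = rng.beta(alpha, beta)
    a = int(np.argmax(theta))
    r = int(rng.random() < true_p[a])
    pending.append((t + delay, a, r))

    # apply only the feedback whose delay has elapsed
    ready = [x for x in pending if x[0] <= t]
    pending = [x for x in pending if x[0] > t]
    for _, arm, reward in ready:
        alpha[arm] += reward
        beta[arm] += 1 - reward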
“…Instead of highly complex neuron-based NAS, we shed light on searching the self-attention space and limit the maximum search space by first obtaining the number of layers needed for a sample, and then moving forward to prune heads layer-by-layer to further reduce the network size. In order to determine the configurations of layers and heads for each sample in a real-time fashion, we integrate Upper Confidence Bound multi-arm contextual bandits (UCB) (Li et al., 2010) and Thompson Sampling semi-bandits (TSB) (Wang et al., 2017) into our DHT model, where the UCBs are responsible for determining the number of layers as well as the number of heads to keep in each layer, and the TSB determines the combination of heads to keep in each layer.…”
Section: Introduction
confidence: 99%
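
The quoted passage gives no implementation details, so the following is only a guessed-at sketch of the UCB side of such a scheme: a plain UCB1 rule choosing how many layers to keep, with a placeholder random reward standing in for whatever accuracy or latency signal the DHT model would actually use.

import numpy as np

rng = np.random.default_rng(3)

# UCB1 over the number of transformer layers to keep (illustrative arm set)
layer_choices = [2, 4, 6, 8, 10, 12]
counts = np.zeros(len(layer_choices))
totals = np.zeros(len(layer_choices))

def pick_layers(t):
    """UCB1 selection of a layer count; untried arms are taken first."""
    if np.any(counts == 0):
        return int(np.argmin(counts))
    ucb = totals / counts + np.sqrt(2 * np.log(t) / counts)
    return int(np.argmax(ucb))

for t in range(1, 200):
    i = pick_layers(t)
    # placeholder reward; in practice this would reflect the pruned model's quality
    reward = rng.random()
    counts[i] += 1
    totals[i] += reward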
“…The CCS problem includes the linear bandit (LB) problem (Abbasi-Yadkori, Pál, and Szepesvári 2011; Agrawal and Goyal 2013; Auer 2002; Chu et al. 2011; Dani, Hayes, and Kakade 2008) and the combinatorial semi-bandit (CS) problem (Chen et al. 2016a,b; Combes et al. 2015; Gai, Krishnamachari, and Jain 2012; Kveton et al. 2015; Wang et al. 2017; Wen, Kveton, and Ashkan 2015) as special cases. The difference from the LB problem is that, in the CCS problem, the learner chooses multiple arms at once.…”
Section: Introduction
confidence: 99%
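
To illustrate the relationship the quote draws between the LB and CS special cases, the sketch below runs one round of a linear combinatorial semi-bandit: rewards are linear in known arm features (the LB side), but S arms are chosen at once and each returns its own noisy reward (the CS side). The feature dimensions, the ridge/UCB estimator, and the top-S oracle are all assumptions made for the example.

import numpy as np

rng = np.random.default_rng(2)
d, K, S = 8, 30, 4                      # feature dim, arms, arms chosen per round (assumed)
X = rng.normal(size=(K, d))             # one feature vector per arm
theta_star = rng.normal(size=d)         # unknown parameter shared by all arms

# ridge-regression statistics, as in LinUCB-style estimators
A = np.eye(d)
b = np.zeros(d)

def choose_super_arm(width=1.0):
    """Pick S arms at once: a CS-style choice on top of an LB-style linear estimate."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    ucb = X @ theta_hat + width * np.sqrt(np.einsum('kd,de,ke->k', X, A_inv, X))
    return np.argsort(-ucb)[:S]

# one round: semi-bandit feedback gives a separate noisy reward for every chosen arm
chosen = choose_super_arm()
for k in chosen:
    r = X[k] @ theta_star + 0.1 * rng.normal()
    A += np.outer(X[k], X[k])
    b += r * X[k]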
“…These differences enable the CCS problem to model more realistic situations in applications such as routing networks (Kveton et al. 2014), shortest paths (Gai, Krishnamachari, and Jain 2012; Wen, Kveton, and Ashkan 2015), and recommender systems (Li et al. 2010; Qin, Chen, and Zhu 2014; Wang et al. 2017). For example, when a recommender system is modeled with the LB problem, it is assumed that once a recommendation result is obtained, the internal predictive model is updated before the next recommendation.…”
Section: Introduction
confidence: 99%