2022
DOI: 10.48550/arxiv.2203.10214
Preprint

Thompson Sampling on Asymmetric $α$-Stable Bandits

Abstract: In algorithm optimization for reinforcement learning, how to deal with the exploration-exploitation dilemma is particularly important. The multi-armed bandit problem can optimize proposed solutions by changing the reward distribution to realize a dynamic balance between exploration and exploitation. Thompson Sampling is a common method for solving the multi-armed bandit problem and has been used to explore data that conform to various laws. In this paper, we consider the Thompson Sampling approach for multi-armed…
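To make the setting concrete, here is a minimal sketch (not the paper's algorithm) of Thompson Sampling run against arms whose rewards follow asymmetric α-stable distributions. The Gaussian posterior surrogate, the clipping of extreme rewards, and all numeric parameters are assumptions made for illustration; rewards are drawn with scipy's levy_stable.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

# Illustrative arm parameters (assumptions, not taken from the paper):
# each arm pays an asymmetric alpha-stable reward with a different location.
alpha, beta = 1.8, 0.5            # stability and skewness (asymmetric when beta != 0)
arm_locs = [0.0, 0.3, 0.6]        # location parameter of each arm
K, T = len(arm_locs), 2000

# Gaussian posterior surrogate over each arm's mean (a common TS heuristic;
# the paper's posterior for alpha-stable rewards may be different).
post_mean = np.zeros(K)
post_count = np.ones(K)

for t in range(T):
    # Thompson Sampling: sample a mean for every arm from its posterior, play the argmax.
    samples = rng.normal(post_mean, 1.0 / np.sqrt(post_count))
    k = int(np.argmax(samples))

    # Draw an asymmetric alpha-stable reward for the chosen arm.
    r = levy_stable.rvs(alpha, beta, loc=arm_locs[k], scale=1.0, random_state=rng)
    # Clip extreme draws so the running mean stays stable in practice (a heuristic).
    r = float(np.clip(r, -10.0, 10.0))

    # Incremental update of the posterior mean for the played arm.
    post_count[k] += 1
    post_mean[k] += (r - post_mean[k]) / post_count[k]

print("estimated arm means:", np.round(post_mean, 3))
```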

Cited by 1 publication (3 citation statements)
References 10 publications

“…In UCB-type algorithms, the confidence level δ is set to 4/T², maintaining consistency. The prior parameters and tuning parameters for both TS-type algorithms are configured in accordance with the recommendations provided in [23,39] for the MOTS algorithm and CMS generation. The simulation results with different sizes of p = max_{k ∈ [K]} p_k are shown in Figure 1, Figure 2, and Figure 3.…”
Section: Simulation Results (mentioning)
confidence: 99%
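For context on the quoted setup, the short sketch below shows how a confidence level δ = 4/T² typically enters a UCB-style index. The Hoeffding-style bonus is an illustrative choice; the citing paper's exact index (e.g. a robust, heavy-tailed variant) may differ.

```python
import numpy as np

def ucb_index(mean_hat, n_pulls, T):
    """UCB index using confidence level delta = 4 / T**2, as in the quoted setup.
    The Hoeffding-style exploration bonus is an assumption for illustration."""
    delta = 4.0 / T**2
    bonus = np.sqrt(np.log(1.0 / delta) / (2.0 * np.maximum(n_pulls, 1)))
    return mean_hat + bonus

# Example: indices for three arms after some pulls, over a horizon of T = 1000.
print(ucb_index(np.array([0.4, 0.5, 0.2]), np.array([10, 25, 5]), T=1000))
```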
“…where the function g(·, ϵ) is defined in Lemma 2, and U^µ_k = µ_{A_t} + M^{1/(1+ϵ)} (32 log t / c_k(t))^{ϵ/(1+ϵ)}; end. 'Chambers-Mallows-Stuck (CMS) Generation' is used to rescale the non-zero part to a sub-Gaussian tail. Further details on this can be found in [50,17,39]. Diverging from the standard TS algorithm for Gaussian rewards, we use a clipped Gaussian distribution cl N(µ, σ²; ϑ) := max{N(µ, σ²), ϑ} as the posterior for the non-zero sub-Gaussian part X.…”
Section: Thompson Sampling Approach (mentioning)
confidence: 99%
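The Chambers-Mallows-Stuck (CMS) generator and the clipped Gaussian cl N(µ, σ²; ϑ) mentioned above can be sketched as follows. This is an illustrative implementation of the standard CMS recipe (α ≠ 1 case) and of sampling max{N(µ, σ²), ϑ}, not the citing paper's exact procedure; all parameter values are assumptions.

```python
import numpy as np

def cms_stable(alpha, beta, size, rng):
    """Chambers-Mallows-Stuck generator for standard alpha-stable variates
    (alpha != 1 case only, for brevity); a textbook implementation."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)      # uniform angle
    W = rng.exponential(1.0, size)                    # unit exponential
    B = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
    S = (1.0 + beta**2 * np.tan(np.pi * alpha / 2) ** 2) ** (1.0 / (2.0 * alpha))
    return (S * np.sin(alpha * (V + B)) / np.cos(V) ** (1.0 / alpha)
            * (np.cos(V - alpha * (V + B)) / W) ** ((1.0 - alpha) / alpha))

def clipped_gaussian(mu, sigma, lower, rng):
    """Sample from cl N(mu, sigma^2; lower) := max(N(mu, sigma^2), lower),
    the clipped-Gaussian posterior mentioned in the quoted statement."""
    return np.maximum(rng.normal(mu, sigma), lower)

rng = np.random.default_rng(1)
print(cms_stable(alpha=1.5, beta=0.7, size=5, rng=rng))        # illustrative parameters
print(clipped_gaussian(mu=0.2, sigma=0.1, lower=0.0, rng=rng))
```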