2019
DOI: 10.48550/arxiv.1910.01161
Preprint
Stochastic Bandits with Delayed Composite Anonymous Feedback

Abstract: We explore a novel setting of the Multi-Armed Bandit (MAB) problem inspired by real-world applications, which we call bandits with "stochastic delayed composite anonymous feedback (SDCAF)". In SDCAF, the rewards on pulling arms are stochastic with respect to time but spread over a fixed number of time steps in the future after pulling the arm. The complexity of this problem stems from the anonymous feedback to the player and the stochastic generation of the reward. Due to the aggregated nature of the rewards,…
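The feedback model described in the abstract is easiest to see in simulation. Below is a minimal sketch (not code from the paper) of the SDCAF observation process, assuming K Bernoulli arms whose reward from each pull is spread uniformly over the next d time steps; the variable names, parameter values, and the uniform-play policy are all illustrative assumptions.

    import numpy as np

    # Sketch of the SDCAF feedback model (illustrative assumptions, not the paper's code):
    # each pull's stochastic reward is spread uniformly over the next d steps, and the
    # player observes only the per-step aggregate, with no attribution to past pulls.
    rng = np.random.default_rng(0)
    K, T, d = 3, 1000, 5              # arms, horizon, delay spread (hypothetical values)
    mu = np.array([0.3, 0.5, 0.7])    # hypothetical arm means

    pending = np.zeros(T + d)         # reward mass scheduled to arrive at future steps
    observed = np.zeros(T)            # anonymous aggregate feedback seen by the player

    for t in range(T):
        arm = rng.integers(K)                 # placeholder policy: play uniformly at random
        reward = rng.binomial(1, mu[arm])     # stochastic reward generated by this pull
        pending[t:t + d] += reward / d        # spread the reward over the next d steps
        observed[t] = pending[t]              # only the aggregate at step t is revealed

    print("total observed reward:", observed.sum())

Because observed[t] mixes contributions from up to d earlier pulls, the player cannot attribute reward to a particular arm or time step, which is exactly the credit-assignment difficulty the abstract describes.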

Cited by 5 publications (9 citation statements) | References 4 publications
“…The regret upper bound of ODAAF is O(N(log T + E[d])), which is the same as BOLD with non-anonymous feedback. (Garg and Akash 2019) then explores the composite and anonymous feedback setting and makes some minor changes to generalize the ODAAF policy. However, their algorithm still needs to use precise knowledge of the reward interval.…”
Section: Related Work
mentioning, confidence: 99%
“…The value d_1 in the theorem can be regarded as an upper bound of the expected rewards in the triangle region (in Figure 1), and d_2 is an upper bound of the variance. Compared to the ODAAF policy in (Pike-Burke et al 2018; Garg and Akash 2019), the regret upper bound of ARS-UCB also depends on the mean and variance of the feedback delay. However, our algorithm has the advantage that it does not require any prior information about d_1 and d_2, whereas the ODAAF policy takes both d_1 and d_2 as inputs.…”
Section: N_i(t)
mentioning, confidence: 99%
“…Two other papers consider the case where the delays are not observed at all, but are bounded by a constant D > 0. (Garg and Akash, 2019) analyze the stochastic setting and (Cesa-Bianchi et al, 2018) the adversarial setting, achieving regret of order √(TK log K) + KD log T and √(DTK), respectively. (Pike-Burke et al, 2018) further consider unbounded delays, this time under the assumption that only their expectation is bounded.…”
Section: Related Work
mentioning, confidence: 99%
“…Prior work on delayed bandits has bypassed the challenges above by assuming that the delays are observed (Joulani et al, 2013; Dudik et al, 2011), which removes the ambiguity, or bounded by a fixed quantity (Pike-Burke et al, 2018; Garg and Akash, 2019; Cesa-Bianchi et al, 2018), which gives other possibilities for dealing with them. Another approach, proposed by (Vernade et al, 2017), is to drop the artificial requirement of observability of delays and instead impose that all delays have the same distribution across arms and that this distribution is known.…”
Section: Introduction
mentioning, confidence: 99%