2019
DOI: 10.48550/arxiv.1910.01161
Preprint
Stochastic Bandits with Delayed Composite Anonymous Feedback

Abstract: We explore a novel setting of the Multi-Armed Bandit (MAB) problem inspired by real-world applications, which we call bandits with "stochastic delayed composite anonymous feedback (SDCAF)". In SDCAF, the rewards on pulling arms are stochastic with respect to time but spread over a fixed number of time steps in the future after pulling the arm. The complexity of this problem stems from the anonymous feedback to the player and the stochastic generation of the reward. Due to the aggregated nature of the rewards,…
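The feedback model described in the abstract is easiest to see in simulation. Below is a minimal sketch (not code from the paper) of the SDCAF observation process, assuming K Bernoulli arms whose reward from each pull is spread uniformly over the next d time steps; the variable names, parameter values, and the uniform-play policy are all illustrative assumptions.

    import numpy as np

    # Sketch of the SDCAF feedback model (illustrative assumptions, not the paper's code):
    # each pull's stochastic reward is spread uniformly over the next d steps, and the
    # player observes only the per-step aggregate, with no attribution to past pulls.
    rng = np.random.default_rng(0)
    K, T, d = 3, 1000, 5              # arms, horizon, delay spread (hypothetical values)
    mu = np.array([0.3, 0.5, 0.7])    # hypothetical arm means

    pending = np.zeros(T + d)         # reward mass scheduled to arrive at future steps
    observed = np.zeros(T)            # anonymous aggregate feedback seen by the player

    for t in range(T):
        arm = rng.integers(K)                 # placeholder policy: play uniformly at random
        reward = rng.binomial(1, mu[arm])     # stochastic reward generated by this pull
        pending[t:t + d] += reward / d        # spread the reward over the next d steps
        observed[t] = pending[t]              # only the aggregate at step t is revealed

    print("total observed reward:", observed.sum())

Because observed[t] mixes contributions from up to d earlier pulls, the player cannot attribute reward to a particular arm or time step, which is exactly the credit-assignment difficulty the abstract describes.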

Cited by 5 publications (9 citation statements) | References 4 publications
“…The regret upper bound of ODAAF is O(N(log T + E[d])), which is the same as BOLD with non-anonymous feedback. (Garg and Akash 2019) then explores the composite and anonymous feedback setting and makes some minor changes to generalize the ODAAF policy. However, their algorithm still needs to use precise knowledge of the reward interval.…”
Section: Related Work
mentioning, confidence: 99%
“…The value d_1 in the theorem can be regarded as an upper bound of the expected rewards in the triangle region (in Figure 1), and d_2 is an upper bound of the variance. Compared to the ODAAF policy in (Pike-Burke et al 2018; Garg and Akash 2019), the regret upper bound of ARS-UCB also depends on the mean and variance of the feedback delay. However, our algorithm has the advantage that it does not require any prior information about d_1 and d_2, whereas the ODAAF policy takes both d_1 and d_2 as inputs.…”
Section: N_i(t)
mentioning, confidence: 99%
“…Two other papers consider the case where the delays are not observed at all, but are bounded by a constant D > 0. (Garg and Akash, 2019) analyze the stochastic setting and (Cesa-Bianchi et al, 2018) the adversarial setting, achieving regret of order √(TK log K) + KD log T and √(DTK), respectively. (Pike-Burke et al, 2018) further consider unbounded delays, this time under the assumption that only their expectation is bounded.…”
Section: Related Work
mentioning, confidence: 99%
“…Prior work on delayed bandits has bypassed the challenges above by assuming that the delays are observed (Joulani et al, 2013; Dudik et al, 2011), which removes the ambiguity, or bounded by a fixed quantity (Pike-Burke et al, 2018; Garg and Akash, 2019; Cesa-Bianchi et al, 2018), which gives other possibilities for dealing with them. Another approach, proposed by (Vernade et al, 2017), is to drop the artificial requirement of observability of delays and instead impose that all delays have the same distribution across arms and that this distribution is known.…”
Section: Introduction
mentioning, confidence: 99%