Nonstochastic Bandits and Experts with Arm-Dependent Delays

Hoeven, Dirk van der; Cesa-Bianchi, Nicolò

doi:10.48550/arxiv.2111.01589

Cited by 1 publication

(1 citation statement)

References 17 publications

“…Delays in multi-arm bandit (MAB). Delays were extensively studied in MAB and optimization both in the stochastic setting (Agarwal & Duchi, 2012; Vernade et al., 2017; Pike-Burke et al., 2018; Cesa-Bianchi et al., 2018; Zhou et al., 2019; Gael et al., 2020; Lancewicki et al., 2021; Cohen et al., 2021a), and the adversarial setting (Quanrud & Khashabi, 2015; Cesa-Bianchi et al., 2016; Thune et al., 2019; Bistritz et al., 2019; Zimmert & Seldin, 2020; Ito et al., 2020; Gyorgy & Joulani, 2021; van der Hoeven & Cesa-Bianchi, 2021).…”

Section: Additional Related Work (mentioning)

confidence: 99%

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Jin, Lancewicki, Luo et al. 2022

Preprint

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
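The delayed-feedback protocol in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it only records when each episode's feedback would arrive, sums the delays into D, and compares the growth orders of the two regret bounds with constants and logarithmic factors ignored. The function names and the delay sequence are hypothetical, chosen for illustration.

```python
import math
from collections import defaultdict


def arrival_schedule(delays):
    """Group episodes by the episode at whose end their feedback is revealed.

    delays[k-1] is the delay of episode k (episodes are 1-indexed), so the
    feedback of episode k arrives at the end of episode k + delays[k-1].
    """
    arrivals = defaultdict(list)
    for k, d in enumerate(delays, start=1):
        arrivals[k + d].append(k)
    return dict(arrivals)


def total_delay(delays):
    """Total delay D: the sum of the per-episode delays."""
    return sum(delays)


def bound_orders(K, D):
    """Growth orders of the two bounds, ignoring constants and log factors."""
    return math.sqrt(K + D), (K + D) ** (2 / 3)


# Hypothetical delay sequence for K = 5 episodes (illustration only).
delays = [0, 2, 1, 0, 3]
K = len(delays)
D = total_delay(delays)
schedule = arrival_schedule(delays)
new_order, old_order = bound_orders(K, D)

print("feedback arrivals:", schedule)
print("total delay D:", D)
print(f"sqrt(K + D) = {new_order:.3f}, (K + D)^(2/3) = {old_order:.3f}")
```

In this toy sequence the feedback of episode 5 is scheduled for the end of episode 8, past the horizon K = 5, so it would never be observed within the run.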
