“…This gap was partially bridged by Zimmert and Seldin (2019), who proved that online mirror descent with the Tsallis-INF regularizer can achieve the optimal O(K ln T + C) bound in expectation, provided that the optimal arm is unique. Recently, the adversarially corrupted stochastic reward model has been extended to prediction with expert advice (Amir et al. 2020), assortment optimization (Chen, Krishnamurthy, and Wang 2019), Gaussian bandits (Bogunovic, Krause, and Scarlett 2020), linear bandits (Kapoor, Patel, and Kar 2019; Li, Lou, and Shan 2019), and reinforcement learning (Lykouris et al. 2019). Instead of studying the budget-bounded corruption setting, several papers focus on the scenario where the rewards are corrupted with a fixed probability (Altschuler, Brunel, and Malek 2019; Kapoor, Patel, and Kar 2019; Guan et al. 2020).…”
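To make the Tsallis-INF reference concrete, the sketch below shows one common parameterization of online mirror descent over the simplex with the 1/2-Tsallis regularizer Psi(p) = -2 * sum_i sqrt(p_i): at each round the sampling distribution is the OMD minimizer, which has a closed form up to a normalizing constant found by bisection, and losses are estimated by importance weighting. The environment interface `loss_fn(t, arm)`, the exact regularizer scaling, and the learning-rate schedule are illustrative assumptions, not taken from the cited paper, whose analysis also considers a reduced-variance estimator.

```python
import numpy as np


def tsallis_inf(loss_fn, K, T, rng=None):
    """Sketch of Tsallis-INF: OMD with the 1/2-Tsallis regularizer.

    Assumed interface: loss_fn(t, arm) returns a loss in [0, 1].
    Constants (regularizer scaling, learning rate) follow one common
    parameterization and may differ from the original paper's.
    """
    rng = np.random.default_rng() if rng is None else rng
    L_hat = np.zeros(K)                # cumulative importance-weighted loss estimates

    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)         # anytime learning-rate schedule (assumed)

        # The OMD step argmin_p <p, L_hat> + Psi(p)/eta over the simplex has the
        # closed form p_i = 1 / (eta * (L_hat_i - x))^2 for a normalizer
        # x < min_i L_hat_i.  Bracket x so that sum_i p_i crosses 1, then bisect.
        L_min = L_hat.min()
        lo = L_min - np.sqrt(K) / eta  # here every p_i <= 1/K, so the sum is <= 1
        hi = L_min - 1.0 / eta         # here the best arm's p_i = 1, so the sum is >= 1
        for _ in range(60):            # bisection to high precision
            x = 0.5 * (lo + hi)
            if np.sum(1.0 / (eta * (L_hat - x)) ** 2) < 1.0:
                lo = x
            else:
                hi = x
        p = 1.0 / (eta * (L_hat - x)) ** 2
        p /= p.sum()                   # wash out residual bisection error

        arm = rng.choice(K, p=p)       # play an arm drawn from the OMD distribution
        loss = loss_fn(t, arm)
        L_hat[arm] += loss / p[arm]    # importance-weighted loss estimate


if __name__ == "__main__":
    # Toy stochastic environment: Bernoulli losses, one arm slightly better than the rest.
    means = np.array([0.3, 0.5, 0.5, 0.5, 0.5])
    rng = np.random.default_rng(0)
    tsallis_inf(lambda t, a: float(rng.random() < means[a]),
                K=len(means), T=10_000, rng=rng)
```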