“…In these cases, it is much more meaningful to minimize dynamic regret, the gap between the total reward of the optimal sequence of policies and that of the learner. Indeed, there has been a recent surge of studies on this topic [Jaksch et al., 2010, Gajane et al., 2018, Li and Li, 2019, Ortner et al., 2020, Cheung et al., 2020, Fei et al., 2020, Domingues et al., 2020, Mao et al., 2020, Touati and Vincent, 2020.…”