“…Bandit algorithms (Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2020) and reinforcement learning (Sutton and Barto, 2018) are modern strategies for solving sequential decision-making problems. They have recently received attention in the statistics community for business and scientific applications, including dynamic pricing (Wang et al., 2020; Chen, Simchi-Levi and Wang, 2021; Chen, Miao and Wang, 2021; Wang et al., 2021), online decision making (Shi et al., 2020; Chen, Lu and Song, 2021; Chen et al., 2022), dynamic treatment regimes (Qi and Liu, 2018; Luckett et al., 2019; Qi et al., 2020; Qi, Miao and Zhang, 2021), and online causal effects in two-sided markets (Shi et al., 2022b).…”