Parallel to its practical significance, there has recently been a surge of theoretical investigations into offline RL along two threads: offline policy evaluation (OPE), where the goal is to estimate the value $V^\pi$ of a target (fixed) policy (Li et al., 2015; Jiang & Li, 2016; Wang et al., 2017; Liu et al., 2018; Kallus & Uehara, 2020; Uehara & Jiang, 2019; Feng et al., 2019; Nachum et al., 2019; Xie et al., 2019; Yin & Wang, 2020; Kato et al., 2020; Duan et al., 2020; Feng et al., 2020; Zhang et al., 2020b; Kuzborskij et al., 2020; Wang et al., 2020b; Zhang et al., 2021; Uehara et al., 2021; Nguyen-Tang et al., 2021; Hao et al., 2021; Xiao et al., 2021), and offline (policy) learning, which aims to output a near-optimal policy (Antos et al., 2008a,b; Chen & Jiang, 2019; Le et al., 2019; Xie & Jiang, 2020a,b; Liu et al., 2020b; Hao et al., 2020; Zanette, 2020; Jin et al., 2020c; Hu et al., 2021; Yin et al., 2021a; Ren et al., 2021; Rashidinejad et al., 2021). Yin et al. (2021b) initiate the study of offline RL from the new perspective of uniform convergence in OPE (uniform OPE for short), which unifies the OPE and offline learning tasks.