“…Recently, many offline RL methods have been proposed to implement such a penalty, either by introducing support constraints [Fujimoto, Meger, and Precup 2019; Ghasemipour, Schuurmans, and Gu 2021] or policy regularization [Kostrikov, Nair, and Levine 2021; Fujimoto and Gu 2021; Peng et al. 2019; Wu, Tucker, and Nachum 2019; Kumar et al. 2019a]. Alternatively, some offline RL methods introduce uncertainty estimation [Rezaeifar et al. 2021; Wu et al. 2021; Ma, Jayaraman, and Bastani 2021; Bai et al. 2022] or conservatism [Kumar et al. 2020; Lyu et al. 2022] over values to counteract the potential overestimation of OOD state-action values. In the same spirit, model-based offline RL methods employ distribution-correction regularization [Hishinuma and Senda 2021], uncertainty estimation [Kidambi et al. 2020], and value conservatism [Yu et al. 2021] to mitigate OOD issues.…”
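To make the value-conservatism idea concrete, below is a minimal sketch in the spirit of a CQL-style penalty [Kumar et al. 2020]: the Q-values of potentially OOD actions are pushed down while the Q-values of logged dataset actions are pushed up, on top of a standard TD loss. This assumes a discrete action space and a PyTorch Q-network; the function name, batch layout, and hyperparameters are illustrative, not the cited methods verbatim.

```python
# Illustrative sketch only: a CQL-style conservative value penalty for
# discrete actions. `q_net(state)` is assumed to return Q-values per action.
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """TD error plus a penalty that lowers Q-values on (possibly OOD) actions
    while keeping Q-values of dataset actions from being penalized."""
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    q_all = q_net(s)                                      # (B, num_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for logged actions

    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    td_loss = F.mse_loss(q_data, target)

    # Conservative term: logsumexp over all actions upper-bounds the value of
    # any action the policy might take; subtracting the dataset-action value
    # spares in-distribution actions from the push-down.
    conservative_penalty = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    return td_loss + alpha * conservative_penalty
```

The coefficient `alpha` trades off conservatism against Bellman accuracy; the uncertainty-based methods cited above replace this explicit penalty with, e.g., ensemble-disagreement bonuses subtracted from the value target.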