“…On the other hand, offline RL (Fujimoto et al., 2019; Kumar et al., 2020; Brandfonbrener et al., 2021; Kostrikov et al., 2021) removes all need for environment interaction or user simulators, instead operating purely on static datasets of prior human interaction. Many closely related works (Jaques et al., 2019, 2020; Snell et al., 2022; Cohen et al., 2022; Verma et al., 2022; Jang et al., 2022) based on offline RL perform policy improvement via behavior cloning of self-generated utterances, an approach that inherits the ability of pre-trained language models to generate human-like responses. In RL parlance, such methods could be considered policy extraction with approximate dynamic programming.…”
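To make the "policy extraction" framing concrete, the following is a minimal sketch of value-filtered behavior cloning on self-generated utterances: the policy samples candidate responses, an offline-learned Q-function (the approximate dynamic programming component) scores them, and the policy is then behavior-cloned onto the highest-value candidates. All module names, sizes, and the toy vocabulary here are illustrative assumptions, not the cited papers' implementations.

```python
# Hypothetical sketch: value-filtered behavior cloning of self-generated utterances.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MAX_LEN, HIDDEN = 32, 8, 64  # toy sizes, not from the cited works

class TinyPolicy(nn.Module):
    """Autoregressive token policy (stand-in for a pre-trained LM)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                       # tokens: (B, T)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                           # logits: (B, T, VOCAB)

    @torch.no_grad()
    def sample(self, batch_size):
        tokens = torch.zeros(batch_size, 1, dtype=torch.long)  # BOS token = 0
        for _ in range(MAX_LEN):
            logits = self.forward(tokens)[:, -1]
            nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens

class TinyQ(nn.Module):
    """Utterance-level value estimate, assumed trained offline (e.g. TD-style)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.score = nn.Linear(HIDDEN, 1)

    def forward(self, tokens):
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

policy, q_fn = TinyPolicy(), TinyQ()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# 1) Self-generate candidate utterances, 2) keep the ones the offline
#    Q-function rates highly, 3) behavior-clone the policy onto them.
candidates = policy.sample(batch_size=64)
values = q_fn(candidates)
keep = candidates[values.topk(k=16).indices]          # value-filtered subset

logits = policy(keep[:, :-1])                          # teacher forcing on kept utterances
bc_loss = F.cross_entropy(logits.reshape(-1, VOCAB), keep[:, 1:].reshape(-1))
opt.zero_grad(); bc_loss.backward(); opt.step()
```

Because the improvement step is ordinary supervised cross-entropy on text the policy itself produced, the fine-tuned model stays close to the language distribution of the pre-trained initialization, which is the property the passage attributes to this family of methods.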