“…Early works in this field, such as FQI and NFQ (Ernst et al., 2005; Riedmiller, 2005), termed the problem "batch" rather than offline and did not explicitly address the additional challenges that the batch setting introduces. Many other batch RL algorithms have since been proposed (Depeweg et al., 2016; Hein et al., 2018; Kaiser et al., 2020), which, despite being offline in the sense that they do not interact with the environment, do not regularize their policy accordingly and instead assume a random data-collection process that makes generalization comparatively easy. Among the first to explicitly address the limitations of the offline setting were SPIBB(-DQN) (Laroche et al., 2019) in the discrete-action case and BCQ (Fujimoto et al., 2019) in the continuous-action case.…”