“…In the computational reinforcement-learning literature, this reality has called into question longstanding approaches to model-based reinforcement learning (Littman, 2015; Sutton, 1991; Sutton & Barto, 1998) that rely on standard maximum-likelihood estimation techniques to learn the exact model (𝒰, 𝒯) governing the underlying MDP. The result has been a flurry of recent work (Abachi et al., 2020; Asadi et al., 2018; Ayoub et al., 2020; Cui et al., 2020; D’Oro et al., 2020; Farahmand, 2018; Farahmand et al., 2017; Grimm et al., 2020, 2021, 2022; Nair et al., 2020; Nikishin et al., 2022; Oh et al., 2017; Schrittwieser et al., 2020; Silver et al., 2017; Voelcker et al., 2022) that eschews the traditional maximum-likelihood objective in favor of various surrogate objectives, which restrict the agent’s modeling to specific aspects of the environment. Since the core goal of endowing a decision-making agent with its own internal model of the world is to facilitate model-based planning (Bertsekas, 1995), central among these recent approaches is the value-equivalence principle (Grimm et al., 2020, 2021, 2022), which provides mathematical clarity on how surrogate models can still enable lossless planning relative to the true model of the environment.…”
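For concreteness, a minimal sketch of the value-equivalence condition of Grimm et al. (2020) is given below. The notation is assumed rather than taken from the excerpt: 𝒰 and 𝒯 are read as the MDP's reward (utility) and transition functions, Π is a set of policies, 𝒱 a set of candidate value functions, and 𝒯^π_m the Bellman operator induced by a model m and policy π.

\[
\text{A surrogate model } \tilde{m} \text{ is value-equivalent to the true model } m^{*} \text{ w.r.t. } \Pi \text{ and } \mathcal{V} \text{ iff}
\qquad
\mathcal{T}^{\pi}_{\tilde{m}}\, v \;=\; \mathcal{T}^{\pi}_{m^{*}}\, v
\quad \text{for all } \pi \in \Pi,\; v \in \mathcal{V},
\]
\[
\text{where, for a model } m = (\mathcal{U}_m, \mathcal{T}_m) \text{ and discount } \gamma, \qquad
(\mathcal{T}^{\pi}_{m}\, v)(s)
\;=\;
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\Big[\, \mathcal{U}_m(s,a) \;+\; \gamma \sum_{s'} \mathcal{T}_m(s' \mid s, a)\, v(s') \Big].
\]

Under this condition, every Bellman backup performed during planning over Π and 𝒱 is identical whether computed with 𝑚̃ or with m*, which is the sense in which a surrogate model can support lossless planning even though it need not match (𝒰, 𝒯) exactly.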