“…A variety of contemporary OPE methods have been proposed, which can be mainly divided into three categories: (i) Inverse propensity scoring (Precup 2000), such as Importance Sampling (IS) (Doroudi, Thomas, and Brunskill 2017). (ii) Direct methods that directly estimate the value functions of the evaluation policy (Nachum et al 2019; Zhang et al 2021; Yang et al 2022), including but not limited to model-based estimators (MB) (Paduraru 2013; Zhang et al 2021), value-based estimators (Le, Voloshin, and Yue 2019) such as Fitted Q Evaluation (FQE), and minimax estimators (Zhang, Liu, and Whiteson 2020; Yue 2021) such as DualDICE (Yang et al 2020a). (iii) Hybrid methods that combine aspects of both inverse propensity scoring and direct methods (Thomas and Brunskill 2016), such as doubly robust (DR) estimators (Jiang and Li 2016).…”
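
To make the three categories concrete, the following is a minimal sketch under strong simplifying assumptions: a one-step, context-free setting (a bandit rather than a full sequential decision problem), known behavior-policy probabilities, and a reward model `q_hat` supplied by the caller. The function names, the toy data, and the policies are all hypothetical and chosen for illustration; this is not the implementation used in any of the cited works.

```python
# Illustrative one-step (bandit) versions of the three OPE families.
# Assumptions: actions are integer indices, `data` is a list of
# (action, reward) pairs logged under the behavior policy pi_b, and
# pi_e / pi_b are lists of action probabilities. All names are hypothetical.

def is_estimate(data, pi_e, pi_b):
    """Inverse propensity scoring: reweight each logged reward by pi_e/pi_b."""
    return sum(pi_e[a] / pi_b[a] * r for a, r in data) / len(data)


def dm_estimate(pi_e, q_hat):
    """Direct method: average a reward model q_hat under the evaluation policy."""
    return sum(p * q for p, q in zip(pi_e, q_hat))


def dr_estimate(data, pi_e, pi_b, q_hat):
    """Hybrid (doubly robust): the direct estimate plus an importance-weighted
    correction computed on the model's residuals."""
    baseline = dm_estimate(pi_e, q_hat)
    correction = sum(
        pi_e[a] / pi_b[a] * (r - q_hat[a]) for a, r in data
    ) / len(data)
    return baseline + correction


# Toy logged data: both actions have an empirical mean reward of 1.0.
data = [(0, 1.0), (1, 0.0), (0, 1.0), (1, 2.0)]
pi_b = [0.5, 0.5]      # behavior policy that generated the log
pi_e = [0.25, 0.75]    # evaluation policy whose value we want
q_hat = [1.0, 0.5]     # deliberately misspecified reward model

v_is = is_estimate(data, pi_e, pi_b)
v_dm = dm_estimate(pi_e, q_hat)
v_dr = dr_estimate(data, pi_e, pi_b, q_hat)
print(v_is, v_dm, v_dr)
```

On this toy log the direct estimate inherits the bias of the misspecified `q_hat`, while the hybrid estimate uses the importance-weighted residuals to correct it. Sequential estimators such as FQE or DualDICE involve considerably more machinery than this one-step sketch suggests.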