“…Benchmarking has also been applied to land surface models, including in projects such as PILPS and PLUMBER (Abramowitz, 2012; Best et al, 2015; Haughton et al, 2016; Henderson-Sellers et al, 1996). More recently, model intercomparison and benchmarking projects have included the DMIP and IHM-MIP projects for distributed models (e.g., Kollet et al, 2017; Maxwell et al, 2014; Smith et al, 2004, 2012, 2013); the Great Lakes Model Intercomparison project (e.g., Mai et al, 2022); benchmarking of NLDAS land surface models (e.g., Nearing et al, 2016, 2018); and the testing of model ensembles (Pappenberger et al, 2015). These have taken one of two forms: testing which model provides the best simulations according to some metric (often using a split-record test, e.g., Knoben et al, 2019), or testing against a benchmark model, either a chosen conceptual hydrological model (e.g., Newman et al, 2017; Seibert et al, 2018) or a purely data-based or machine learning model (e.g.…”