“…The assessment of the performance of a given model version using observational benchmarks has also been actively discussed in the literature (Hoffman et al, 2017; Peng et al, 2014; Kelley et al, 2013; Luo et al, 2012; Blyth et al, 2011; Randerson et al, 2009), and different frameworks have been proposed. Here we employ the Latin Hypercube Sampling (LHS) approach (McKay et al, 1979), as used successfully in previous studies (Battaglia et al, 2016; Steinacher and Joos, 2016; Battaglia and Joos, 2017; Zaehle et al, 2005). It allows simultaneous stratified sampling of a range of parameters, given an appropriate prior parameter distribution, while offering the opportunity to change evaluation metrics a posteriori, thus enabling a sensible incorporation of multiple observational constraints.…”
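
To illustrate the stratified sampling described in the passage, the following is a minimal sketch of Latin Hypercube Sampling. It assumes independent uniform priors, which the quoted text does not specify; the function name `latin_hypercube`, the parameter bounds, and the sample size are hypothetical and are not taken from the cited studies.

```python
import numpy as np


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples parameter sets by Latin Hypercube Sampling.

    Assumption: each parameter has an independent uniform prior on
    [low, high]. Each parameter range is cut into n_samples
    equal-probability strata, and every stratum is sampled exactly once.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)  # shape (n_params, 2)
    n_params = bounds.shape[0]

    # One uniform draw inside each stratum [i/n, (i+1)/n), per parameter.
    strata = np.arange(n_samples)[:, None]
    u = (strata + rng.random((n_samples, n_params))) / n_samples

    # Shuffle the strata independently for each parameter so that the
    # pairing of strata across parameters is random.
    for j in range(n_params):
        u[:, j] = rng.permutation(u[:, j])

    # Map the unit hypercube onto the prior ranges.
    low, high = bounds[:, 0], bounds[:, 1]
    return low + u * (high - low)


# Hypothetical example: 10 ensemble members, 2 parameters.
samples = latin_hypercube(10, [(0.0, 1.0), (5.0, 15.0)], seed=1)
print(samples.shape)
```

Because the sampled parameter sets do not depend on the evaluation metric, the model output of each ensemble member can be stored once and then scored later against any chosen observational constraint, which is the a posteriori flexibility the passage refers to.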