Litter production is a fundamental ecosystem process that plays an important role in regulating terrestrial carbon and nitrogen cycles. However, litter production simulations differ substantially among ecosystem models, and a global benchmark for evaluating the performance of these models is still lacking. In this study, we generated a global dataset of aboveground litterfall production (i.e. cLitter) to serve as a reference benchmark for testing model performance, by combining systematic measurements from a large number of surveys (1079 sites) with a machine learning technique (random forest, RF). Our study demonstrated that the RF model is an effective tool for upscaling local litterfall production observations to the global scale. On average, the model predicted 23.15 Pg C yr⁻¹ of aboveground litterfall production. Our results revealed substantial differences in the aboveground litterfall production simulations among the five investigated ecosystem models. Compared to the reference data at the global scale, most of the models could reproduce the spatial patterns of aboveground litterfall production, but the magnitude of the simulations differed substantially from the reference data. Overall, ORCHIDEE-MICT performed the best among the five investigated ecosystem models.
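The distinction drawn above between reproducing spatial patterns and matching magnitude can be illustrated with a minimal sketch. The choice of metrics here (Pearson correlation for pattern agreement, area-weighted relative bias for magnitude) is an assumption made for illustration and is not necessarily what the study used. The function names and all numbers are hypothetical toy values, not results from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: a simple measure of spatial-pattern agreement."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def relative_bias(sim, ref, area):
    """Area-weighted relative bias of the simulated total vs. the reference total."""
    sim_total = sum(s * a for s, a in zip(sim, area))
    ref_total = sum(r * a for r, a in zip(ref, area))
    return (sim_total - ref_total) / ref_total

# Toy values (hypothetical, NOT from the study): litterfall per grid cell.
reference = [100.0, 200.0, 300.0, 400.0]  # stand-in for the RF-based benchmark
simulated = [150.0, 300.0, 450.0, 600.0]  # same spatial pattern, larger magnitude
cell_area = [1.0, 1.0, 2.0, 2.0]          # relative grid-cell areas

r = pearson_r(simulated, reference)
bias = relative_bias(simulated, reference, cell_area)
print(f"pattern correlation r = {r:.3f}, relative bias = {bias:+.2f}")
```

In this toy case the simulated field is a scaled copy of the reference, so the pattern correlation is perfect while the area-weighted total is 50% too high. This is the kind of situation in which a model reproduces the spatial pattern but not the magnitude.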