Simulation models are extensively used to predict agricultural productivity and greenhouse gas emissions. However, the uncertainties of (reduced) model ensemble simulations have not been assessed systematically for variables affecting food security and climate change mitigation within multi-species agricultural contexts. We report an international model comparison and benchmarking exercise, showing the potential of multi-model ensembles to predict productivity and nitrous oxide (N₂O) emissions for wheat, maize, rice and temperate grasslands. Using a multi-stage modelling protocol, from blind simulations (stage 1) to partial (stages 2-4) and full calibration (stage 5), 24 process-based biogeochemical models were assessed individually or as an ensemble against long-term experimental data from four temperate grassland and five arable crop rotation sites spanning four continents. Comparisons were performed by reference to the experimental uncertainties of observed yields and N₂O emissions. Results showed that across sites and crop/grassland types, 23%-40% of the uncalibrated individual models were within two standard deviations (SD) of observed yields, while 42% (rice) to 96% (grasslands) of the models were within 1 SD of observed N₂O emissions. At stage 1, ensembles formed by the three models with the lowest prediction errors predicted both yields and N₂O emissions within experimental uncertainties for 44% and 33% of the crop and grassland growth cycles, respectively. Partial model calibration (stages 2-4) markedly reduced prediction errors of the full model ensemble median (E-median) for crop grain yields (from 36% at stage 1 down to 4% on average) and grassland productivity (from 44% to 27%) and, to a lesser and more variable extent, for N₂O emissions. Yield-scaled N₂O emissions (N₂O emissions divided by crop yields) were ranked accurately by three-model ensembles across crop species and field sites.
The potential of using process-based model ensembles to jointly predict productivity and N₂O emissions at field scale is discussed.
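As a minimal illustration of the quantities named above, the following Python sketch computes an ensemble median, checks whether it falls within a chosen number of standard deviations of an observation, and derives yield-scaled N₂O emissions. All function names, units and numbers are hypothetical and chosen only for illustration; they are not data or code from the study, and the study's exact error metrics may differ.

```python
from statistics import median


def ensemble_median(predictions):
    """Median of the individual model predictions (an E-median-style estimate)."""
    return median(predictions)


def within_uncertainty(predicted, observed_mean, observed_sd, k=1.0):
    """True if the prediction lies within k standard deviations of the observed mean."""
    return abs(predicted - observed_mean) <= k * observed_sd


def yield_scaled_emissions(n2o_emissions, crop_yield):
    """N2O emissions divided by crop yield, as defined in the text."""
    if crop_yield <= 0:
        raise ValueError("crop yield must be positive")
    return n2o_emissions / crop_yield


# Hypothetical predictions from five models for one growth cycle.
yield_predictions = [5.2, 6.8, 7.1, 6.0, 9.4]  # assumed unit: t/ha
n2o_predictions = [1.1, 2.4, 1.8, 3.0, 1.5]    # assumed unit: kg N/ha

e_median_yield = ensemble_median(yield_predictions)
e_median_n2o = ensemble_median(n2o_predictions)

# Hypothetical observed yield (mean and SD) used for the comparison.
yield_ok = within_uncertainty(e_median_yield, observed_mean=6.5, observed_sd=0.4, k=1.0)

scaled = yield_scaled_emissions(e_median_n2o, e_median_yield)
print(e_median_yield, e_median_n2o, yield_ok, round(scaled, 3))
```

The median is used here because it is less sensitive than the mean to a single outlying model, which is one plausible reason for preferring it when ensemble members disagree strongly.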