Predicting wheat phenology is important for cultivar selection and effective crop management, and it provides a baseline for evaluating the effects of global change. Evaluating how well crop phenology can be predicted is therefore of major interest. Twenty-eight wheat modeling groups participated in this evaluation. Model predictions depend not only on model structure but also on the parameter values. This study is thus an evaluation of modeling groups, which choose the structure and fix or estimate the parameters, rather than an evaluation of model structures alone. Our target population was wheat fields in the major wheat-growing regions of Australia under current climatic conditions and with current local management practices. The environments used for calibration and for evaluation were both sampled from this same target population. The calibration and evaluation environments had neither sites nor years in common, so this is a rigorous evaluation of the ability of modeling groups to predict phenology for new sites and weather conditions. Mean absolute error (MAE) for the evaluation environments, averaged over predictions of three phenological stages and over modeling groups, was 9 days, with a range from 6 to 20 days. Predictions using the multi-modeling-group mean and median had prediction errors nearly as small as those of the best modeling group. For a given modeling group, MAE for the evaluation environments was significantly correlated with MAE for the calibration environments, which suggests that it would be of interest to test ensemble predictors that weight individual modeling groups according to their performance on the calibration data. About two thirds of the modeling groups performed better than a simple but relevant benchmark, which predicts phenology by assuming a constant temperature sum for each development stage. The added complexity of crop models beyond the effect of temperature alone was thus justified in most cases. Finally, there was substantial variability between modeling groups using the same model structure, which implies that model improvement could be achieved not only by improving model structure, but also by improving parameter values, and in particular by improving calibration techniques.

A second aspect of evaluation that must be specified is the modeling group or groups being evaluated. We reserve the term "model" specifically for model structure, i.e. the model equations, whereas a modeling group determines both the model structure and the parameter values, which are chosen or estimated by the group running the model. It is clear that predictions
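
The temperature-sum benchmark mentioned above can be illustrated with a minimal sketch in Python. The base temperature, the thermal-time threshold, and the temperature series below are hypothetical values chosen for illustration only; they are not taken from the study.

```python
# Minimal sketch of the temperature-sum benchmark: phenology is predicted by
# accumulating degree-days above a base temperature from sowing until a
# fixed threshold is reached for each development stage.
from typing import Optional, Sequence


def days_to_stage(daily_mean_temp: Sequence[float],
                  thermal_time_target: float,
                  base_temp: float = 0.0) -> Optional[int]:
    """Return the first day after sowing on which the accumulated temperature
    sum (degree-days above base_temp) reaches thermal_time_target, or None if
    the target is never reached within the series."""
    accumulated = 0.0
    for day, temp in enumerate(daily_mean_temp, start=1):
        accumulated += max(temp - base_temp, 0.0)
        if accumulated >= thermal_time_target:
            return day
    return None


# Hypothetical usage: the threshold for a given stage would be fitted on the
# calibration environments and then applied unchanged to the evaluation
# environments. The numbers below are fabricated for illustration.
temps = [12.0, 14.5, 16.0, 15.0, 13.5] * 40
print(days_to_stage(temps, thermal_time_target=800.0))
```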
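Similarly, the evaluation metric and the suggested calibration-weighted ensemble can be sketched under assumed data structures. The array names, shapes, and the inverse-MAE weighting rule below are assumptions made for illustration; the study only notes that ensembles weighted by calibration performance would be worth testing.

```python
# Sketch, under assumed array layouts, of two quantities discussed above: the
# mean absolute error (MAE) of a set of phenology predictions, and an ensemble
# that weights each modeling group by the inverse of its calibration MAE.
import numpy as np


def mae(predicted_days: np.ndarray, observed_days: np.ndarray) -> float:
    """MAE in days over all environments and phenological stages."""
    return float(np.mean(np.abs(predicted_days - observed_days)))


def inverse_mae_ensemble(group_predictions: np.ndarray,
                         calibration_mae: np.ndarray) -> np.ndarray:
    """Weighted ensemble prediction.

    group_predictions: shape (n_groups, n_environments, n_stages)
    calibration_mae:   shape (n_groups,); each group's MAE on the
                       calibration environments
    Returns an array of shape (n_environments, n_stages).
    """
    weights = 1.0 / calibration_mae
    weights = weights / weights.sum()
    # Contract over the group axis: a weighted average of the groups' predictions.
    return np.tensordot(weights, group_predictions, axes=1)
```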