Plant phenology models are important components of process-based crop models, which are used to assess the impact of climate change on food production. For model predictions to be reliable, the parameters of phenology models must be accurately known. They are usually estimated by calibrating the model to observations. However, at regional scales, where different cultivars of a crop species may be grown, parameter estimates become inaccurate when the model ignores inherent differences in phenological development between cultivars or when model deficits are present. To account for these differences and to identify model deficits, we used a Bayesian multi-level approach to calibrate a phenology model (SPASS) to observations of silage maize grown across Germany between 2009 and 2017. We evaluated four multi-level models of increasing complexity that accounted for different combinations of ecological region, weather, and year effects, as well as the hierarchical classification of cultivars nested within ripening groups of the maize species. We compared the calibration quality of this approach with that of the commonly used pooled approach, in which none of these factors is considered. Incorporating the hierarchical classification of cultivars within ripening groups improved calibration quality. Our findings have implications for regional model calibration and data-gathering studies, since they emphasize that ripening group and cultivar information is essential. Furthermore, we found that if this information is not available, at least weather, ecological region, and year effects should be taken into account. Our results can also facilitate model improvement studies, since we identified possible model limitations related to temperature effects in the reproductive (post-flowering) phase and to soil moisture.
We demonstrate that Bayesian multi-level calibration of a phenology model facilitates the incorporation of hierarchical dependencies and the identification of model limitations. Our approach can be extended to full crop models at different spatial scales.
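As a reading aid, the nesting of cultivars within ripening groups within the maize species can be written as a generic hierarchical prior. This is only a minimal sketch under assumed normal distributions. The symbols (θ for a phenology parameter, f for the phenology model, x for its inputs, y for an observed phenological date, and the σ terms) are illustrative and are not taken from the source, which does not state the distributions it used.

```latex
% Illustrative notation only; the distributional forms are assumptions.
\begin{align*}
\theta_{s} &\sim p(\theta_{s})
  && \text{species level (maize)} \\
\theta_{r} \mid \theta_{s} &\sim \mathcal{N}\!\left(\theta_{s}, \sigma_{r}^{2}\right)
  && \text{ripening group } r \\
\theta_{c} \mid \theta_{r(c)} &\sim \mathcal{N}\!\left(\theta_{r(c)}, \sigma_{c}^{2}\right)
  && \text{cultivar } c \text{ within ripening group } r(c) \\
y_{i} \mid \theta_{c(i)} &\sim \mathcal{N}\!\left(f(x_{i}; \theta_{c(i)}), \sigma_{y}^{2}\right)
  && \text{observation } i \text{ of cultivar } c(i)
\end{align*}
```

In this notation, the pooled approach corresponds to forcing a single shared value, θ_c = θ for every cultivar c, so that all between-cultivar and between-group variation is absorbed by the residual term.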