Train delays are a serious and persistent problem in the UK and in many other countries. Because of increasing demand, rail networks are running close to full capacity. As a consequence, a single initial delay can cause many knock-on delays to other trains, which is the main reason for the overall deterioration in network performance. An AI-based method that predicts delays accurately and reliably would therefore help train controllers, when a delay occurs, to devise and apply alternative plans in time to reduce or prevent further delays. However, existing machine learning models are not only inaccurate but, more importantly, unreliable. In this study, we propose a new approach to building heterogeneous ensembles, with two novel model selection methods based on accuracy and diversity. We tested our heterogeneous ensembles on real-world data, and the results indicate that they are more accurate and robust than single models and state-of-the-art homogeneous ensembles such as Random Forest and XGBoost. We then verified their performance on an independent dataset from a different train operating company and found that they achieved consistent and accurate results.
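To make the idea of selecting ensemble members by accuracy and diversity concrete, the following is a minimal sketch and not the selection methods proposed in this study. It assumes that each candidate model has already produced delay predictions (in minutes) on a validation set, measures accuracy by mean absolute error, and measures diversity by the mean absolute difference between two models' predictions. The function names, the `weight` parameter, and the toy numbers are all hypothetical.

```python
from statistics import mean


def mae(pred, target):
    """Mean absolute error between predictions and observed delays."""
    return mean(abs(p - t) for p, t in zip(pred, target))


def diversity(pred_a, pred_b):
    """Assumed diversity measure: mean absolute disagreement of two models."""
    return mean(abs(a - b) for a, b in zip(pred_a, pred_b))


def select_ensemble(preds, target, size, weight=0.5):
    """Greedily pick `size` models, trading off diversity against error.

    preds maps a model name to its list of validation predictions.
    weight (hypothetical) sets how much diversity counts relative to accuracy.
    """
    remaining = set(preds)
    # Start from the single most accurate model.
    first = min(remaining, key=lambda name: mae(preds[name], target))
    chosen = [first]
    remaining.remove(first)

    while remaining and len(chosen) < size:
        def score(name):
            # Reward disagreement with current members, penalise own error.
            div = mean(diversity(preds[name], preds[c]) for c in chosen)
            return weight * div - (1 - weight) * mae(preds[name], target)

        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


def ensemble_predict(preds, chosen):
    """Average the predictions of the selected models."""
    return [mean(values) for values in zip(*(preds[name] for name in chosen))]


# Toy validation data (made up for illustration): delays in minutes.
target = [2.0, 5.0, 0.0, 8.0]
preds = {
    "tree": [3.0, 4.0, 1.0, 7.0],
    "knn": [3.0, 4.0, 1.0, 7.5],
    "linear": [1.0, 6.0, -1.0, 9.0],
}

chosen = select_ensemble(preds, target, size=2)
combined = ensemble_predict(preds, chosen)
```

In this toy case the selection keeps `knn` (the most accurate single model) and adds `linear` rather than `tree`, because `tree` makes almost the same errors as `knn` while `linear` errs in the opposite direction, so averaging cancels most of the error. A heterogeneous ensemble in practice would draw its candidates from different learning algorithms for the same reason.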