Traditional methods for individual tree-crown (ITC) detection (image classification, segmentation, template matching, etc.) applied to very high-resolution remote sensing imagery have been shown to struggle in disparate landscape types or image resolutions due to scale problems and information complexity. Deep learning promised to overcome these shortcomings due to its superior performance and versatility, proven with reported detection rates of ~90%. However, such models still find their limits in transferability across study areas, because of different tree conditions (e.g., isolated trees vs. compact forests) and/or resolutions of the input data. This study introduces a highly replicable deep learning ensemble design for ITC detection and species classification based on the established single shot detector (SSD) model. The ensemble model design is based on varying the input data for the SSD models, coupled with a voting strategy for the output predictions. Very high-resolution unmanned aerial vehicles (UAV), aerial remote sensing imagery and elevation data are used in different combinations to test the performance of the ensemble models in three study sites with highly contrasting spatial patterns. The results show that ensemble models perform better than any single SSD model, regardless of the local tree conditions or image resolution. The detection performance and the accuracy rates improved by 3–18% with only as few as two participant single models, regardless of the study site. However, when more than two models were included, the performance of the ensemble models only improved slightly and even dropped.