Monitoring biodiversity is paramount to manage and protect natural resources, particularly in times of global change. Collecting images of organisms over large temporal or spatial scales is a promising practice to monitor and study biodiversity change of natural ecosystems, providing large amounts of data with minimal interference with the environment. Deep learning models are currently used to automate classification of organisms into taxonomic units. However, imprecision in these classifiers introduce a measurement noise that is difficult to control and can significantly hinder the analysis and interpretation of data. In our study, we show that this limitation can be overcome by ensembles of Data-efficient image Transformers (DeiTs), which significantly outperform the previous state of the art (SOTA). We validate our results on a large number of ecological imaging datasets of diverse origin, and organisms of study ranging from plankton to insects, birds, dog breeds, animals in the wild, and corals. On all the data sets we test, we achieve a new SOTA, with a reduction of the error with respect to the previous SOTA ranging from 18.48% to 87.50%, depending on the data set, and often achieving performances very close to perfect classification. The main reason why ensembles of DeiTs perform better is not due to the single-model performance of DeiTs, but rather to the fact that predictions by independent models have a smaller overlap, and this maximizes the profit gained by ensembling. This positions DeiT ensembles as the best candidate for image classification in biodiversity monitoring.