In this paper, we investigate whether deep learning models for species classification in camera trap images are well calibrated, i.e. whether the predicted confidence scores can be reliably interpreted as probabilities that the predictions are correct. Additionally, as camera traps are often configured to take multiple photos of the same event, we also explore the calibration of predictions at the sequence level.

Here, we (i) train deep learning models on a large and diverse European camera trap dataset, using five established architectures; (ii) compare their calibration and classification performance on three independent test sets; (iii) measure performance at the sequence level using four approaches to aggregating individual predictions; and (iv) study the effect and practicality of a post-hoc calibration method at both the image and sequence levels.

Our results first suggest that calibration and accuracy are closely intertwined and vary greatly across model architectures. Secondly, we observe that averaging the logits over the sequence before applying softmax normalization emerges as the most effective method for achieving both good calibration and accuracy at the sequence level. Finally, temperature scaling can be a practical solution to further improve calibration, given the generalizability of the optimal temperature across datasets.

We conclude that, with an adequate methodology, deep learning models for species classification can be very well calibrated. This considerably improves the interpretability of the confidence scores and their usability in downstream ecological tasks.
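The sequence-level aggregation and post-hoc calibration described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it averages per-image logits over a sequence, applies temperature scaling (the function names, toy logits, and the temperature value are assumptions for illustration), and normalizes with a softmax to obtain a sequence-level probability vector.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sequence_probs(logits, temperature=1.0):
    # Aggregate per-image logits of shape (n_images, n_classes) into one
    # sequence-level probability vector: average the logits over the
    # sequence, rescale by the temperature, then apply softmax.
    mean_logits = np.mean(logits, axis=0)
    return softmax(mean_logits / temperature)

# Toy example: 3 images of the same event, 4 candidate species.
logits = np.array([[2.0, 0.5, -1.0, 0.1],
                   [1.5, 0.8, -0.5, 0.0],
                   [2.2, 0.3, -0.8, 0.2]])

# A temperature > 1 softens overconfident predictions; in practice the
# optimal value is fitted on a held-out validation set.
probs = sequence_probs(logits, temperature=1.5)
```

Averaging logits before the softmax (rather than averaging per-image probabilities) preserves the relative evidence contributed by each image, which the paper identifies as the most effective aggregation strategy.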