Forecasting changes in species distribution under future scenarios is one of the most prolific areas of application for species distribution models (SDMs). However, no consensus yet exists on the reliability of such models for drawing conclusions on species distribution response to changing
Keywords
Area Under the Curve (AUC), bias, climate change projections, disequilibrium, geographic extent, sample size, niche modelling, spurious relationships, True Skill Statistic (TSS).

Understanding how climate shapes species distribution and how range shifts may be driven by future climatic change is more urgent than ever. In the last thirty years, studies aimed at developing, improving and applying species distribution models (SDMs) have proliferated (Araújo et al. 2019), and forecasting changes in species distribution under future scenarios is one of the most popular areas of application for SDMs today (Thuiller et al. 2011, Schloss et al. 2012, Newbold 2018). In SDM-based climate change forecasting studies, models are trained on current data and used to predict the probability of presence under present and future conditions. Model predictions are often binarized, allowing one to assess whether a species distribution is expected to shift, contract or expand (Newbold 2018). Although many modelling techniques require presence and absence data, many models are fitted using presence-only data, i.e., contrasting presences with random pseudo-absences, or background points, that represent available conditions (Guillera-Arroita et al. 2015). The predictive performance of these models is commonly assessed by randomly splitting the dataset into training and testing subsets, fitting the model on the training subset and validating it on the testing subset using discrimination metrics such as the True Skill Statistic (TSS) or the Area Under the Curve (AUC). While several authors have warned about the challenges and uncertainties of projecting future species distribution (Dormann 2007, Peterson et al. 2018), only a few studies have tested model performance with empirical data, reporting mixed results (Rapacciuolo et al. 2012,