Machine learning algorithms find frequent application in spatial prediction of biotic and abiotic environmental variables. However, the characteristics of spatial data, especially spatial autocorrelation, are widely ignored. We hypothesize that this is problematic and results in models that can reproduce training data but are unable to make spatial predictions beyond the locations of the training samples. We assume that not only spatial validation strategies but also spatial variable selection is essential for reliable spatial predictions. We introduce two case studies that use remote sensing to predict land cover and the leaf area index for the "Marburg Open Forest", an open research and education site of Marburg University, Germany. We use the machine learning algorithm Random Forests to train models using non-spatial and spatial cross-validation strategies to understand how spatial variable selection affects the predictions. Our findings confirm that spatial cross-validation is essential in preventing overoptimistic model performance. We further show that highly autocorrelated predictors (such as geolocation variables, e.g. latitude, longitude) can lead to considerable overfitting and result in models that can reproduce the training data but fail in making spatial predictions. The problem becomes apparent in the visual assessment of the spatial predictions that show clear artefacts that can be traced back to a misinterpretation of the spatially autocorrelated predictors by the algorithm. Spatial variable selection could automatically detect and remove such variables that lead to overfitting, resulting in reliable spatial prediction patterns and improved statistical spatial model performance. We conclude that in addition to spatial validation, a spatial variable selection must be considered in spatial predictions of ecological data to produce reliable predictions.
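The contrast between non-spatial and spatial cross-validation, and the idea of selecting variables by their spatial cross-validation score, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data (four hypothetical sites, predictors named `lat` and `ndvi`), the 1-nearest-neighbour predictor that stands in for Random Forests, and the simplified greedy selection loop, which is not the authors' implementation.

```python
import math

# Hypothetical toy data: 4 sites with 5 samples each. "lat" is clustered by
# site (a stand-in for a geolocation predictor); "ndvi" drives the response.
SITE_LAT = [0.0, 1.0, 2.0, 3.0]
SITE_NDVI = [0.2, 0.8, 0.3, 0.7]
samples = []
for site in range(4):
    for j in range(5):
        ndvi = SITE_NDVI[site] + 0.01 * (j - 2)
        samples.append({"id": len(samples), "site": site,
                        "lat": SITE_LAT[site] + 0.01 * j,
                        "ndvi": ndvi, "y": 10.0 * ndvi})


def knn1_predict(train, query, features):
    """Return the response of the nearest training sample (1-NN)."""
    nearest = min(train, key=lambda r: math.dist([r[f] for f in features],
                                                 [query[f] for f in features]))
    return nearest["y"]


def cv_rmse(data, features, fold_key):
    """RMSE over held-out predictions; samples sharing fold_key form a fold."""
    squared_errors = []
    for fold in sorted({fold_key(s) for s in data}):
        train = [s for s in data if fold_key(s) != fold]
        test = [s for s in data if fold_key(s) == fold]
        for s in test:
            squared_errors.append((knn1_predict(train, s, features) - s["y"]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def forward_selection(data, candidates, fold_key):
    """Greedily add the predictor that lowers the CV RMSE; stop when none does."""
    selected, best = [], math.inf
    while True:
        scores = {f: cv_rmse(data, selected + [f], fold_key)
                  for f in candidates if f not in selected}
        if not scores:
            break
        feature, score = min(scores.items(), key=lambda kv: kv[1])
        if score >= best:
            break
        selected.append(feature)
        best = score
    return selected, best


def by_sample(s):
    """Leave-one-sample-out folds, used here as a stand-in for random CV."""
    return s["id"]


def by_site(s):
    """Leave-one-site-out folds, a simple form of spatial CV."""
    return s["site"]


lat_random_rmse = cv_rmse(samples, ["lat"], by_sample)
lat_spatial_rmse = cv_rmse(samples, ["lat"], by_site)
spatial_selected, spatial_rmse = forward_selection(samples, ["lat", "ndvi"], by_site)

print("lat only, random CV RMSE: ", round(lat_random_rmse, 2))
print("lat only, spatial CV RMSE:", round(lat_spatial_rmse, 2))
print("selected with spatial CV: ", spatial_selected)
```

On this toy data the `lat`-only model looks accurate under random cross-validation, because every held-out sample has a neighbour from its own site, but its error is far larger when whole sites are held out. The selection scored by spatial cross-validation keeps `ndvi` and drops `lat`.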
Methods in Ecology and Evolution. This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record.
Spatial predictions of near-surface air temperature (T_air) in Antarctica are required as baseline information for a variety of research disciplines. Since the network of weather stations in Antarctica is sparse, remote sensing methods have large potential due to their capabilities and accessibility. Based on the MODIS land surface temperature (LST) data, T_air at the exact time of satellite overpass was modelled at a spatial resolution of 1 km using data from 32 weather stations. The performance of a simple linear regression model to predict T_air from LST was compared to the performance of three machine learning algorithms: Random Forest (RF), generalized boosted regression models (GBM) and Cubist. In addition to LST, auxiliary predictor variables were tested in these models. Their relevance was evaluated by a Cubist-based forward feature selection in conjunction with leave-one-station-out cross-validation to reduce the impact of spatial overfitting. GBM performed best to predict T_air using LST and the month of the year as predictor variables. Using the trained model, T_air could be estimated with a leave-one-station-out cross-validated R² of 0.71 and a RMSE of 10.51 °C. However, the machine learning approaches only slightly outperformed the simple linear estimation of T_air from LST (R² of 0.64, RMSE of 11.02 °C). Using the trained model allowed creating time series of T_air over Antarctica for 2013. Extending the training data by including more years will allow developing time series of T_air from 2000 on.
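Leave-one-station-out cross-validation, applied to the simple linear baseline that predicts T_air from LST, can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the station names and temperature values are invented, the function names are hypothetical, and R² is computed as the squared Pearson correlation between observed and predicted values, which is one common convention that the abstract does not confirm.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_station_out(records):
    """records: (station, lst, t_air) tuples. Returns (rmse, r2) of held-out predictions."""
    observed, predicted = [], []
    for held_out in sorted({r[0] for r in records}):
        # Fit only on the other stations, then predict the held-out station.
        train = [r for r in records if r[0] != held_out]
        slope, intercept = fit_line([r[1] for r in train], [r[2] for r in train])
        for station, lst, t_air in records:
            if station == held_out:
                observed.append(t_air)
                predicted.append(slope * lst + intercept)
    n = len(observed)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
    # R² as squared Pearson correlation (an assumed convention).
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return rmse, cov ** 2 / (var_o * var_p)


# Invented example values in °C: (station, LST, T_air).
demo_records = [
    ("A", -30.0, -26.0), ("A", -20.0, -18.5),
    ("B", -15.0, -10.0), ("B", -5.0, -3.0),
    ("C", -40.0, -37.0), ("C", -25.0, -20.0),
]
demo_rmse, demo_r2 = leave_one_station_out(demo_records)
print("leave-one-station-out RMSE:", round(demo_rmse, 2), "R2:", round(demo_r2, 2))
```

Because every prediction is made for a station the model never saw during fitting, the resulting scores describe how well the relationship transfers to new locations rather than how well it reproduces the training stations.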