Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and then used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors, threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating its performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under change in collinearity structure, and the ‘folklore’ threshold of |r| > 0.7 for correlation coefficients between predictor variables was an appropriate indicator for when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
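The threshold-based pre-selection mentioned in this abstract can be illustrated with a minimal sketch. It is one possible reading of the idea, not the authors' implementation: the function name `select_by_correlation`, the greedy ranking by absolute correlation with the response, and the simulated data are all assumptions made for illustration. Only the |r| > 0.7 cut-off comes from the abstract.

```python
import numpy as np

def select_by_correlation(X, y, threshold=0.7):
    """Greedy pre-selection (hypothetical helper): keep a predictor only if
    its absolute correlation with every already-kept predictor is at most
    `threshold`. Predictors are visited in order of decreasing absolute
    correlation with the response, so the stronger member of a collinear
    pair is the one that survives."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_pred = X.shape[1]
    # Absolute pairwise correlations among predictors (columns of X).
    r_xx = np.abs(np.corrcoef(X, rowvar=False))
    # Absolute correlation of each predictor with the response.
    r_xy = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_pred)])
    kept = []
    for j in np.argsort(-r_xy):
        if all(r_xx[j, k] <= threshold for k in kept):
            kept.append(int(j))
    return sorted(kept)

# Simulated example (assumed data): x1 is almost a copy of x0, x2 is independent.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + 0.1 * rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = 2.0 * x0 + 1.0 * x2 + rng.normal(size=n)

kept = select_by_correlation(X, y)
print("retained predictor columns:", kept)
```

In this sketch only one of the two nearly identical predictors is retained alongside the independent one. As the abstract stresses, the dropped variable still has to be considered when the final model is interpreted.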
Species distributional or trait data based on range map (extent-of-occurrence) or atlas survey data often display spatial autocorrelation, i.e. locations close to each other exhibit more similar values than those further apart. If this pattern remains present in the residuals of a statistical model based on such data, one of the key assumptions of standard statistical analyses, that residuals are independent and identically distributed (i.i.d.), is violated. The violation of the assumption of i.i.d. residuals may bias parameter estimates and can increase type I error rates (falsely rejecting the null hypothesis of no effect). While this is increasingly recognised by researchers analysing species distribution data, there is, to our knowledge, no comprehensive overview of the many available spatial statistical methods to take spatial autocorrelation into account in tests of statistical significance. Here, we describe six different statistical approaches to infer correlates of species' distributions, for both presence/absence (binary response) and species abundance data (Poisson or normally distributed response), while accounting for spatial autocorrelation in model residuals: autocovariate regression; spatial eigenvector mapping; generalised least squares; (conditional and simultaneous) autoregressive models; and generalised estimating equations. A comprehensive comparison of the relative merits of these methods is beyond the scope of this paper. To demonstrate each method's implementation, however, we undertook preliminary tests based on simulated data. These preliminary tests verified that most of the spatial modeling techniques we examined showed good type I error control and precise parameter estimates, at least when confronted with simplistic simulated data containing
Aim: Spatial autocorrelation is a frequent phenomenon in ecological data and can affect estimates of model coefficients and inference from statistical models. Here, we test the performance of three different simultaneous autoregressive (SAR) model types (spatial error = SARerr, lagged = SARlag and mixed = SARmix) and common ordinary least squares (OLS) regression when accounting for spatial autocorrelation in species distribution data using four artificial data sets with known (but different) spatial autocorrelation structures. Methods: We evaluate the performance of SAR models by examining spatial patterns in model residuals (with correlograms and residual maps), by comparing model parameter estimates with true values, and by assessing their type I error control with calibration curves. We calculate a total of 3240 SAR models and illustrate how the best models [in terms of minimum residual spatial autocorrelation (minRSA), maximum model fit (R²), or Akaike information criterion (AIC)] can be identified using model selection procedures. Results: Our study shows that the performance of SAR models depends on model specification (i.e. model type, neighbourhood distance, coding styles of spatial weights matrices) and on the kind of spatial autocorrelation present. SAR model parameter estimates might not be more precise than those from OLS regressions in all cases. SARerr models were the most reliable SAR models and performed well in all cases (independent of the kind of spatial autocorrelation induced and whether models were selected by minRSA, R² or AIC), whereas OLS, SARlag and SARmix models showed weak type I error control and/or unpredictable biases in parameter estimates. Main conclusions: SARerr models are recommended for use when dealing with spatially autocorrelated species distribution data. SARlag and SARmix might not always give better estimates of model coefficients than OLS, and can thus generate bias. Other spatial modelling techniques should be assessed comprehensively to test their predictive performance and accuracy for biogeographical and macroecological research.
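Checking model residuals for remaining spatial autocorrelation, as this abstract describes, can be sketched as follows. The abstract does not say which statistic underlies its correlograms, so the use of Moran's I here is an assumption, as are the helper name `morans_i`, the binary distance-based weights, the neighbourhood distance of 1.5 grid units and the simulated grid data. The sketch only diagnoses autocorrelation in OLS residuals; it does not fit a SAR model.

```python
import numpy as np

def morans_i(values, coords, max_dist):
    """Moran's I with binary weights (hypothetical helper): two sites are
    neighbours if their distance is greater than 0 and at most `max_dist`."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all sites.
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((d > 0) & (d <= max_dist)).astype(float)
    z = values - values.mean()
    return (len(z) / w.sum()) * (z @ w @ z) / (z @ z)

# Simulated 20 x 20 grid (assumed data) with a smooth, spatially structured error.
rng = np.random.default_rng(1)
gx, gy = np.meshgrid(np.arange(20), np.arange(20))
coords = np.column_stack([gx.ravel(), gy.ravel()])
n = len(coords)
x = rng.normal(size=n)
error = (2.0 * np.sin(coords[:, 0] / 3.0) + 2.0 * np.cos(coords[:, 1] / 3.0)
         + 0.3 * rng.normal(size=n))
y = 1.0 + 0.5 * x + error

# Ordinary least squares fit that ignores the spatial structure.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

i_resid = morans_i(residuals, coords, max_dist=1.5)
i_noise = morans_i(rng.normal(size=n), coords, max_dist=1.5)
print("Moran's I of OLS residuals:", round(float(i_resid), 3))
print("Moran's I of white noise:  ", round(float(i_noise), 3))
```

A clearly positive value for the OLS residuals, next to a value near zero for white noise, is the kind of pattern that would motivate a spatially explicit model. Repeating the calculation over several distance classes would give a simple correlogram.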
Spatial (two-dimensional) distributions in ecology are often influenced by spatial autocorrelation. In standard regression models, however, observations are assumed to be statistically independent. In this paper we present an alternative to other methods that allow for autocorrelation. We show that the theory of wavelets provides an efficient method to remove autocorrelations in regression models using data sampled on a regular grid. Wavelets are particularly suitable for data analysis without any prior knowledge of the underlying correlation structure. We illustrate our new method, called the wavelet-revised model, by applying it to multiple regression for both normal linear models and logistic regression. Results are presented for computationally simulated data and real ecological data (distribution of species richness and distribution of the plant species Dianthus carthusianorum throughout Germany). These results are compared to those of generalized linear models and models based on generalized estimating equations. We recommend wavelet-revised models, in particular, as a method for logistic regression using large datasets.