2019
DOI: 10.1016/j.ecolmodel.2019.108815

Importance of spatial predictor variable selection in machine learning applications – Moving from data reproduction to spatial prediction

Abstract: Machine learning algorithms find frequent application in spatial prediction of biotic and abiotic environmental variables. However, the characteristics of spatial data, especially spatial autocorrelation, are widely ignored. We hypothesize that this is problematic and results in models that can reproduce training data but are unable to make spatial predictions beyond the locations of the training samples. We assume that not only spatial validation strategies but also spatial variable selection is essential for…

Cited by 277 publications (224 citation statements)
References 47 publications
“…This means that observations located within the range of spatial autocorrelation in the data, which exceeds 100 km in our case study on forest AGB in central Africa, should be treated as pseudoreplicates rather than new observations in model validation schemes. Ignoring spatial dependence in model cross-validation can create false confidence in model predictions and hide model overfitting, a problem that has been well documented in the recent literature on statistical ecology 30, 32, 35, 38. This issue was particularly evident in our results: (1) the model explained more than half of the AGB variation in the vicinity of sampling locations but completely failed to predict AGB at remote locations, showing that it was strongly locally overfitted, and (2) the random K-fold CV scheme did not allow the diagnosis of model overfitting and provided overly optimistic statistics of model predictive power.…”
Section: Discussion (mentioning)
confidence: 99%
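
The contrast drawn in the statement above, between random K-fold CV and validation that respects spatial dependence, can be illustrated with a small sketch. This is not the method of the cited paper or of the citing authors. It is a minimal, standard-library-only illustration that assumes clustered sampling and uses square blocks as the hold-out unit. The function names, the block size of 100 (chosen to echo the 100 km autocorrelation range mentioned in the quote), and the synthetic coordinates are all hypothetical.

```python
import math
import random


def random_kfold(n, k, seed=0):
    """Assign n sample indices to k folds at random, ignoring location."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def spatial_block_kfold(coords, k, block_size):
    """Assign samples to folds by square spatial block.

    All samples falling in the same block are held out together, so
    nearby (autocorrelated) samples never straddle train and test.
    """
    blocks = {}
    for i, (x, y) in enumerate(coords):
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks.setdefault(key, []).append(i)
    folds = [[] for _ in range(k)]
    for j, key in enumerate(sorted(blocks)):
        folds[j % k].extend(blocks[key])
    return folds


def mean_nearest_train_distance(coords, folds):
    """Average distance from each held-out sample to its nearest training sample."""
    total, count = 0.0, 0
    for test in folds:
        test_set = set(test)
        train = [i for i in range(len(coords)) if i not in test_set]
        for t in test:
            total += min(math.dist(coords[t], coords[j]) for j in train)
            count += 1
    return total / count


# Synthetic, hypothetical sampling design: 8 tight clusters of 25 plots,
# cluster centres 200 units apart, plots within +/- 5 units of the centre.
rng = random.Random(42)
coords = [
    (50 + 200 * i + rng.uniform(-5, 5), 50 + 200 * j + rng.uniform(-5, 5))
    for i in range(4)
    for j in range(2)
    for _ in range(25)
]

random_folds = random_kfold(len(coords), k=4)
spatial_folds = spatial_block_kfold(coords, k=4, block_size=100)

random_d = mean_nearest_train_distance(coords, random_folds)
spatial_d = mean_nearest_train_distance(coords, spatial_folds)

print(f"random K-fold: mean distance to nearest training sample = {random_d:.1f}")
print(f"spatial blocks: mean distance to nearest training sample = {spatial_d:.1f}")
```

Under these assumptions, random folds leave each held-out plot with training neighbours from its own cluster, which is the pseudoreplication the statement describes, while block-wise folds force the model to be evaluated on locations far from any training plot. In practice the block size would be chosen from the estimated autocorrelation range of the data rather than fixed in advance.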
“…This approach combines color and textural characteristics of pixel groups in combination with physical attributes such as depth, slope, significant wave height, and spatial relationships between the mapped categories (Hedley et al, 2016; Asner et al, 2017; Roelfsema et al, 2018; Li et al, 2019). Accuracy assessment methods of remotely sensed or predicted data with in-water data should be used to understand the uncertainty around this modeled data (Phillips and Marks, 1996; Meyer et al, 2019).…”
Section: Spatial Remote Sensing (mentioning)
confidence: 99%
“…The ability to collect geolocated photo quadrats representing a footprint along transects collected by divers in combination with automated photo analysis has increased our ability to get large amounts of data consistently whilst maintaining a record that could be revisited to assess the benthos condition and composition (Roelfsema and Phinn, 2010; Beijbom et al, 2015; Roelfsema et al, 2015; González-Rivero et al, 2016). How this information relates to corals outside of the survey area can be difficult and requires extensive sampling designs and uncertainty models (Phillips and Marks, 1996; Meyer et al, 2019). Sampling design in these shallow environments also needs to consider tide in relation to the application of monitoring techniques.…”
Section: In-water Monitoring Strategies (mentioning)
confidence: 99%
“…Statistically, RF was the best performing algorithm considering the temporal evaluation and together with GBM statistically strong in the spatial prediction. However, Appelhans et al (2015) and Meyer et al (2019c) have shown that a visual inspection is a further important validation step and should not be neglected in the selection of models. In a visual inspection of the predictions, it was apparent that the RF predictions were strongly driven by linear bands caused by the hillshading and solar variables (Fig.…”
Section: Visual Validation and Final Model Justification (mentioning)
confidence: 99%