Random Forests variable importance measures are often used to rank variables by their relevance to a classification problem and subsequently to reduce the number of model inputs in high-dimensional data sets, thus increasing computational efficiency. However, because training data and predictor variables are randomly selected when constructing each tree and splitting each node, it is well known that variable importance rankings tend to differ between model runs if too few trees are generated. In this letter, we characterize the effect of the number of trees (ntree) and class separability on the stability of variable importance rankings and develop a systematic approach to define the number of model runs and/or trees required to achieve stability in variable importance measures. Results demonstrate that either a large ntree for a single model run or values averaged across multiple model runs with fewer trees is sufficient for achieving stable mean importance values. While the latter is far more computationally efficient, both methods tend to lead to the same ranking of variables. Moreover, the optimal number of model runs differs depending on the separability of classes. Recommendations are made to users regarding how to determine the number of model runs and/or trees that are required to achieve stable variable importance rankings. Index Terms: Mean decrease in accuracy (MDA), mean decrease in Gini (MDG) index, random forest, variable reduction.
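The idea of averaging importance values across runs and then checking whether the ranking has stabilized can be sketched as follows. This is a minimal illustration, not the letter's procedure: the helper names (`rank`, `spearman`, `mean_importance`, `simulated_run`) are hypothetical, Spearman rank correlation is used here as one possible stability measure, and `simulated_run` is a stand-in for a real Random Forest run that assumes importance noise shrinks roughly as 1/sqrt(ntree).

```python
import random
import statistics


def rank(values):
    """Return 1-based ranks, where rank 1 is the largest value (ties broken by index)."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    ranks = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def spearman(a, b):
    """Spearman rank correlation between two importance vectors (assumes no ties)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(rank(a), rank(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def mean_importance(runs):
    """Average the importance of each variable across several model runs."""
    return [statistics.fmean(column) for column in zip(*runs)]


def simulated_run(true_importance, ntree, rng):
    """Hypothetical stand-in for one model run.

    ASSUMPTION: noise standard deviation scales as 1/sqrt(ntree). Replace this
    with importance values from an actual Random Forest implementation.
    """
    sd = 1.0 / ntree ** 0.5
    return [v + rng.gauss(0.0, sd) for v in true_importance]


if __name__ == "__main__":
    rng = random.Random(0)
    true_importance = [0.9, 0.7, 0.5, 0.3, 0.1]  # illustrative values only

    # Two independent single runs with few trees.
    single_a = simulated_run(true_importance, 25, rng)
    single_b = simulated_run(true_importance, 25, rng)

    # Two independent sets of 20 runs each, averaged per variable.
    avg_a = mean_importance([simulated_run(true_importance, 25, rng) for _ in range(20)])
    avg_b = mean_importance([simulated_run(true_importance, 25, rng) for _ in range(20)])

    print("rank agreement, single runs:  ", spearman(single_a, single_b))
    print("rank agreement, averaged runs:", spearman(avg_a, avg_b))
```

In practice, one would increase the number of runs (or ntree) until the agreement between independent repetitions stops improving.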
In this study, a new method is proposed for semi-automated surface water detection using synthetic aperture radar data via a combination of radiometric thresholding and image segmentation based on the simple linear iterative clustering superpixel algorithm. Consistent intensity thresholds are selected by assessing the statistical distribution of backscatter values applied to the mean of each superpixel. Higher-order texture measures, such as variance, are used to improve accuracy by removing false positives via an additional thresholding process used to identify the boundaries of water bodies. Results for quad-polarized RADARSAT-2 data show that the threshold value for the variance texture measure can be approximated using a constant value for different scenes, and thus it can be used in a fully automated cleanup procedure. Compared to similar approaches, errors of omission and commission are reduced with the proposed method. For example, we observed that a threshold-only approach consistently tends to underestimate the extent of water bodies compared to combined thresholding and segmentation, mainly due to the poor performance of the former at the edges of water bodies. The proposed method can be used for monitoring changes in surface water extent within wetlands or other areas, and while presented for use with radar data, it can also be used to detect surface water in optical images.
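The combination of a per-superpixel intensity threshold with a variance-based cleanup step might look roughly like the sketch below. It assumes the superpixel label map has already been produced by a segmentation step such as SLIC, and it makes several assumptions not stated in the abstract: variance is computed per superpixel, superpixels with low mean backscatter and low variance are kept as water, and the function names and threshold values are hypothetical.

```python
from collections import defaultdict
import statistics


def superpixel_stats(labels, backscatter):
    """Return {superpixel label: (mean, population variance)} of backscatter values."""
    groups = defaultdict(list)
    for label_row, value_row in zip(labels, backscatter):
        for label, value in zip(label_row, value_row):
            groups[label].append(value)
    return {
        label: (statistics.fmean(values), statistics.pvariance(values))
        for label, values in groups.items()
    }


def water_superpixels(labels, backscatter, intensity_max, variance_max):
    """Select superpixels whose mean backscatter and variance are both below thresholds.

    ASSUMPTION: low mean backscatter suggests open water, and high variance
    flags mixed or edge superpixels that are treated as false positives.
    """
    stats = superpixel_stats(labels, backscatter)
    return {
        label
        for label, (mean, variance) in stats.items()
        if mean <= intensity_max and variance <= variance_max
    }


# Toy 3x3 scene with three superpixels; values are illustrative backscatter in dB.
labels = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 2],
]
backscatter = [
    [-22.0, -21.0, -8.0],
    [-23.0, -9.0, -7.0],
    [-20.0, -5.0, -21.0],
]

# Hypothetical thresholds: -15 dB for the mean, 5 for the variance.
print(water_superpixels(labels, backscatter, intensity_max=-15.0, variance_max=5.0))
```

In this toy scene, superpixel 2 passes the intensity threshold but is rejected by the variance threshold, which is the kind of false positive the cleanup step is meant to catch.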
To better understand and mitigate threats to the long-term health and functioning of wetlands, there is a need to establish comprehensive inventorying and monitoring programs. Here, remote sensing data and machine learning techniques that could support or substitute for traditional field-based data collection are evaluated. For the Bay of Quinte on Lake Ontario, Canada, different combinations of multi-angle/temporal quad pol RADARSAT-2, simulated compact pol RADARSAT Constellation Mission (RCM), and high and low spatial resolution Digital Elevation and Surface Models (DEM and DSM, respectively) were used to classify six land cover classes with Random Forests: shallow water, marsh, swamp, water, forest, and agriculture/non-forested. Results demonstrate that high accuracies can be achieved with multi-temporal SAR data alone (e.g., user’s and producer’s accuracies ≥90% for a model based on a spring image and a summer image), or via fusion of SAR and DEM and DSM data for single dates/incidence angles (e.g., user’s and producer’s accuracies ≥90% for a model based on a spring image, DEM, and DSM data). For all models based on single SAR images, simulated compact pol data generally achieved lower accuracies than quad pol RADARSAT-2 data. However, it was possible to compensate for observed differences through either multi-temporal/angle data fusion or the inclusion of DEM and DSM data (i.e., as a result, there was no statistically significant difference among multiple models). With more frequent repeat passes than RADARSAT-2, RCM is expected to be a reliable source of C-band SAR data that will contribute positively to ongoing efforts to inventory wetlands and monitor change in areas containing the same land cover classes evaluated here.
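For readers unfamiliar with the accuracy measures quoted above, user's and producer's accuracies can be computed from a confusion matrix as in the following sketch. The function name and the example counts are illustrative only (they are not results from the study), and the matrix orientation is an assumption stated in the docstring.

```python
def accuracy_report(confusion):
    """Compute producer's, user's, and overall accuracy from a square confusion matrix.

    ASSUMED orientation: confusion[i][j] is the number of samples whose
    reference class is i and whose predicted class is j.
    """
    n = len(confusion)
    row_totals = [sum(row) for row in confusion]  # reference totals per class
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # predicted totals
    correct = [confusion[k][k] for k in range(n)]

    # Producer's accuracy: share of reference samples of a class that were found.
    producers = [c / t if t else None for c, t in zip(correct, row_totals)]
    # User's accuracy: share of samples labelled as a class that truly belong to it.
    users = [c / t if t else None for c, t in zip(correct, col_totals)]

    total = sum(row_totals)
    overall = sum(correct) / total if total else None
    return producers, users, overall


# Illustrative two-class matrix (rows: reference, columns: predicted).
example = [
    [45, 5],
    [10, 40],
]
producers, users, overall = accuracy_report(example)
print("producer's:", producers)
print("user's:    ", users)
print("overall:   ", overall)
```

Producer's accuracy is the complement of the error of omission, and user's accuracy is the complement of the error of commission.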
Detailed information on the land cover types present and the horizontal position of the land-water interface is needed for sensitive coastal ecosystems throughout the Arctic, both to establish baselines against which the impacts of climate change can be assessed and to inform response operations in the event of environmental emergencies such as oil spills. Previous work has demonstrated potential for accurate classification via fusion of optical and SAR data, though what contribution either makes to model accuracy is not well established, nor is it clear what shorelines can be classified using optical or SAR data alone. In this research, we evaluate the relative value of quad pol RADARSAT-2 and Landsat 5 data for shoreline mapping by excluding each dataset in turn from Random Forest models used to classify images acquired over Nunavut, Canada. In anticipation of the RADARSAT Constellation Mission (RCM), we also simulate and evaluate dual and compact polarimetric imagery for shoreline mapping. Results show that SAR data is needed for accurate discrimination of substrates, as user's and producer's accuracies were 5-24% higher for models constructed with quad pol RADARSAT-2 and DEM data than for models constructed with Landsat 5 and DEM data. Models based on simulated RCM and DEM data achieved significantly lower overall accuracies (71-77%) than models based on quad pol RADARSAT-2 and DEM data (80%), with Wetland and Tundra being most adversely affected. When classified together with Landsat 5 and DEM data, however, model accuracy was less affected by the SAR data type, with multiple polarizations and modes achieving independent overall accuracies within a range acceptable for operational mapping, at 89-91%. RCM is expected to contribute positively to ongoing efforts to monitor change and improve emergency preparedness throughout the Arctic.