Understanding the spatial distribution of soil organic carbon (SOC) content across different climatic regions will enhance our knowledge of carbon gains and losses due to climatic change. However, little is known about the SOC content in the contrasting arid and sub-humid regions of Iran, whose complex SOC–landscape relationships pose a challenge to spatial analysis. Machine learning (ML) models within a digital soil mapping framework can capture such complex relationships. Current research focusses on ensemble ML models to increase prediction accuracy, and the usual ensemble methods are boosting and weighted averaging. This study proposes a novel ensemble technique: the stacking of multiple ML models through a meta-learning model. In addition, we tested whether rescanning the covariate space within the ensemble further improves prediction accuracy. We first applied six state-of-the-art ML models (Cubist, random forests (RF), extreme gradient boosting (XGBoost), classical artificial neural networks (ANN), a neural network ensemble based on model averaging (AvNNet), and deep learning neural networks (DNN)) to predict and map the spatial distribution of SOC content at six soil depth intervals for both regions. We then stacked these models through a meta-learning model, with and without rescanning the covariate space. Of the six ML models, the DNN gave the best modeling accuracy, followed by RF, XGBoost, AvNNet, ANN, and Cubist. Importantly, stacking the models significantly improved the prediction of SOC content, especially when combined with rescanning the covariate space. For instance, for the upper 0–5 cm of the soil profiles, the RMSE values obtained by the proposed stacking approaches were 17% (arid site) and 9% (sub-humid site) lower than those obtained by the DNN models, the best individual models. This indicates that rescanning the original covariate space by a meta-learning model can extract more information and improve the SOC content prediction accuracy. Overall, our results suggest that stacking diverse sets of models could be used to estimate the spatial distribution of SOC content in different climatic regions more accurately.
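The stacking idea described above can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation: the two base learners (a linear model and a k-nearest-neighbour regressor) are stand-ins for the six models named in the abstract, the data are synthetic, and all function names are hypothetical. "Rescanning the covariate space" is interpreted here, as an assumption, as giving the meta-learner the original covariates alongside the out-of-fold base predictions.

```python
import math
import random

def fit_linear(X, y, ridge=1e-6):
    """Least-squares linear model with intercept; returns a predict function."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations A w = b, with a tiny ridge term for numerical stability.
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * ac for a, ac in zip(A[r], A[c])]
                b[r] -= f * b[c]
    w = [b[i] / A[i][i] for i in range(p)]
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def fit_knn(X, y, k=3):
    """k-nearest-neighbour regressor; returns a predict function."""
    data = list(zip(X, y))
    def predict(x):
        near = sorted(data, key=lambda d: math.dist(d[0], x))[:k]
        return sum(t for _, t in near) / len(near)
    return predict

def fit_stack(X, y, base_fitters, n_folds=5, rescan=False):
    """Stack base learners through a linear meta-learner.

    Out-of-fold base predictions are the meta-learner's inputs. With
    rescan=True the original covariates are appended to those inputs
    (our assumed reading of "rescanning the covariate space").
    """
    n = len(X)
    oof = [[0.0] * len(base_fitters) for _ in range(n)]
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        held = [i for i in range(n) if i % n_folds == fold]
        for j, fit in enumerate(base_fitters):
            model = fit([X[i] for i in train], [y[i] for i in train])
            for i in held:
                oof[i][j] = model(X[i])
    meta_in = [o + list(x) if rescan else o for o, x in zip(oof, X)]
    meta = fit_linear(meta_in, y)
    bases = [fit(X, y) for fit in base_fitters]  # refit on all training data
    def predict(x):
        z = [m(x) for m in bases]
        return meta(z + list(x) if rescan else z)
    return predict

def rmse(model, X, y):
    return math.sqrt(sum((model(x) - t) ** 2 for x, t in zip(X, y)) / len(y))

# Synthetic stand-in data: two covariates, one nonlinear response.
rng = random.Random(42)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(150)]
y = [math.sin(3 * a) + 0.5 * b + rng.gauss(0, 0.05) for a, b in X]
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]

base_fitters = [fit_linear, fit_knn]
plain = fit_stack(X_train, y_train, base_fitters, rescan=False)
rescanned = fit_stack(X_train, y_train, base_fitters, rescan=True)
rmse_plain = rmse(plain, X_test, y_test)
rmse_rescan = rmse(rescanned, X_test, y_test)
print(f"stack RMSE: {rmse_plain:.3f}, stack + rescan RMSE: {rmse_rescan:.3f}")
```

On toy data like this, either variant may come out ahead, so the printed numbers say nothing about the study's results. In practice a library routine would normally replace the hand-written parts; scikit-learn's `StackingRegressor`, for example, has a `passthrough` option that forwards the original features to the final estimator.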
Most common machine learning (ML) algorithms work well on balanced training sets, that is, datasets in which all classes are represented approximately equally. Otherwise, the accuracy estimates may be unreliable, and classes with only a few samples are often misclassified or neglected. This is known as the class imbalance problem in machine learning, and datasets that do not meet this criterion are referred to as imbalanced data. Most soil-class datasets are imbalanced in this sense. One of our main objectives is to compare eight resampling strategies that have been developed to counteract the imbalanced data problem. We compared the performance of five of the most common ML algorithms in combination with the resampling approaches. The highest increase in prediction accuracy was achieved with SMOTE (the synthetic minority oversampling technique). Compared with the baseline prediction on the original dataset, we achieved an increase of about 10, 20 and 10% in the overall accuracy, kappa index and F-score, respectively. Among the ML approaches, random forest (RF) showed the best performance, with an overall accuracy, kappa index and F-score of 66, 60 and 57%, respectively. Moreover, compared with RF trained on the original dataset, the combination of RF and SMOTE improved the accuracy for the individual soil classes and allowed better prediction of soil classes with a low number of samples in the corresponding soil profile database, in our case Chernozems. Our results show that balancing existing soil legacy data using synthetic sampling strategies can significantly improve the prediction accuracy in digital soil mapping (DSM).
Highlights
• Spatial distribution of soil classes in Iran can be predicted using machine learning (ML) algorithms.
• The synthetic minority oversampling technique overcomes the drawback of imbalanced and highly biased soil legacy data.
• When combining a random forest model with synthetic sampling strategies, the prediction accuracy of the soil model improves significantly.
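To make the synthetic oversampling idea concrete, the following is a minimal, standard-library sketch of the core SMOTE step: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. The function names, the two covariates and all data values are made up for illustration and are not taken from the study; the class labels are used only as example names.

```python
import math
import random
from collections import Counter

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the same (minority) class
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: math.dist(m, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

def balance(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        if n < target:
            members = [x for x, lab in zip(X, y) if lab == label]
            new = smote(members, target - n, k=k, seed=seed)
            X_out.extend(new)
            y_out.extend([label] * len(new))
    return X_out, y_out

# Made-up covariate values: 8 samples of a common class, 3 of a rare one.
X = [[3.0, 1.0], [3.2, 1.1], [2.9, 0.8], [3.5, 1.4],
     [3.1, 0.9], [2.7, 1.2], [3.3, 1.0], [3.0, 1.3],
     [1.0, 5.0], [1.2, 5.5], [0.9, 5.2]]
y = ["Cambisol"] * 8 + ["Chernozem"] * 3

X_bal, y_bal = balance(X, y, k=2)
print(Counter(y_bal))
```

Because every synthetic sample is a convex combination of two real samples, the new points stay within the range spanned by the rare class. Only the training data should be resampled; the evaluation data must keep their original class distribution. For real work, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is the usual choice.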