2020
DOI: 10.7494/csci.2020.21.2.3634

Tenfold Bootstrap Procedure for Support Vector Machines

Abstract: Cross-validation is often used to split input data into training and test sets for Support Vector Machines. The two most commonly used versions are tenfold and leave-one-out cross-validation. Another commonly used resampling method is the random test/train split. The advantage of these methods is that they avoid overfitting in the model and perform model selection. However, they can increase the computational time for fitting Support Vector Machines as the size of the dataset grows…
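The abstract contrasts resampling-based evaluation (tenfold and leave-one-out cross-validation, random test/train splits) with its computational cost for Support Vector Machines. Below is a minimal sketch of that trade-off, assuming scikit-learn and a small built-in dataset; neither is specified by the paper.

```python
# Minimal sketch (not from the paper): comparing tenfold cross-validation
# with a single random train/test split for an SVM, assuming scikit-learn.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Tenfold cross-validation: the SVM is fitted 10 times, once per fold.
start = time.perf_counter()
cv_scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)
cv_time = time.perf_counter() - start

# Single random train/test split: the SVM is fitted only once.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
start = time.perf_counter()
split_score = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
split_time = time.perf_counter() - start

print(f"10-fold CV: mean accuracy {cv_scores.mean():.3f}, fit time {cv_time:.2f}s")
print(f"Train/test split: accuracy {split_score:.3f}, fit time {split_time:.2f}s")
```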

Cited by 12 publications (9 citation statements) | References: 22 publications
“…Despite the adaptation of the mixed linear integer approach to classification models, it still performs slower predictions than traditional machine learning methods (Vrigazova & Ivanov, 2020b). Therefore, improving one class of methods does not guarantee the fastest classification.…”
Section: Discussion
confidence: 99%
“…The aims of this paper are first to check if the tenfold bootstrap has the computational advantage as a training/test splitting method in classification methods. Similar research was conducted for the Support Vector Machines, so this paper can be considered as an extension of (Vrigazova and Ivanov, 2020b). Secondly, to propose train/test split proportion for the bootstrap procedure to reduce computational time and preserve high accuracy of the classification model.…”
Section: Introduction
confidence: 90%
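The quotation above describes using the tenfold bootstrap as a train/test splitting method. A minimal sketch of what such a procedure could look like follows, assuming "tenfold bootstrap" means ten bootstrap resamples, each with in-bag observations for training and out-of-bag observations for testing; the authors' exact script and proposed split proportion are not reproduced here.

```python
# Minimal sketch (an assumption, not the authors' exact procedure): ten
# bootstrap resamples, each used as a train (in-bag) / test (out-of-bag)
# split for an SVM classifier.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n = len(y)

scores = []
for _ in range(10):
    in_bag = rng.integers(0, n, size=n)        # sample indices with replacement
    oob = np.setdiff1d(np.arange(n), in_bag)   # out-of-bag indices form the test set
    model = SVC(kernel="rbf").fit(X[in_bag], y[in_bag])
    scores.append(model.score(X[oob], y[oob]))

print(f"Mean out-of-bag accuracy over 10 bootstraps: {np.mean(scores):.3f}")
```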
“…To gain an insight into the characteristics of the dataset and avoid bias and overfitting in machine learning algorithms, we perform 10-fold cross-validation to evaluate the predictive accuracy of all obtained models. The whole observation data set was randomly divided into 10 parts, of which 9 parts are taken as a training set and the rest part is used for testing. This process is iterated 10 times.…”
Section: Methods
confidence: 99%
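The 10-fold procedure quoted above can be illustrated generically as follows; the dataset and classifier here are placeholders, not those used by the cited study.

```python
# Generic illustration of 10-fold cross-validation as described in the quote:
# the data are split into 10 parts; in each of 10 iterations, 9 parts train
# the model and the remaining part tests it.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(f"Mean accuracy over 10 folds: {np.mean(fold_scores):.3f}")
```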
“…For running ANOVA, PCA and the logistic regression, existing functions in Python (LogisticRegression(), sklearn.decomposition.PCA() and sklearn.Pipeline (ANOVA)) are used, while a script for running the tenfold bootstrap is created by the author. The tenfold bootstrap used in step 5 and its software realization in Python 3.6 can be found in author's previous study (Vrigazova, 2020). Right part of figure 1 illustrates the proposed algorithm.…”
Section: New Approach - The ANOVA Bootstrapped PCA
confidence: 99%
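The quotation names the scikit-learn building blocks it relies on (LogisticRegression, PCA, a Pipeline with ANOVA feature selection). A minimal sketch of such a pipeline is given below, with placeholder data and placeholder values for the number of selected features and components; the author's own tenfold bootstrap script is not reproduced here.

```python
# Minimal sketch of the kind of scikit-learn pipeline the quotation refers to:
# ANOVA feature selection, then PCA, then logistic regression. Dataset, k and
# n_components are placeholders, not the cited study's settings.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipeline = Pipeline([
    ("anova", SelectKBest(score_func=f_classif, k=15)),  # ANOVA F-test feature selection
    ("pca", PCA(n_components=5)),                        # dimensionality reduction
    ("clf", LogisticRegression(max_iter=1000)),          # final classifier
])

scores = cross_val_score(pipeline, X, y, cv=10)
print(f"Mean accuracy of ANOVA -> PCA -> logistic regression: {scores.mean():.3f}")
```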