Since managing enormous datasets in the real world is difficult, it is necessary to reduce the size of a dataset without compromising the accuracy obtainable on the original data. In this study, the classification of the white wine dataset is examined using several machine learning techniques: Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbour (KNN), and Logistic Regression (LR). In addition, we apply these techniques to the stated dataset and propose the Sampling from Clusters (SFC) approach. Under SFC, the white wine dataset is first clustered, and then 95% of the data is drawn from each cluster and combined to form a reduced dataset for the classification process. The same procedure is repeated for 90%, 85%, and 80% of the data. For comparison, we use a random sampling (RS) technique to select 95% of the data from the same dataset and compare the results with SFC using evaluation metrics such as accuracy, precision, recall, F1-score, Receiver Operating Characteristic (ROC), Area Under the Curve (AUC), binomial confidence interval (CI), and Mean Squared Error (MSE); this procedure is likewise repeated with 90%, 85%, and 80% of the data. Statistically, the confidence intervals become tighter as the amount of test data N increases; they range from 0.72 to 0.76 for NB, 0.73 to 0.79 for SVM, 0.82 to 0.86 for RF, 0.75 to 0.77 for KNN, and 0.74 to 0.80 for LR.
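To make the cluster-then-sample idea concrete, the sketch below illustrates SFC-style reduction of the white wine dataset and contrasts it with plain random sampling (RS). It is a minimal illustration under stated assumptions: KMeans with five clusters, a local winequality-white.csv file, the 'quality' column as the class label, and Random Forest as the downstream classifier are choices made here for demonstration only and are not necessarily the paper's exact configuration.

```python
# Minimal sketch of Sampling from Clusters (SFC) versus random sampling (RS).
# Assumptions: KMeans clustering, 5 clusters, a local copy of the UCI white
# wine file, 'quality' as the class label, Random Forest as the classifier.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def sfc_sample(df, label_col, fraction, n_clusters=5, seed=0):
    """Cluster the feature columns, draw `fraction` of the rows from every
    cluster, and combine the per-cluster samples into one reduced dataset."""
    features = df.drop(columns=label_col)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)
    return df.groupby(clusters, group_keys=False).sample(frac=fraction, random_state=seed)

def evaluate(df, label_col, seed=0):
    """Train a Random Forest on the reduced data and report test accuracy."""
    X, y = df.drop(columns=label_col), df[label_col]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

wine = pd.read_csv("winequality-white.csv", sep=";")  # assumed local copy of the UCI dataset
for frac in (0.95, 0.90, 0.85, 0.80):
    sfc = sfc_sample(wine, "quality", frac)           # proposed cluster-based sample
    rs = wine.sample(frac=frac, random_state=0)       # baseline random sample
    print(f"{frac:.0%}  SFC acc = {evaluate(sfc, 'quality'):.3f}"
          f"  RS acc = {evaluate(rs, 'quality'):.3f}")
```

The reported tightening of the confidence intervals with larger test sets is consistent with the binomial standard error, which for an observed accuracy p on N test samples shrinks in proportion to sqrt(p(1 - p)/N).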