Combination of PCA with SMOTE Oversampling for Classification of High-Dimensional Imbalanced Data

Mulla, Guhdar A. A.; Demir, Yıldırım; Hassan, Masoud Muhammed

doi:10.17798/bitlisfen.939733

Cited by 10 publications

(8 citation statements)

References 19 publications


“…In datasets with class imbalance problem, most machine learning techniques ignore minority class performance and therefore underperform in minority class. One approach to these datasets is to oversample the minority class and is called the Synthetic Minority Oversampling Technique, or SMOTE for short (9). In order to eliminate the class imbalance problem in the colon cancer gene expression dataset (22 normal and 40 tumor tissues), the SMOTE method was applied before feature selection.…”

Section: Data Preprocessing and Modeling (mentioning)

confidence: 99%

Artificial Intelligence-based Colon Cancer Prediction by Identifying Genomic Biomarkers

Paksoy

Yağın

2022

Medical Records


Colon cancer is the third most common type of cancer worldwide. Because of the poor prognosis and unclear preoperative staging, genetic biomarkers have become more important in the diagnosis and treatment of the disease. In this study, we aimed to determine the biomarker candidate genes for colon cancer and to develop a model that can predict colon cancer based on these genes. Material and Methods: The study used a dataset containing the expression levels of 2000 genes from 62 samples (22 healthy and 40 tumor tissues), obtained by the Princeton University Gene Expression Project and shared in the figshare database. Data were summarized as mean ± standard deviation. An independent-samples t-test was used for statistical analysis. The SMOTE method was applied before feature selection to eliminate the class imbalance problem in the dataset. The 13 most important genes that may be associated with colon cancer were selected with the LASSO feature selection method. Random Forest (RF), Decision Tree (DT), and Gaussian Naive Bayes methods were used in the modeling phase. Results: All 13 genes selected by LASSO differed significantly between normal and tumor samples. In the model created with RF, the accuracy, specificity, F1-score, sensitivity, and negative and positive predictive values were all calculated as 1. The RF method offered the highest performance when compared to DT and Gaussian Naive Bayes. Conclusion: In the study, we identified the genomic biomarkers of colon cancer and classified the disease with a high-performance model. According to our results, the LASSO+RF approach can be recommended for modeling high-dimensional microarray data.
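The abstract names SMOTE but does not describe its mechanics. As a minimal sketch, assuming the textbook formulation (each synthetic sample is a random interpolation between a minority point and one of its k nearest minority neighbors), the idea can be illustrated in plain Python. The function name `smote`, the parameter names, and the two-feature toy data are hypothetical; only the class sizes (22 and 40) come from the abstract, and this is not the authors' implementation.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Assumes at least two minority points.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy data using the class sizes from the abstract: 22 minority, 40 majority.
data_rng = random.Random(1)
minority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(22)]
majority = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(40)]

new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), len(majority))  # prints "40 40"
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. In practice a maintained library implementation would normally be preferred over a hand-written one.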


“…It is the process of recognizing patterns, concepts, and other objects in order to better comprehend them and classify them based on incoming data [5]. Classification can help uncover abnormalities when developing a learning model from prior data [4]. There are various classification algorithms, each of which builds a prediction model in a different way.…”

Section: Classification Algorithms (mentioning)

confidence: 99%

“…It can be used to create guesses regarding category variable names [34]. Each branch might be relegated to the training sample category [4]. The decision tree is formulated as follows:…”

Section: Decision Tree (DT) (mentioning)

confidence: 99%

“…In this paper, we studied and analyzed autism data among children affected in the Dohuk governorate using Machine Learning (ML) methods. ML, as a branch of artificial intelligence, is the process of applying computers to real-world problems, to better analyze, train, and model data, and hence perform faster and provide better predictions [4]. For this purpose, a new dataset of autistic children in Duhok was created by collecting data from 515 cases in different centers.…”

Section: Introduction (mentioning)

confidence: 99%

“…ML algorithms are categorized into two main types, supervised (with labeled data) and unsupervised (unlabeled data) [4]. Classification, which is the main core of this study, is one of the most popular methods of supervised learning, which is based on predicting the output from a set of input variables [5].…”

Section: Introduction (mentioning)

confidence: 99%


Analysis and Classification of Autism Data Using Machine Learning Algorithms

Hassan

Taher

2022

SJUOZ


Autism is a neurodevelopmental disorder that affects children worldwide between the ages of 2 and 8 years. Children with autism have communication and social difficulties, and the current standardized clinical diagnosis of autism still relies on behaviour-based tests. The rapidly growing number of autistic patients in the Kurdistan Region of Iraq necessitates. However, such data are scarce, making extensive evaluations of autism screening procedures more difficult. For this purpose, we investigated the use of machine learning algorithms to help health practitioners decide whether a formal clinical diagnosis should be pursued. Autism screening data for young children were collected from 515 patients in Dohuk city. Three classification algorithms (DT, KNN, and ANN) were applied to diagnose and predict autism using various rating scales. Before the classifiers were applied, the newly obtained data set underwent several data preprocessing steps. Since our data are unbalanced and high-dimensional, we suggest combining SMOTE (Synthetic Minority Oversampling Technique) and PCA (Principal Component Analysis) to improve the performance of classification models. Experimental results showed that the combination of PCA and SMOTE methods improved classification performance. Moreover, ANN exceeded the other models in terms of accuracy and F1 score, suggesting that these classification methods could be used to diagnose autism in the future.
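The abstract combines PCA with SMOTE but gives no implementation details or ordering. As a rough illustration of the PCA half only, the sketch below finds the first principal component of a small data set by power iteration on the sample covariance matrix and projects the rows onto it. Power iteration is a simple stand-in chosen for readability, not necessarily what the authors used; the function names `first_component` and `project` and the three-feature toy data are hypothetical.

```python
import math
import random

def first_component(rows, iterations=200):
    """Return (mean, direction) of the top principal component via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, v):
    """Project each centered row onto the unit direction v (one score per row)."""
    return [sum((r[j] - mean[j]) * v[j] for j in range(len(v))) for r in rows]

# Toy data: feature 2 is roughly twice feature 1, feature 3 is small noise.
rng = random.Random(0)
rows = []
for _ in range(50):
    x = rng.gauss(0, 1)
    rows.append([x, 2 * x + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])

mean, direction = first_component(rows)
scores = project(rows, mean, direction)
print(len(scores), [round(c, 2) for c in direction])
```

Here three correlated features collapse into a single score per sample. Oversampling could then be applied to the reduced representation, although whether the authors applied SMOTE before or after PCA is not stated in the abstract.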
