The use of artificial intelligence for cancer prediction and detection from Next Generation Sequencing (NGS) data is gaining wide acceptance in medicine. Next generation sequences were extracted from the NCBI (National Center for Biotechnology Information) gene repository. Sequences of normal Homo sapiens (Class 1), BRCA1 (Class 2) and BRCA2 (Class 3) were extracted for Machine Learning (ML). The datasets extracted for the process totalled 1580, under four categories of 50, 100, 150 and 200 sequences. The breast cancer prediction process was carried out in three major steps: feature extraction, machine learning classification and performance evaluation. Features were extracted with the sequences as input. Ten features of DNA sequences were extracted from the three types of sequences for classification: ORF (Open Reading Frame) count, the individual average nucleobase counts of A, T, C and G, AT-content and GC-content, AT/GC composition, G-quadruplex occurrence and MR (Mutation Rate). The sequence type was added to the feature set as the target variable, with values 0, 1 and 2 for classes 1, 2 and 3 respectively. Nine supervised machine learning techniques were applied to the four dataset categories: LR (Logistic Regression), LDA (Linear Discriminant Analysis), k-NN (k-nearest neighbours), DT (Decision Tree), NB (Naive Bayes), SVM (Support Vector Machine), RF (Random Forest), AB (AdaBoost) and GB (Gradient Boosting). Of all the supervised models, the decision tree performed best, with a maximum classification accuracy of 94.03%. Model performance was evaluated using precision, recall, F1-score and support, of which the F1-score tracked the classification accuracy most closely.
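The feature extraction step described above can be illustrated with a minimal sketch. The abstract does not say how each feature was computed, so everything here is an assumption: `extract_features` and `count_orfs` are hypothetical names, base "average counts" are taken as per-base fractions of the sequence length, AT/GC composition is taken as the ratio of AT-content to GC-content, and an ORF is taken as an in-frame `ATG` followed by a stop codon on the forward strand. G-quadruplex occurrence and mutation rate are left out because the abstract gives no definition to follow.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_orfs(seq):
    """Count ATG...stop stretches in the three forward reading frames.

    Assumed ORF definition for illustration only; the source does not
    state which definition was used.
    """
    count = 0
    for frame in range(3):
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf:
                if codon == "ATG":
                    in_orf = True
            elif codon in STOP_CODONS:
                count += 1
                in_orf = False
    return count


def extract_features(seq):
    """Return a dict of simple composition features for one DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    a, t, c, g = (counts[b] / n for b in "ATCG")
    at_content = a + t
    gc_content = g + c
    return {
        "A": a,
        "T": t,
        "C": c,
        "G": g,
        "AT_content": at_content,
        "GC_content": gc_content,
        # Assumed meaning of "AT/GC composition": ratio of the two contents.
        "AT_GC_ratio": at_content / gc_content if gc_content else float("inf"),
        "ORF_count": count_orfs(seq),
    }


if __name__ == "__main__":
    # Toy sequence, not taken from the NCBI data used in the paper.
    print(extract_features("ATGGCCTAAGC"))
```

A feature vector built this way would be paired with the class value (0, 1 or 2) to form one row of the training data.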
Breast cancer is the leading cancer in women and accounts for millions of deaths worldwide. Early and accurate detection, prognosis, cure, and prevention of breast cancer remain a major challenge. Hence, a precise and reliable system is vital for the classification of cancerous sequences. Machine learning classifiers contribute much to the early prediction and diagnosis of cancer. In this paper, a comparative study of four machine learning classifiers (random forest, decision tree, AdaBoost, and gradient boosting) is carried out for the classification of benign and malignant tumors. The aim of the work is to derive the most efficient machine learning model for the diagnosis of breast cancer, and NCBI datasets are used for this purpose. Performance evaluation is conducted, and all four classifiers are compared based on the results. Gradient boosting outperformed all other models and achieved a classification accuracy of 95.82%.
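As a rough illustration of the kind of performance evaluation used to compare classifiers, the sketch below computes accuracy along with per-class precision, recall, F1-score and support from true and predicted labels. It is not the authors' evaluation code: the function names `accuracy` and `per_class_metrics` are hypothetical, and the labels in the example are made up, with 0 assumed to mean benign and 1 malignant.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Return {label: {precision, recall, f1, support}} for every label seen."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true samples of this class
        }
    return report


if __name__ == "__main__":
    # Made-up labels for illustration: 0 = benign, 1 = malignant (assumed).
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    print("accuracy:", accuracy(y_true, y_pred))
    for label, row in per_class_metrics(y_true, y_pred).items():
        print(label, row)
```

Running the same two functions on each classifier's predictions over a shared test set gives directly comparable figures.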
Breast cancer has become the most frequent cancer worldwide. Machine learning techniques contribute much to cancer prognosis. The prime focus of the work is to improve the prognosis of breast cancer at an earlier stage using an ensemble of machine learning classifiers. Next generation genetic sequences of Homo sapiens, BRCA1 and BRCA2 from the National Center for Biotechnology Information were used for the prediction of breast cancer. The proposed ensemble classifiers use hard voting and soft voting to combine models such as the Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR), Linear Discriminant Analysis, Naive Bayes and k-nearest neighbours classifiers. Five ensemble models were built from these six machine learning classifiers for prediction. The classification accuracy of the hard voting and soft voting ensemble classifiers was evaluated statistically. The soft voting classifier for model 1 (DT & SVM) and model 2 (DT, SVM & LR) achieved the highest values for the classification performance metrics. Among all ensemble models, model 1 and model 2 achieved the maximum classification precision of 94%.
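The difference between hard and soft voting can be shown with a small sketch. It assumes the standard definitions (hard voting takes the majority of predicted labels, soft voting averages the predicted class probabilities and picks the largest) and is not taken from the paper. The function names and the probability values are hypothetical, and ties are broken toward the lowest class index as an arbitrary choice.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Majority vote over one predicted label per classifier.

    Ties are broken toward the lowest label (arbitrary choice).
    """
    counts = Counter(predicted_labels)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def soft_vote(class_probabilities):
    """Average per-classifier probability vectors and pick the best class.

    class_probabilities: list of lists, one probability vector per classifier,
    all of the same length (one entry per class).
    """
    n_classifiers = len(class_probabilities)
    n_classes = len(class_probabilities[0])
    averaged = [
        sum(p[k] for p in class_probabilities) / n_classifiers
        for k in range(n_classes)
    ]
    # max() returns the first maximal index, so ties go to the lowest class.
    return max(range(n_classes), key=lambda k: averaged[k])


if __name__ == "__main__":
    # Made-up probabilities from three classifiers over classes 0, 1, 2
    # (normal, BRCA1, BRCA2 in the labelling used by these studies).
    probs = [
        [0.90, 0.10, 0.00],  # confident in class 0
        [0.40, 0.60, 0.00],  # weakly prefers class 1
        [0.45, 0.55, 0.00],  # weakly prefers class 1
    ]
    labels = [max(range(3), key=lambda k: p[k]) for p in probs]
    print("hard vote:", hard_vote(labels))
    print("soft vote:", soft_vote(probs))
```

In this toy case the two schemes disagree: two weak votes outnumber one confident vote under hard voting, while soft voting lets the confident classifier dominate the average.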
Next Generation Sequencing is essential to better approaches for predicting and curing diseases with a high success rate in a reasonable time. Modern technologies such as machine learning support medical research with high speed and accuracy, from disease prediction to cure. In this paper, the supervised learning model Support Vector Machine (SVM) is applied to next generation sequences for the prediction of breast cancer. Ten basic features of DNA sequences are used to build the feature vector: the individual average nucleobase counts of A, G, C and T, AT-content and GC-content, AT/GC composition, G-quadruplex occurrence, ORF (Open Reading Frame) count and MR (Mutation Rate). The feature vectors, together with the class value, form the dataset for supervised learning. The class value is '0' for normal sequences, '1' for BRCA1 cancer sequences and '2' for BRCA2 cancer sequences. Four categories of datasets are prepared, with 50, 100, 150 and 200 sequences for each class (normal, BRCA1 and BRCA2). As the dataset size increased, the outliers, distribution and scatter of the data were also analysed. The datasets are split into training and testing sets in an 80:20 ratio for classification. An SVM model in Python is applied for supervised classification.
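The dataset layout and the 80:20 split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: `stratified_split` is a hypothetical helper, the split is done per class so that each class keeps the 80:20 ratio (the abstract does not say whether the split was stratified), and the seed is arbitrary.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately.

    Per-class splitting is an assumption; the source only states 80:20.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Class values follow the text: 0 = normal, 1 = BRCA1, 2 = BRCA2.
    for per_class in (50, 100, 150, 200):
        labels = [0] * per_class + [1] * per_class + [2] * per_class
        train_idx, test_idx = stratified_split(labels)
        print(per_class, "per class ->",
              len(train_idx), "train /", len(test_idx), "test")
```

The training indices would then select the feature vectors passed to an SVM implementation, for example scikit-learn's `SVC`; the abstract says only that an SVM model in Python was used, so the specific library is also an assumption.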