The application of artificial intelligence to cancer prediction and detection from Next Generation Sequencing (NGS) data is of growing interest in the medical field. Next generation sequences were extracted from the NCBI (National Center for Biotechnology Information) gene repository. Sequences of normal Homo sapiens (Class 1), BRCA1 (Class 2) and BRCA2 (Class 3) were extracted for Machine Learning (ML) purposes. A total of 1580 datasets were extracted, under four categories of 50, 100, 150 and 200 sequences. The breast cancer prediction process was carried out in three major steps: feature extraction, machine learning classification and performance evaluation. Features were extracted with the sequences as input. Ten features of the DNA sequences were extracted from the three types of sequences for the classification process: ORF (Open Reading Frame) count, the individual nucleobase average counts of A, T, C and G, AT-content, GC-content, AT/GC composition, G-quadruplex occurrence and MR (Mutation Rate). The sequence type was added to the feature set as the target variable, with values 0, 1 and 2 for Classes 1, 2 and 3 respectively. Nine supervised machine learning techniques were applied to the four categories of datasets: LR (Logistic Regression), LDA (Linear Discriminant Analysis), k-NN (k-nearest neighbours), DT (Decision Tree), NB (Naive Bayes), SVM (Support-Vector Machine), RF (Random Forest), AB (AdaBoost) and GB (Gradient Boosting). Of all the supervised models, the decision tree performed best, with a maximum classification accuracy of 94.03%. Classification model performance was evaluated using precision, recall, F1-score and support values, of which the F1-score was closest to the classification accuracy.
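The feature-extraction step described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the forward-strand ORF definition (ATG to the first in-frame stop codon), the G-quadruplex motif (four runs of three or more G separated by loops of 1 to 7 bases) and the reading of "average count" as a per-base fraction are all assumptions. The mutation rate is left out because the abstract does not say how it was computed.

```python
import re

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
# Assumed G-quadruplex motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+
G4_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")


def count_orfs(seq):
    """Count forward-strand ORFs (ATG ... in-frame stop) over the three frames."""
    count = 0
    for frame in range(3):
        in_orf = False
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if not in_orf:
                in_orf = codon == START
            elif codon in STOPS:
                count += 1
                in_orf = False
    return count


def extract_features(seq, label):
    """Return a feature dict for one DNA sequence; `label` is the class (0, 1 or 2)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    # Per-base fraction of the sequence (assumed meaning of "average count").
    frac = {base: seq.count(base) / n for base in "ATCG"}
    at = frac["A"] + frac["T"]
    gc = frac["G"] + frac["C"]
    return {
        "orf_count": count_orfs(seq),
        "A": frac["A"],
        "T": frac["T"],
        "C": frac["C"],
        "G": frac["G"],
        "at_content": at,
        "gc_content": gc,
        "at_gc_ratio": at / gc if gc else float("inf"),
        "g4_count": len(G4_PATTERN.findall(seq)),
        "target": label,
    }


if __name__ == "__main__":
    # Toy sequence, not real NCBI data.
    print(extract_features("ATGAAATAGGGGAGGGAGGGAGGG", 0))
```

Each returned dictionary corresponds to one row of the feature set, with `target` holding the class value 0, 1 or 2.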
Breast cancer is the leading cancer in women and accounts for millions of deaths worldwide. Early and accurate detection, prognosis, cure, and prevention of breast cancer are a major challenge to society. Hence, a precise and reliable system is vital for the classification of cancerous sequences. Machine learning classifiers contribute much to the early prediction and diagnosis of cancer. In this paper, a comparative study of four machine learning classifiers (random forest, decision tree, AdaBoost, and gradient boosting) is implemented for the classification of benign and malignant tumors. NCBI datasets are used to derive the most efficient machine learning model for the diagnosis of breast cancer. Performance evaluation is conducted, and all four classifiers are compared based on the results. Gradient boosting outperformed all other models and achieved a classification accuracy of 95.82%.
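A comparison of this kind amounts to scoring each classifier's predictions against the same true labels and ranking the results. The sketch below shows one way to do that in plain Python. The function names and the toy labels are hypothetical, and the per-class precision, recall and F1 are an assumption, since this abstract reports only accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists differ in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every class label."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true samples of class c
        }
    return report


def best_model(y_true, predictions):
    """Return (name, accuracy) of the model with the highest accuracy.

    `predictions` maps a model name to its list of predicted labels.
    """
    scores = {name: accuracy(y_true, pred) for name, pred in predictions.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Toy labels: 0 = benign, 1 = malignant (illustrative only).
    truth = [0, 0, 1, 1]
    preds = {"model_a": [0, 1, 1, 1], "model_b": [0, 0, 1, 1]}
    print(best_model(truth, preds))
    print(per_class_report(truth, preds["model_a"]))
```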
Breast cancer has become the most frequent cancer worldwide. Machine learning techniques contribute much to cancer prognosis. The prime focus of the work is to enhance the prognosis of breast cancer at an earlier stage using an ensemble of machine learning classifiers. Next generation genetic sequences of Homo sapiens, BRCA1 and BRCA2 from the National Center for Biotechnology Information were derived for the prediction of breast cancer. The proposed ensemble classifiers use hard voting and soft voting to combine models such as the Decision Tree (DT) technique, the SVM algorithm, the LR statistical model, the Linear Discriminant Analysis model, the Naive Bayes classifier and the k-nearest neighbours algorithm. Five ensemble models were built by combining the six machine learning classifiers for the prediction purpose. The classification accuracy of the ensemble hard voting and soft voting classifiers was evaluated statistically. The soft voting classifier for model 1 (DT & SVM) and model 2 (DT, SVM & LR) achieved the highest values of the classification performance metrics. Among all ensemble models, model 1 and model 2 achieved the maximum classification precision of 94%.
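The difference between the two voting schemes can be illustrated with a small sketch. Hard voting takes the majority of the predicted labels, while soft voting averages the predicted class probabilities and picks the class with the highest mean. The code below is a plain-Python illustration under those standard definitions, not the authors' implementation; the function names, the tie-breaking rule (smallest label wins) and the toy inputs are assumptions.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority label per sample.

    `predictions` is a list with one label list per model.
    Ties go to the smallest label (a convention chosen for this sketch).
    """
    voted = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


def soft_vote(probabilities, weights=None):
    """Class with the highest (weighted) mean probability per sample.

    `probabilities` is a list with one entry per model; each entry holds,
    for every sample, a list of class probabilities.
    """
    weights = weights or [1.0] * len(probabilities)
    total = sum(weights)
    voted = []
    for sample_probs in zip(*probabilities):
        n_classes = len(sample_probs[0])
        mean = [
            sum(w * p[k] for w, p in zip(weights, sample_probs)) / total
            for k in range(n_classes)
        ]
        voted.append(max(range(n_classes), key=mean.__getitem__))
    return voted


if __name__ == "__main__":
    # Toy outputs for three samples and classes 0, 1, 2 (illustrative only).
    dt = [0, 1, 2]
    svm = [0, 2, 2]
    lr = [1, 1, 2]
    print(hard_vote([dt, svm, lr]))

    dt_proba = [[0.9, 0.1, 0.0], [0.4, 0.6, 0.0]]
    svm_proba = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]
    print(soft_vote([dt_proba, svm_proba]))
```

With only two models, as in model 1 (DT & SVM), hard voting is decided by the tie-break whenever the two disagree, whereas soft voting can still favour the more confident model.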