<div id="titleAndAbstract"><p class="0abstract">Breast cancer poses one of the greatest threats to human life, and to women's lives in particular. Despite recent progress in data mining, predicting and diagnosing such fatal diseases from gene expression data still yields limited performance, which is not surprising given that most genes in expression data are believed to be irrelevant or redundant. Dimensionality reduction is therefore a crucial step in analyzing gene expression data: reducing the high dimensionality of breast cancer datasets may improve prediction performance. This paper proposes a new hybrid gene selection approach that combines a filter method with the Ant Colony Optimization algorithm to find the smallest subset of informative genes (gene markers) among 24,481 genes. The approach uses four machine learning algorithms (C5.0 Decision Tree, Support Vector Machines, K-Nearest Neighbors, and Random Forest) to classify each sample (patient) into one of two classes: cancer or no cancer. Compared with existing methods in the literature, experimental results indicate that the proposed gene selection approach achieves higher overall classification accuracy with a relatively smaller number of genes.</p></div>
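To make the Ant Colony Optimization step in the abstract above more concrete, here is a minimal sketch of an ant colony search over gene subsets. It is not the authors' implementation: the function `aco_select`, its parameters (`n_ants`, `n_iters`, `evaporation`, `tau_min`), and the toy fitness function are all assumptions made for illustration. In the paper's setting the fitness would presumably be the accuracy of a classifier trained on the candidate genes; here a hypothetical stand-in score is used so the example stays self-contained.

```python
import random

def aco_select(n_genes, fitness, subset_size, n_ants=10, n_iters=20,
               evaporation=0.2, tau_min=0.01, seed=0):
    """Search for a high-fitness gene subset with a simple ant colony loop.

    fitness: callable taking a list of gene indices and returning a score
             in [0, 1] (higher is better). This range is an assumption of
             the sketch; it keeps pheromone values positive.
    """
    rng = random.Random(seed)
    pheromone = [1.0] * n_genes          # one pheromone value per gene
    best_subset, best_score = None, -1.0

    for _ in range(n_iters):
        for _ in range(n_ants):
            # Each ant picks genes without replacement, biased by pheromone.
            candidates = list(range(n_genes))
            subset = []
            for _ in range(subset_size):
                weights = [pheromone[g] for g in candidates]
                gene = rng.choices(candidates, weights=weights, k=1)[0]
                candidates.remove(gene)
                subset.append(gene)
            score = fitness(subset)
            if score > best_score:
                best_subset, best_score = sorted(subset), score

        # Evaporate all trails, then reinforce the best subset found so far.
        pheromone = [max(tau_min, (1.0 - evaporation) * p) for p in pheromone]
        for gene in best_subset:
            pheromone[gene] += best_score

    return best_subset, best_score


# Hypothetical stand-in fitness: fraction of "informative" genes recovered.
INFORMATIVE = {3, 7}

def toy_fitness(subset):
    return len(INFORMATIVE.intersection(subset)) / len(INFORMATIVE)

best_subset, best_score = aco_select(n_genes=10, fitness=toy_fitness, subset_size=2)
```

In a hybrid scheme like the one described, a filter would first shrink the 24,481 genes to a short candidate list, and a search of this kind would then run only over those candidates, which keeps the number of fitness evaluations manageable.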
Predicting and diagnosing cancer tumors at an early stage has become a necessity in cancer research, as it increases the chances of successful treatment. Recently, DNA microarray technology has become a powerful tool for cancer identification, since it can analyze the expression levels of a very large number of different genes simultaneously. In microarray data, the large number of genes relative to the few available records may degrade prediction performance. To handle this "curse of dimensionality" in microarray datasets while improving cancer identification performance, a dimensionality reduction phase is necessary. In this paper, we propose a framework that combines dimensionality reduction methods and machine learning algorithms to achieve the best cancer prediction performance on different microarray datasets. In the dimensionality reduction phase, we propose a combination of feature selection and feature extraction techniques. Pearson and Ant Colony Optimization were used to select the most important genes. Principal Component Analysis and Kernel Principal Component Analysis were used to transform the selected genes, linearly and non-linearly respectively, into a new reduced space. In the cancer identification phase, we use four algorithms: C5.0, Logistic Regression, Artificial Neural Network, and Support Vector Machine. Experimental results demonstrate that the framework performs effectively and competitively compared with state-of-the-art methods.
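As an illustration of the filter stage of the feature selection phase, the sketch below ranks genes by the absolute value of their correlation with the class label and keeps the top few. It assumes that "Pearson" in the abstract refers to the Pearson correlation coefficient; the function names `pearson` and `pearson_filter`, the `top_k` parameter, and the toy data are hypothetical and not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def pearson_filter(expression, labels, top_k):
    """Rank genes by |r| with the class label and return the top_k indices.

    expression: list of samples, each a list of gene expression values.
    labels: list of 0/1 class labels, one per sample.
    """
    n_genes = len(expression[0])
    scores = []
    for g in range(n_genes):
        column = [sample[g] for sample in expression]
        scores.append((abs(pearson(column, labels)), g))
    # Highest |r| first; ties broken by gene index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scores[:top_k]]

# Toy data (made up): 6 samples x 4 genes. Gene 0 tracks the label,
# gene 2 mirrors it, gene 1 is uncorrelated, gene 3 is constant.
toy_expression = [
    [0.1, 5.0, 0.9, 2.0],
    [0.2, 1.0, 0.8, 2.0],
    [0.1, 3.0, 0.9, 2.0],
    [0.9, 4.0, 0.1, 2.0],
    [0.8, 2.0, 0.2, 2.0],
    [0.9, 3.0, 0.1, 2.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]
selected = pearson_filter(toy_expression, toy_labels, top_k=2)
```

Using the absolute value means that both positively and negatively correlated genes are kept. The genes that survive this filter would then be passed to the Ant Colony Optimization search and, afterwards, to the PCA or Kernel PCA projection.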
ABSTRACT (translated from the Malay): Predicting and diagnosing cancer tumors at an early stage has become a necessity in cancer research, as it opens the way to greater treatment success. Recently, DNA microarray technology has become a powerful tool for identifying cancer, as it can analyze the expression levels of a large number of different genes simultaneously. In microarray data, this large number of genes relative to the small number of records may affect prediction performance. A dimensionality reduction phase is needed to handle the "curse of dimensionality" constraint of microarray datasets while strengthening the effectiveness of cancer identification. This study proposes a framework that combines dimensionality reduction methods and machine learning algorithms to achieve the best cancer prediction performance using various microarray datasets. In the dimensionality reduction phase, a combination of feature selection and feature extraction techniques is proposed: Pearson and Ant Colony Optimization to select the most important genes, and Principal Component Analysis and Kernel Principal Component Analysis to transform the selected genes, linearly and non-linearly, into a new reduced space. In the cancer identification phase, this study proposes four algorithms, namely C5.0, Logistic Regression, Artificial Neural Network, and Support Vector Machine. The findings show that the framework is effective and competitive compared with current methods.