Feature selection methods and genomic big data: a systematic review

Tadist, Khawla; Arivazhagan, S.; Nikolov, Nikola S.; Mrabti, Fatiha; Zahi, Azeddine

doi:10.1186/s40537-019-0241-0

Cited by 106 publications

(58 citation statements)

References 46 publications


“…The amount of data generated by high-throughput sequencing technologies [115] represents a challenge in genomic prediction, particularly due to the difficulty of working with high-dimensional datasets, i.e., the 'large p, small n' problem [116]. This increase in the amount of available information makes the task of directly applying these marker data in genomic analyses more difficult and necessitates appropriate preprocessing steps [117]. In this study, we proposed the use of FS techniques to select a smaller set of SNPs with more predictive power than the entire dataset and closer associations with the brown rust phenotype to assist the identification of regions associated with disease status.…”

Section: Discussion (mentioning)

confidence: 99%

Machine learning approaches reveal genomic regions associated with sugarcane brown rust resistance

Aono, Costa, Rody, et al. 2020

Preprint


Sugarcane is an economically important crop, but its genomic complexity has hindered advances in molecular approaches for genetic breeding. New cultivars are released based on the identification of interesting traits, and for sugarcane, brown rust resistance is a desirable characteristic due to the large economic impact of the disease. Although marker-assisted selection for rust resistance has been successful, the genes involved are still unknown, and the associated regions vary among cultivars, thus restricting methodological generalization. We used genotyping by sequencing of full-sib progeny to relate genomic regions with brown rust phenotypes. We established a pipeline to identify reliable SNPs in complex polyploid data, which were used for phenotypic prediction via machine learning. We identified 14,540 SNPs, which led to a mean prediction accuracy of 50% by using different models. We also tested feature selection algorithms to increase predictive accuracy, resulting in a reduced dataset with more explanatory power for rust phenotypes. Using different feature selection techniques, we achieved accuracy of up to 95% with a dataset of 131 SNPs related to brown rust QTL regions and auxiliary genes. Therefore, our novel strategy has the potential to assist studies of the genomic organization of brown rust resistance in sugarcane.
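The idea of shrinking a large SNP panel to a small, more predictive subset can be illustrated with a minimal filter-style sketch. This is not the pipeline used in the cited preprint, whose feature selection algorithms are not named in the abstract. The scoring rule (absolute Pearson correlation with the phenotype), the function names `pearson` and `select_top_k`, and the toy genotype values are all assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_top_k(genotypes, phenotype, k):
    """Return the indices of the k SNP columns with the largest |correlation| to the phenotype."""
    n_snps = len(genotypes[0])
    scores = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        scores.append((abs(pearson(column, phenotype)), j))
    # Highest score first; ties are broken by the lower column index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return sorted(j for _, j in scores[:k])

# Toy data (invented): 6 individuals x 4 SNPs coded as allele dosage 0/1/2.
genotypes = [
    [0, 2, 1, 0],
    [0, 1, 1, 2],
    [1, 2, 1, 0],
    [1, 0, 1, 1],
    [2, 1, 1, 2],
    [2, 0, 1, 1],
]
phenotype = [0, 0, 0, 1, 1, 1]  # hypothetical coding: 1 = resistant, 0 = susceptible

selected = select_top_k(genotypes, phenotype, k=2)
print(selected)
```

A filter of this kind scores each marker independently, so it is cheap on thousands of SNPs but ignores interactions between markers; wrapper and embedded methods trade extra computation for that.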

“…Feature selection (filter, wrapper and embedded) [7][8][9] and feature extraction [10][11][12] (supervised and unsupervised) are dimensionality reduction approaches that have been established, these approaches have overcome several problems such as performance enhancement, yet there is need for improvements hybrid model and optimization for getting better results [13]. Finding an optimal subset of genes proficient at handling high dimension optimization difficulties with reasonable solutions is required [5].…”

Section: Introduction (mentioning)

confidence: 99%

Optimized hybrid investigative based dimensionality reduction methods for malaria vector using KNN classifier

Arowolo, Adebiyi, et al. 2021

J Big Data


RNA-Seq data are utilized for biological applications and decision making for the classification of genes. A lot of works in recent time are focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed in the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset, to enhance the accuracy and scalability in the gene expression analysis. The proposed algorithm is used to fetch relevant features based on the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performances of the model are evaluated and validated using the classification accuracy to compare existing approaches in the literature. The achieved experimental results prove to be promising for selecting relevant genes and classifying pertinent gene expression data analysis by indicating that the approach is capable of adding to prevailing machine learning methods.
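The final stage described above, a KNN classifier applied to a reduced feature set, can be sketched in a few lines. This is only an illustration under stated assumptions: the kept columns are chosen by hand here and stand in for whatever subset an optimizer would return, the GA, PCA, and ICA stages of the cited paper are not reproduced, and the helper names `keep_columns` and `knn_predict` and the toy expression values are hypothetical.

```python
from collections import Counter
from math import dist

def keep_columns(rows, columns):
    """Project each sample onto the given feature indices."""
    return [[row[j] for j in columns] for row in rows]

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query (Euclidean distance)."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy data (invented): column 1 is noise unrelated to the class labels.
train_X = [
    [1.0, 9.0, 0.2],
    [1.2, 1.0, 0.1],
    [0.9, 5.0, 0.3],
    [3.0, 8.0, 2.1],
    [3.2, 2.0, 2.0],
    [2.9, 4.0, 2.2],
]
train_y = ["A", "A", "A", "B", "B", "B"]
query = [3.1, 9.0, 2.0]

kept = [0, 2]  # assumed output of a feature selection step
reduced_prediction = knn_predict(keep_columns(train_X, kept), train_y,
                                 keep_columns([query], kept)[0])
full_prediction = knn_predict(train_X, train_y, query)
print(reduced_prediction, full_prediction)
```

On this constructed example the noisy column pulls two class "A" samples into the neighbourhood when all features are used, while the reduced representation votes "B". The data were built to show that effect and say nothing about the results reported in the paper.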


“…Whenever the needed number of training examples cannot be provided, reducing features decreases the size of the needed training examples and hence increases the overall yield shape of the classification algorithm. In the previous years, two methods for dimensional reduction were presented: feature selection and feature extraction [4,5]. Feature selection (FS) seeks for a relevant subset of existing features, while features are designed for a new space of lower dimensionality in the feature extraction method.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Community Detection Based Genetic Algorithm for Feature Selection

Rostami, Berahmand, Forouzandeh 2020

Preprint


The selection of features is an essential data preprocessing stage in data mining. The core principle of feature selection seems to be to pick a subset of possible features by excluding features with almost no predictive information as well as highly associated redundant features. In the past several years, a variety of meta-heuristic methods were introduced to eliminate redundant and irrelevant features as much as possible from high-dimensional datasets. Among the main disadvantages of present meta-heuristic based approaches is that they are often neglecting the correlation between a set of selected features. In this article, for the purpose of feature selection, the authors propose a genetic algorithm based on community detection, which functions in three steps. The feature similarities are calculated in the first step. The features are classified by community detection algorithms into clusters throughout the second step. In the third step, features are picked by a genetic algorithm with a new community-based repair operation. Nine benchmark classification problems were analyzed in terms of the performance of the presented approach. Also, the authors have compared the efficiency of the proposed approach with the findings from four available algorithms for feature selection. The findings indicate that the new approach continuously yields improved classification accuracy.
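The three-step outline above (feature similarities, grouping of similar features, then selection that respects the groups) can be sketched with deliberately simpler stand-ins. The assumptions are significant: absolute Pearson correlation is used as the similarity, connected components of a thresholded similarity graph replace a real community detection algorithm, and picking the most label-relevant feature per group replaces the genetic algorithm and its repair operation. The function names and toy values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def group_features(cols, threshold):
    """Steps 1-2 (simplified): link features with |r| >= threshold, return connected components."""
    n = len(cols)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(cols[i], cols[j])) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def pick_representatives(cols, labels, groups):
    """Step 3 (simplified): keep the feature most correlated with the labels in each group."""
    return sorted(max(g, key=lambda j: abs(pearson(cols[j], labels))) for g in groups)

# Toy data (invented): each inner list is one feature observed on 6 samples.
features = [
    [1, 1, 2, 5, 6, 6],  # f0: strongly related to the labels
    [1, 2, 3, 4, 5, 6],  # f1: largely redundant with f0
    [3, 1, 2, 3, 1, 2],  # f2: unrelated to the labels
    [6, 2, 4, 6, 2, 5],  # f3: largely redundant with f2
]
labels = [0, 0, 0, 1, 1, 1]

groups = group_features(features, threshold=0.9)
chosen = pick_representatives(features, labels, groups)
print(groups, chosen)
```

Grouping before selecting is what addresses the weakness noted in the abstract: a selector that scores features one at a time could keep both f0 and f1, whereas a group-aware one keeps at most one of each redundant pair.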
