Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms

Orthologs are genes in different species that evolved from a common ancestor. Ortholog detection is essential for studying phylogenies and for predicting the function of unknown genes. The scalability of pairwise gene (or protein) comparisons and of the classification process is a challenge because the number of sequenced genomes keeps increasing. Ortholog detection algorithms based on sequence similarity alone tend to misclassify, particularly in Saccharomycete yeasts with rampant paralogies and gene losses. This book chapter proposes a new classification approach that combines pairwise similarity measures in a decision system that accounts for the extreme imbalance between ortholog and non-ortholog pairs. New gene pair similarity measures are defined based on protein physicochemical profiles, gene pair membership in conserved regions of related genomes, and protein lengths. The efficiency and scalability of calculating these measures are analyzed in order to propose their implementation for big data. The evaluated supervised algorithms that handle big and imbalanced data showed high effectiveness in Saccharomycete yeast genomes.


“…The results of the Friedman test [39] for the AUC measure with the four datasets of S. cerevisiae -C.…”

Section: Results (mentioning)

confidence: 99%

Big Data Supervised Pairwise Ortholog Detection in Yeasts

Cañizares

Barrio-García

Herrera

et al. 2017

Yeast - Industrial Applications


“…The statistical analysis was performed using the Mann-Whitney U (Wilcoxon rank-sum test) test-two-sided test, provided by the R software environment for statistical computing (R Core Team, 2013). This test was chosen since, according to Trawiński et al (2012), it is more sensible than the t-test when the number of observations is small (10 in our case).…”

Section: 3 (mentioning)

confidence: 99%
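The citation statement above describes a two-sided Mann-Whitney U (Wilcoxon rank-sum) test on small samples of 10 observations. The following is a minimal standard-library sketch of that kind of test, not the citing authors' R code. It assumes a normal approximation for the p-value without continuity or tie correction, so on very small samples it may differ from the exact p-value R can report. The function name `mann_whitney_u` and the sample values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction).

    Returns (U statistic for sample x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    # Rank sum of the first sample gives its U statistic.
    r1 = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    # Normal approximation of U under the null hypothesis.
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_two_sided


# Hypothetical accuracies from 10 runs of two methods (illustrative values only).
method_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.82]
method_b = [0.76, 0.78, 0.75, 0.79, 0.77, 0.74, 0.78, 0.76, 0.80, 0.75]
u, p = mann_whitney_u(method_a, method_b)
print(f"U = {u}, approximate two-sided p = {p:.4f}")
```

Because the test uses only ranks, it does not assume normally distributed observations, which is the usual argument for preferring it to the t-test when samples are this small.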

A differential evolution approach to dimensionality reduction for classification needs

Martinović, Bajer, Zorić

2014

International Journal of Applied Mathematics and Computer Science


The feature selection problem often occurs in pattern recognition and, more specifically, classification. Although these patterns could contain a large number of features, some of them could prove to be irrelevant, redundant or even detrimental to classification accuracy. Thus, it is important to remove these kinds of features, which in turn leads to problem dimensionality reduction and could eventually improve the classification accuracy. In this paper an approach to dimensionality reduction based on differential evolution which represents a wrapper and explores the solution space is presented. The solutions, subsets of the whole feature set, are evaluated using the k-nearest neighbour algorithm. High quality solutions found during execution of the differential evolution fill the archive. A final solution is obtained by conducting k-fold crossvalidation on the archive solutions and selecting the best one. Experimental analysis is conducted on several standard test sets. The classification accuracy of the k-nearest neighbour algorithm using the full feature set and the accuracy of the same algorithm using only the subset provided by the proposed approach and some other optimization algorithms which were used as wrappers are compared. The analysis shows that the proposed approach successfully determines good feature subsets which may increase the classification accuracy.


“…Ramon and De Raedt (2000) adapted neural networks (Trawiński et al, 2012) to the MIL setting via taking into account the relation of a bag to its instances. Zhang and Zhou (2004) later derived a similar framework.…”

Section: Related Work (mentioning)

confidence: 99%

Multiple-instance learning with pairwise instance similarity

Yuan, Liu, Tang

2014

International Journal of Applied Mathematics and Computer Science


Multiple-Instance Learning (MIL) has attracted much attention of the machine learning community in recent years and many real-world applications have been successfully formulated as MIL problems. Over the past few years, several Instance Selection-based MIL (ISMIL) algorithms have been presented by using the concept of the embedding space. Although they delivered very promising performance, they often require long computation times for instance selection, leading to a low efficiency of the whole learning process. In this paper, we propose a simple and efficient ISMIL algorithm based on the similarity of pairwise instances within a bag. The basic idea is selecting from every training bag a pair of the most similar instances as instance prototypes and then mapping training bags into the embedding space that is constructed from all the instance prototypes. Thus, the MIL problem can be solved with the standard supervised learning techniques, such as support vector machines. Experiments show that the proposed algorithm is more efficient than its competitors and highly comparable with them in terms of classification accuracy. Moreover, the testing of noise sensitivity demonstrates that our MIL algorithm is very robust to labeling noise.
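The basic idea in the abstract above (pick the most similar pair of instances in each training bag as prototypes, then map bags into an embedding space built from all prototypes) can be sketched as follows. This is an illustrative sketch, not the authors' algorithm: it assumes Euclidean distance as the similarity measure and a Gaussian max-similarity embedding, neither of which is specified in the abstract. The names `select_prototypes`, `embed_bag`, and `sigma` are hypothetical.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two instances (assumed similarity measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def select_prototypes(bags):
    """From every bag, take the closest pair of instances as prototypes."""
    prototypes = []
    for bag in bags:
        if len(bag) < 2:
            # Assumption: a bag with a single instance contributes that instance.
            prototypes.extend(bag)
            continue
        a, b = min(combinations(bag, 2), key=lambda pair: distance(*pair))
        prototypes.extend([a, b])
    return prototypes


def embed_bag(bag, prototypes, sigma=1.0):
    """Map a bag to one feature per prototype.

    Assumed mapping: feature k is the largest Gaussian similarity between
    prototype k and any instance in the bag.
    """
    return [
        max(math.exp(-distance(x, p) ** 2 / sigma ** 2) for x in bag)
        for p in prototypes
    ]


# Two toy bags of 2-D instances (illustrative values only).
bags = [
    [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)],
    [(10.0, 10.0), (10.0, 10.5), (3.0, 3.0)],
]
prototypes = select_prototypes(bags)
features = [embed_bag(bag, prototypes) for bag in bags]
print(len(prototypes), "prototypes;", len(features[0]), "features per bag")
```

Each bag becomes a fixed-length vector, so a standard supervised learner such as a support vector machine can be trained on the embedded bags and their labels.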


Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms

Cited by 139 publications

References 44 publications
