Abstract: We tested our proposed method on two well-known yeast microarray data sets. Our implementation mined the data effectively and discovered patterns of co-regulated genes that are hidden to traditional APD methods. The high content of biologically relevant information in these patterns is demonstrated by the significant enrichment of co-regulated genes with similar functions. Our experimental results show that the Mining Attribute Profile (MAP) method is an efficient tool for the analysis of gene expression data …
“…The main reason is that features which have been labeled ‘redundancy’ may not be real redundancy. For example, in microarray data analysis, genes normally function in gene groups [36][37][38] in which genes are highly correlated and each gene cannot function apart from one another. Therefore, the traditional criteria are unsuitable for such applications.…”
“…Searching groups of similar features is usually done with the help of various clustering techniques, frequently specially tailored to a task at hand. See [Smyth et al (2003), Hastie et al (2001), Saeys et al (2007), Gyenesei, A. et al (2007)] and the literature there.…”
Applications in the life sciences are the main force behind a paradigm shift in the way machine learning techniques are used. Rather than obtaining the best possible supervised classifier, the life scientist needs to know which features contribute most to classifying observations into distinct classes and what the interdependencies between the features are. To this end we significantly extend our earlier work [Dramiński et al. (2008)], which introduced an effective and reliable method for ranking features according to their importance for classification. We begin by adding a method for finding a cut-off between informative and non-informative features, and then develop a methodology and implement a procedure for determining interdependencies between informative features. The reliability of our approach rests on the construction of many tree classifiers. Essentially, each classifier is trained on a randomly chosen subset of the original data using only a fraction of all of the observed features. This approach is conceptually simple yet computer-intensive. The methodology is validated on a large and difficult task of modelling HIV-1 reverse transcriptase resistance to drugs, which is a good example of the aforementioned paradigm shift. We construct a classifier, but the main interest is the identification of mutation points (i.e. features) and their combinations that model drug resistance.
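The resampling idea in this abstract (many classifiers, each trained on a random subset of the observations and a random fraction of the features) can be illustrated with a small sketch. This is not the authors' implementation: a one-feature decision stump is swapped in for the tree classifier to keep the example short, the scoring rule (summed held-out accuracy of the feature each stump picks) is a simplification, and all names and parameters (`rank_features`, `n_rounds`, `n_sub`, `train_frac`) are hypothetical.

```python
import random

def stump_fit(rows, labels, features):
    """Pick the binary feature (and polarity) that best matches the labels."""
    best = None
    for f in features:
        acc = sum(1 for r, y in zip(rows, labels) if r[f] == y) / len(rows)
        flip = acc < 0.5              # predict the inverted value if that fits better
        acc = max(acc, 1.0 - acc)
        if best is None or acc > best[0]:
            best = (acc, f, flip)
    return best[1], best[2]

def stump_accuracy(rows, labels, feature, flip):
    """Accuracy of a fitted stump on the given observations."""
    hits = sum(1 for r, y in zip(rows, labels) if (r[feature] ^ flip) == y)
    return hits / len(rows)

def rank_features(rows, labels, n_rounds=200, n_sub=2, train_frac=0.66, seed=0):
    """Rank features by accumulated held-out accuracy over random resamples.

    Assumption: a stump stands in for the tree classifier of the abstract.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    scores = [0.0] * n_features
    idx = list(range(len(rows)))
    for _ in range(n_rounds):
        feats = rng.sample(range(n_features), n_sub)   # random fraction of features
        rng.shuffle(idx)                               # random split of observations
        cut = int(train_frac * len(idx))
        train, test = idx[:cut], idx[cut:]
        f, flip = stump_fit([rows[i] for i in train],
                            [labels[i] for i in train], feats)
        scores[f] += stump_accuracy([rows[i] for i in test],
                                    [labels[i] for i in test], f, flip)
    ranking = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranking, scores

# Toy data: feature 0 equals the class label, features 1-4 are random noise.
data_rng = random.Random(1)
labels = [data_rng.randint(0, 1) for _ in range(60)]
rows = [[y] + [data_rng.randint(0, 1) for _ in range(4)] for y in labels]

ranking, scores = rank_features(rows, labels)
print("feature ranking:", ranking)
```

On this toy data the informative feature should accumulate a much higher score than the noise features, which is the kind of gap a cut-off between informative and non-informative features would exploit.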
“…Real data come from various biological studies previously used as reference data in biclustering research [25–28]. For the comparison of the computational efficiency, all biological data sets were binarized.…”
During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). A limitation common to both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available for researchers.
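The abstract does not spell out the algorithm, so the following is only a sketch of how matrix-vector products can compute the two basic quantities involved: the rows supporting an itemset, and the closure of that itemset. Taken together, the supporting rows and the closed itemset form an all-ones bicluster. The function names (`matvec`, `supporting_rows`, `closure`) and the toy matrix are assumptions made for illustration, and plain Python lists are used instead of MATLAB.

```python
def matvec(matrix, vec):
    """Plain matrix-vector product over lists of numbers."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def supporting_rows(data, itemset):
    """0/1 indicator of the rows that contain every item of `itemset`.

    `itemset` is a 0/1 vector over columns. A row supports it exactly when
    the product data @ itemset equals the number of items in the itemset.
    """
    size = sum(itemset)
    return [1 if hits == size else 0 for hits in matvec(data, itemset)]

def closure(data, itemset):
    """Closed itemset containing `itemset`, plus its supporting rows.

    A column belongs to the closure when every supporting row has a 1 in it,
    i.e. when data^T @ rows equals the number of supporting rows. With no
    supporting rows, every column qualifies (the usual convention).
    """
    rows = supporting_rows(data, itemset)
    n_rows = sum(rows)
    transposed = [list(col) for col in zip(*data)]
    col_counts = matvec(transposed, rows)
    closed = [1 if c == n_rows else 0 for c in col_counts]
    return closed, rows

# Toy binary matrix: 4 observations (rows) x 4 items (columns).
data = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]

closed, rows = closure(data, [1, 0, 0, 0])  # start from the itemset {column 0}
print("closed itemset:", closed)
print("supporting rows:", rows)
```

Here column 0 always co-occurs with column 1, so the closure of {column 0} is {column 0, column 1}, supported by the first three rows.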