Cancer research is a challenging and competitive field. The study of gene expression data with unsupervised learning has enabled the discovery of previously unknown types of cancer. However, the volume of genomic sequence data is growing exponentially: since 2011, the global annual sequencing capacity has been estimated at quadrillions of bases, and it continues to grow. To cope with this issue, we propose in this paper an implementation of the differential evolution clustering algorithm using the MapReduce methodology in order to deal with big data. The proposed algorithm consists of three consecutive levels. Experiments were conducted on 18 real gene expression data sets. The results show that our approach is effective and competitive with existing algorithms.
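As a rough illustration of how a clustering objective can be phrased in map and reduce steps, the sketch below scores one candidate set of centroids. It is not the paper's implementation: the abstract does not state the fitness function or the content of its three levels, so the sum of squared distances used here is an assumption, the function names are hypothetical, and the "cluster" is simulated in a single Python process.

```python
from collections import defaultdict


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, squared distance to it)."""
    dists = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    nearest = min(range(len(centroids)), key=dists.__getitem__)
    return nearest, dists[nearest]


def reduce_by_cluster(pairs):
    """Reduce step: sum the squared distances emitted for each cluster index."""
    totals = defaultdict(float)
    for cluster_index, squared_distance in pairs:
        totals[cluster_index] += squared_distance
    return dict(totals)


def clustering_fitness(points, centroids):
    """Assumed objective: total within-cluster sum of squared distances."""
    pairs = (map_point(point, centroids) for point in points)
    return sum(reduce_by_cluster(pairs).values())
```

In a real MapReduce deployment the map calls would run on separate data splits and the framework would group the emitted pairs by key before the reduce step; here a generator and a dictionary stand in for both.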
High-throughput sequencing technologies have produced a wealth of epigenetics data. These data sets require stand-alone techniques to extract useful insights that can be used for further analysis. One such technique is data clustering, a primary method for extracting the first layer of information from unlabeled data sets. However, epigenetics data sets are very large, which makes conventional data clustering techniques inappropriate. On the other hand, Swarm Intelligence (SI) algorithms such as Ant Colony Optimization (ACO), Artificial Bee Colony (ABC) and Particle Swarm Optimization (PSO) have shown promising results when applied to data of moderate size. They exhibit different capabilities, which makes their cooperation a promising way to achieve good-quality clustering. In this paper, a parallel and distributed generalized island model (GIM) based on these SI algorithms is developed according to the MapReduce framework. The proposed framework (MRC-GIM) allows cooperation between the three SI algorithms to achieve highly scalable data partitioning. MRC-GIM has been validated on the Amazon Elastic MapReduce (EMR) service, deploying up to 192 computer nodes and 30 gigabytes of data. The experiments show that MRC-GIM competes with and often outperforms
One of the remarkable results of the rapid advances in information technology is the production of tremendous amounts of data, in sets so large or complex that available processing methods, cluster analysis among them, are inadequate. Clustering thus becomes more challenging and complex. In this paper, the authors describe a highly scalable Differential Evolution (DE) algorithm based on the map-reduce programming model. Traditional DE is so time-consuming when clustering large data sets that it is not feasible. Map-reduce, on the other hand, is a recently introduced programming model that supports the design of parallel and distributed approaches. In this paper, a four-stage map-reduce differential evolution algorithm, termed DE-MRC, is presented; each of the four stages is a map-reduce process dedicated to a particular DE operation. DE-MRC has been tested on a real parallel platform of 128 interconnected computers with more than 30 GB of data. Experimental results show the high scalability and robustness of DE-MRC.
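To make the notion of "DE operations" concrete, here is a minimal single-process sketch of classic differential evolution (the DE/rand/1/bin variant) applied to centroid-based clustering. Several things are assumptions rather than details from the abstract: the abstract does not name the four operations or the DE variant, so mutation, crossover, selection and fitness evaluation are used as the textbook set; the sum-of-squared-distances objective, the parameter defaults, and all function names are hypothetical. DE-MRC runs each operation as its own map-reduce process, which this sketch does not attempt to reproduce.

```python
import random


def sse(points, centroids):
    """Assumed fitness: sum of squared distances to the nearest centroid."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids)
        for point in points
    )


def de_cluster(points, k, pop_size=10, generations=50, f=0.5, cr=0.9, seed=0):
    """DE/rand/1/bin over flat vectors that encode k centroids (pop_size >= 4)."""
    rng = random.Random(seed)
    dim = len(points[0])
    length = k * dim

    def decode(individual):
        return [individual[i * dim:(i + 1) * dim] for i in range(k)]

    def fitness(individual):
        return sse(points, decode(individual))

    # Initialisation: each individual is k randomly chosen data points.
    population = [
        [x for point in rng.sample(points, k) for x in point]
        for _ in range(pop_size)
    ]
    scores = [fitness(individual) for individual in population]

    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [
                population[a][d] + f * (population[b][d] - population[c][d])
                for d in range(length)
            ]
            # Binomial crossover: at least one gene comes from the mutant.
            forced = rng.randrange(length)
            trial = [
                mutant[d] if (rng.random() < cr or d == forced)
                else population[i][d]
                for d in range(length)
            ]
            # Selection: keep the trial only if it is no worse.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                population[i], scores[i] = trial, trial_score

    best = min(range(pop_size), key=scores.__getitem__)
    return decode(population[best]), scores[best]
```

Fitness evaluation dominates the cost, since it scans every data point for every trial vector, which is presumably why distributing this work matters for large data sets. A call such as `de_cluster(points, k=2)` on a list of equal-length numeric tuples returns the best centroids found and their score.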