Data mining (DM) is increasingly used to analyze data generated in the life sciences, including biological data produced in disciplines such as genomics and proteomics, medical data produced in clinical practice, and administrative data produced in health care. The difficulty in mining such data is twofold. First, life sciences data are inherently heterogeneous, ranging from molecular‐level data to clinical and administrative data. Second, they are produced at an increasing rate, and data repositories are becoming very large. As a result, the management and analysis of such data are becoming a major bottleneck in biomedical research. The goal of this paper is to review the main methodologies for mining life sciences data and the ways they are coupled with high‐performance infrastructures and systems to make analysis efficient. The paper recalls basic concepts of DM, grids, and distributed DM on grids, and reviews the main approaches to mining biomedical data on high‐performance infrastructures, with a special focus on the analysis of genomics, proteomics, and interactomics data and on the exploration of magnetic resonance images in the neurosciences. The paper may be of interest both to bioinformaticians, who can learn how to exploit high‐performance infrastructures to mine life sciences data, and to computer scientists, who can address the heterogeneity and high volume of life sciences data at the data management, algorithm, and user interface layers. © 2013 Wiley Periodicals, Inc.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Application Areas > Data Mining Software Tools
Application Areas > Health Care
Technologies > Computer Architectures for Data Mining