Data mining is one of the main activities in bioinformatics, where it is used to extract knowledge from massive data sets such as gene expression measurements, copy number variation (CNV) data, DNA strings, and others. A wide array of methods is used for this task, ranging from well-established parametric statistical analysis, to non-parametric techniques, to classification methods developed in knowledge engineering and artificial intelligence. We consider a method for extracting logic formulas from data that relies on a large body of literature in integer and logic optimization. The method was originally presented in [1] and has been widely and successfully applied to different problems in bioinformatics ([2], [3], [4], [5], [6]). It is based on the iterative solution of Minimum Cost SAT problems and extracts logic formulas in disjunctive normal form (DNF) whose features make them easy to interpret. Leaving the discussion of the main features and motivations of this approach to the related literature, in this talk we focus on efficiently solving very large-scale instances of this well-known logic programming problem. We propose a new GRASP approach that exploits the specific structure of the problem and substantially outperforms other established solvers for the same problem.
References
[1] G. Felici, K. Truemper. A Minsat Approach for Learning in Logic Domains, INFORMS Journal on Computing 14(1): 20-36 (2002).
[2] P. Bertolazzi, G. Felici, E. Weitschek. Learning to classify species with barcodes, BMC Bioinformatics, 10:1-12 (2009).
[3] M. Arisi, R. D'Onofrio, A. Brandi, S. Felsani, G. Capsoni, G. Drovandi, G. Felici, E. Weitschek, P. Bertolazzi, A. Cattaneo. Gene Expression Biomarkers in the Brain of a Mouse Model for Alzheimer's Disease: Mining of Microarray Data by Logic Classification and Feature Selection, Journal of Alzheimer's Disease, 24(4): 721-738 (2011).
[4] E. Weitschek, A. Lo Presti, G. Drovandi, G. Felici, M. Ciccozzi, M. Ciotti, P. Bertolazzi. Human polyomaviruses identification by logic mining techniques. BMC Virology Journal, 9:58 (2012).
[5] E. Weitschek, G. Fiscon, G. Felici. Supervised DNA Barcodes species classification: analysis, comparisons and results, BMC BioData Mining, 7:4 (2014).
[6] P. Bertolazzi, G. Felici, P. Festa, G. Fiscon, E. Weitschek. Integer Programming models for Feature Selection: new extensions and a randomized solution algorithm, European Journal of Operational Research, 250: 389-399 (2016).