This study demonstrates the importance of obtaining statistically stable results when using machine learning methods to predict the activity of antimicrobial peptides, where, due to the cost and complexity of the chemical processes involved, datasets are particularly small (fewer than a few hundred instances). As in other fields with similar problems, this results in large variability in the performance of predictive models, hindering any attempt to transfer them to lab practice. Rather than targeting good peak performance obtained from very particular experimental setups, as reported in related literature, we focused on characterizing the behavior of the machine learning methods, as a preliminary step to obtaining reproducible results across experimental setups and, ultimately, good performance. We propose a methodology that integrates feature learning (autoencoders) and selection methods (genetic algorithms) through the exhaustive use of performance metrics (permutation tests and bootstrapping), which provide stronger statistical evidence to support investment decisions with the lab resources at hand. We show evidence for the usefulness of 1) the extensive use of computational resources, and 2) adopting a wider range of metrics than those reported in the literature to assess method performance. This approach allowed us to guide our search for suitable machine learning methods, and to obtain results comparable to those in the literature with strong statistical stability.
Keywords: antimicrobial peptides; learning curves; machine learning; statistical stability; support vector regression.
* Universidad Industrial de Santander (Bucaramanga-Santander, Colombia). francy.camacho1@correo.uis.edu.co.
** Universidad Industrial de Santander (Bucaramanga-Santander, Colombia). rodrigo.torres@ecopetrol.com.co.
*** Universidad Industrial de Santander (Bucaramanga-Santander, Colombia). rramosp@uis.edu.co.
Assessing the behavior of machine learning methods to predict the activity of antimicrobial peptides
Abstract
This work demonstrates the importance of obtaining statistically stable results when machine learning methods are used to predict the activity of antimicrobial peptides, where, because of the cost and complexity of the chemical processes, datasets are particularly small (fewer than a few hundred instances). As in other fields with similar problems, this produces large variability in the performance of predictive models, which hinders any attempt to transfer them to practice. Therefore, unlike other works that report peak predictive performance obtained in very particular experimental setups, we focus on characterizing the behavior of the machine learning methods, as a step prior to obtaining results that are reproducible, statistically stable and, ultimately, competitive in predictive capacity. For this purpose, a methodology was designed that integrates feature learning...