Binding prediction between targets and drug-like compounds with Deep Neural Networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models remains an open issue. In this work, we explored how different cross-validation strategies, applied to data from different molecular databases, affect the performance of binding prediction proteochemometrics models. These strategies are: (1) random splitting, (2) splitting based on K-means clustering (of both actives and inactives), (3) splitting based on the source database, and (4) splitting based on both the clustering and the source database. These schemes are applied to a Deep Learning proteochemometrics model and to a simple logistic regression model used as a baseline. Additionally, two different ways of describing molecules in the model are tested: (1) by their SMILES and (2) by three fingerprints. The classification performance of our Deep Learning-based proteochemometrics model is comparable to the state of the art. Our results show that the apparent lack of generalization of these models stems from a bias in public molecular databases, and that a more restrictive cross-validation scheme based on compound clustering leads to worse but more robust and credible results. Our results also show better performance when molecules are represented by their fingerprints.
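A minimal sketch of what a clustering-based split (strategy 2) looks like in practice, as opposed to a purely random split, is shown below. It assumes RDKit Morgan fingerprints as the molecular representation and scikit-learn for clustering and splitting; the function and variable names are illustrative and are not taken from the paper's actual pipeline.

```python
# Sketch of cluster-based cross-validation splitting for compounds.
# Assumptions: RDKit Morgan fingerprints as features, scikit-learn KMeans
# for clustering, GroupKFold to keep whole clusters in the same fold.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans
from sklearn.model_selection import GroupKFold

def morgan_matrix(smiles_list, radius=2, n_bits=1024):
    """Encode each SMILES string as a Morgan fingerprint bit vector."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.stack(rows)

smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "c1ccncc1", "CCCC"]
X = morgan_matrix(smiles)

# Assign compounds to K-means clusters, then split so that similar
# molecules (same cluster) never appear in both train and test sets.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
splitter = GroupKFold(n_splits=3)
for train_idx, test_idx in splitter.split(X, groups=clusters):
    print("train:", train_idx, "test:", test_idx)
```

Keeping whole clusters together forces the model to be evaluated on chemotypes it has not seen during training, which is why this scheme yields lower but more credible performance estimates than random splitting.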
The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. We propose and implement four novel ways of padding the amino acid sequences. Then, we analyse the impact of the different padding strategies in a hierarchical Enzyme Commission number prediction problem. Results show that padding has an effect on model performance even when convolutional layers are involved. In contrast to most deep learning works, which focus mainly on architectures, this study highlights the relevance of the deemed-of-low-importance process of padding and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.

Since the breakthrough of deep learning (DL) 1, deep neural networks are being successfully applied in computational biology 2,3. This is due to their capacity for automatically extracting meaningful features from raw data 4. DL is particularly useful in the context of biological sequences, such as proteins or RNA, because it can learn directly from the sequence and hence capture nonlinear dependencies and interaction effects. Some examples of applications of DL on biological sequences include prediction of specificities of DNA- and RNA-binding proteins 5, DNA function quantification 6, de novo peptide design 7, detection of conserved DNA fragments 8, prediction of protein-associated GO terms 9, and quantification of the impact of genetic variation on gene regulatory mechanisms 3. The specific DL architectures able to leverage the inner structure of sequential biological data are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CNNs provide translational invariance 10 and can be used to find relevant patterns with biological meaning 5,8,11,12. For their part, bidirectional RNNs (and the derived Long Short-Term Memory and Gated Recurrent Units) are appropriate for modelling biological sequences since they are suited for data with a sequential but non-causal structure, variable length, and long-range dependencies 13-16. Both architectures are usually combined, as in DEEPre 17, where a CNN-RNN model performs a hierarchical classification of enzymes.

Proteins are long linear sequences constituted by amino acid residues attached covalently. These amino acid residues are represented by letters that cannot be directly processed by the mathematical operations used in DL models. Choosing how to digitally encode amino acids is therefore a crucial step, since it can affect the overall performance of the models 18. A comprehensive review and a...
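To make the zero-padding step discussed above concrete, the sketch below encodes amino acid sequences as integers and pads them to a fixed length. The two padding positions shown (post- and pre-padding) and all names are illustrative assumptions; the four novel padding strategies evaluated in the paper are not reproduced here.

```python
# Minimal sketch: integer-encode amino acid sequences and zero-pad them
# to a common length so they form a same-shape input for a CNN/RNN.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AA)}  # 0 is reserved for padding

def encode(seq):
    """Map an amino acid string to a list of integer codes."""
    return [AA_TO_INT[aa] for aa in seq]

def pad(codes, max_len, mode="post"):
    """Pad (or truncate) an encoded sequence to max_len with zeros."""
    codes = codes[:max_len]
    n_pad = max_len - len(codes)
    if mode == "post":           # zeros appended after the sequence
        return codes + [0] * n_pad
    if mode == "pre":            # zeros prepended before the sequence
        return [0] * n_pad + codes
    raise ValueError(f"unknown padding mode: {mode}")

seqs = ["MKTAYIAK", "MGSSHHHHHH", "MV"]
X = np.array([pad(encode(s), max_len=12) for s in seqs])
print(X.shape)  # (3, 12): fixed-shape matrix ready for an embedding layer
print(X)
```

Because convolutional filters slide over the whole padded tensor, where the zeros sit relative to the signal can change which patterns the network sees, which is the kind of effect the study quantifies.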