The use of raw amino acid sequences as input for deep learning models for protein function prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape inputs. To accomplish this, zeros are usually added to each sequence up to an established common length, in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure remains unknown. We propose and implement four novel ways of padding amino acid sequences, and analyse their impact in a hierarchical Enzyme Commission number prediction problem. Results show that padding has an effect on model performance even when convolutional layers are involved. In contrast to most deep learning works, which focus mainly on architectures, this study highlights the relevance of the seemingly unimportant process of padding and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.

Since the breakthrough of deep learning (DL) [1], deep neural networks have been successfully applied in computational biology [2,3], owing to their capacity to automatically extract meaningful features from raw data [4]. DL is particularly useful in the context of biological sequences, such as proteins or RNA, because it can learn directly from the sequence and hence capture nonlinear dependencies and interaction effects. Examples of DL applications to biological sequences include prediction of the specificities of DNA- and RNA-binding proteins [5], DNA function quantification [6], de novo peptide design [7], detection of conserved DNA fragments [8], prediction of protein-associated GO terms [9], and quantification of the impact of genetic variation on gene regulatory mechanisms [3].

The specific DL architectures able to leverage the inner structure of sequential biological data are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs provide translational invariance [10] and can be used to find relevant patterns with biological meaning [5,8,11,12]. For their part, bidirectional RNNs (and the derived Long Short-Term Memory and Gated Recurrent Unit networks) are appropriate for modelling biological sequences, since they are suited to data with a sequential but non-causal structure, variable length, and long-range dependencies [13–16]. Both architectures are usually combined, as in DEEPre [17], where a CNN-RNN model performs a hierarchical classification of enzymes.

Proteins are long linear sequences of amino acid residues attached covalently. These residues are represented by letters, which cannot be directly processed by the mathematical operations used in DL models. Choosing how to digitally encode amino acids is therefore a crucial step, since it can affect the overall performance of the models [18]. A comprehensive review and a...
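As an illustration of the encoding step discussed above, the sketch below one-hot encodes an amino acid string into a fixed-size matrix. It is a minimal example under stated assumptions, not the encoding used in this work: the alphabet of 20 standard residues, the helper name `one_hot_encode`, and the convention of leaving padded positions as all-zero rows are choices made for the illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, max_len):
    """One-hot encode an amino acid string into a (max_len, 20) matrix.

    Each row corresponds to one sequence position; positions beyond the
    sequence end are left as all-zero rows, so padding and real residues
    never share a representation.
    """
    matrix = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len]):
        matrix[pos, AA_INDEX[aa]] = 1.0
    return matrix

x = one_hot_encode("MKT", max_len=5)
print(x.shape)        # (5, 20)
print(x[0].argmax())  # 10 -> index of 'M' in AMINO_ACIDS
```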
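The zero-padding described in the abstract can likewise be sketched in a few lines. The integer encoding, the `zero_pad` helper, and the use of 0 as the padding value are illustrative assumptions rather than the paper's exact implementation; equivalent functionality is provided, for instance, by Keras' `pad_sequences`, whose `padding` argument switches between pre- and post-padding.

```python
import numpy as np

# Toy integer encoding: map the 20 standard amino acids to 1..20,
# reserving 0 as the padding value so padded positions stay distinguishable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence):
    """Turn an amino acid string into a list of integer codes."""
    return [AA_TO_INT[aa] for aa in sequence]

def zero_pad(sequences, max_len, mode="post"):
    """Pad integer-encoded sequences with zeros to a common length.

    mode="post" appends zeros after the sequence; mode="pre" prepends
    them. Sequences longer than max_len are truncated at the end.
    """
    batch = np.zeros((len(sequences), max_len), dtype=np.int32)
    for i, seq in enumerate(sequences):
        seq = seq[:max_len]
        if mode == "post":
            batch[i, :len(seq)] = seq
        else:  # "pre"
            batch[i, -len(seq):] = seq
    return batch

proteins = ["MKT", "MKTAYIAKQR"]
padded = zero_pad([encode(p) for p in proteins], max_len=8)
print(padded)
# [[11  9 17  0  0  0  0  0]
#  [11  9 17  1 20  8  1  9]]
```

The `mode` argument hints at how different padding strategies arise: the same residues are kept, but their alignment within the fixed-length tensor changes, which is precisely the design choice whose effect on performance this study investigates.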
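Finally, to make the CNN-RNN combination concrete, here is a minimal Keras sketch of a hybrid sequence classifier in the spirit of, but not identical to, architectures such as DEEPre. Every dimension and hyperparameter below is a placeholder assumption, including the 7 output classes standing in for the first level of the EC hierarchy.

```python
from tensorflow.keras import layers, models

# Hypothetical dimensions: 20 amino acids plus a padding token,
# sequences zero-padded to length 500, and 7 placeholder output classes.
VOCAB_SIZE, MAX_LEN, N_CLASSES = 21, 500, 7

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # Dense embedding of the integer-encoded, zero-padded sequence
    layers.Embedding(VOCAB_SIZE, 32),
    # Convolution detects local motifs regardless of their position
    layers.Conv1D(64, kernel_size=9, activation="relu"),
    layers.MaxPooling1D(pool_size=4),
    # Bidirectional LSTM models long-range, non-causal dependencies
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```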