2018
DOI: 10.26434/chemrxiv.7133885
Preprint

Evaluation of Cross-Validation Strategies in Sequence-Based Binding Prediction Using Deep Learning

Abstract: Binding prediction between targets and drug-like compounds through Deep Neural Networks has generated promising results in recent years, outperforming traditional machine learning-based methods. However, the generalization capability of these classification models is still an issue to be addressed. In this work, we explored how different cross-validation strategies applied to data from different molecular databases affect the performance of binding prediction proteochemometrics models. These strategies are…

Cited by 6 publications (21 citation statements)
References 0 publications
“…With the recent successes of machine-learning-based methods, there has been renewed interest in controlling for the biases in available datasets. [31][32][33][34] Sieg et al 31 report that ML-based methods tend to fit to the initial biases of their training data and report on the importance of domain biases. Chen et al 34 show that there are numerous biases still present in the DUD-E dataset, 35 and that ligand-only models achieve comparable performance to 3D CNNs on DUD-E, despite the lack of receptor to inform the model's predictions.…”
Section: Introduction
confidence: 99%
“…Chen et al 34 show that there are numerous biases still present in the DUD-E dataset, 35 and that ligand-only models achieve comparable performance to 3D CNNs on DUD-E, despite the lack of receptor to inform the model's predictions. Lastly, Lopez-del Rio et al 32 advocate for utilizing clustered cross-validation (CCV) based splits for training, as random splitting is over-optimistic and does not measure the ability of a model to generalize to a new target class, which is highly desirable in a structure-based model.…”
Section: Introduction
confidence: 99%
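
The contrast between random and clustered splits described in the statement above can be illustrated with a minimal sketch. It assumes each sample already carries a cluster label (for example a target family); how those clusters are computed is not stated here, and the function names, the greedy balancing rule, and the toy labels are all hypothetical rather than the cited authors' procedure.

```python
from collections import defaultdict


def clustered_folds(cluster_ids, n_folds):
    """Assign whole clusters to folds so that no cluster is split across folds.

    cluster_ids: one (assumed, precomputed) cluster label per sample.
    Returns a list of n_folds lists of sample indices.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    if n_folds > len(members):
        raise ValueError("need at least as many clusters as folds")

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing (an illustrative choice): place the largest clusters
    # first, each into whichever fold currently holds the fewest samples.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(test)


# Hypothetical toy data: 8 samples whose targets fall into 4 clusters.
cluster_ids = ["kinase", "kinase", "kinase", "gpcr", "gpcr",
               "protease", "protease", "nr"]
folds = clustered_folds(cluster_ids, n_folds=2)
for train, test in train_test_splits(folds):
    held_out = {cluster_ids[i] for i in test}
    seen = {cluster_ids[i] for i in train}
    print("held-out clusters:", sorted(held_out), "| overlap:", held_out & seen)
```

Because every cluster lands in exactly one fold, each test fold contains only clusters the model never saw during training, which is the property a random split does not guarantee.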
“…The analogy between text and proteins, understood as sequences of characters with a meaning, has motivated the application of Natural Language Processing (NLP) techniques to amino acid sequences. Along these lines, machine-learning derived embeddings 23 – 26 and one-hot encoding 7 , 9 , 12 , 14 , 17 , 27 have become very popular. Specifically, the latter method has been widely used in protein-based DL models since neural networks are able to extract features from raw data.…”
Section: Introduction
confidence: 99%
“…The main problem of one-hot encoding is that each protein has a different length, while all input vectors should be of the same size to be fed into the model. To overcome this issue, sequence padding and truncation are usually applied 7 , 9 , 12 , 14 . This means establishing a common length for all proteins and then, truncating longer proteins to that length or filling shorter proteins with an “artificial” character up until that length (see Fig.…”
Section: Introductionmentioning
confidence: 99%
See 1 more Smart Citation