2020
DOI: 10.1101/2020.05.21.107748
Preprint
Adding stochastic negative examples into machine learning improves molecular bioactivity prediction

Abstract: Multitask deep neural networks learn to predict ligand-target binding by example, yet public pharmacological datasets are sparse, imbalanced, and approximate. We constructed two hold-out benchmarks to approximate temporal and drug-screening test scenarios whose characteristics differ from a random split of conventional training datasets. We developed a pharmacological dataset augmentation procedure, Stochastic Negative Addition (SNA), that randomly assigns untested molecule-target pairs as transient negative examples…
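The procedure named in the abstract can be sketched in a few lines. The function below is an illustrative reading of "randomly assigns untested molecule-target pairs as transient negative examples", not the authors' implementation; the function name, sampling scheme, and resample-per-epoch detail are all assumptions.

```python
import numpy as np

def stochastic_negative_addition(known_pairs, molecules, targets, n_add, rng=None):
    """Illustrative sketch of SNA: sample untested (molecule, target) pairs
    uniformly at random and label them as transient negatives.
    `known_pairs` is a set of (molecule, target) tuples with measured
    activity; any pair outside it is treated as untested."""
    rng = rng or np.random.default_rng(0)
    negatives = set()
    while len(negatives) < n_add:
        m = molecules[rng.integers(len(molecules))]
        t = targets[rng.integers(len(targets))]
        # Only untested pairs are eligible; collisions are resampled.
        if (m, t) not in known_pairs and (m, t) not in negatives:
            negatives.add((m, t))
    # Label 0 (inactive); "transient" because a fresh sample would
    # typically be drawn at each training epoch.
    return [(m, t, 0) for (m, t) in negatives]
```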

Cited by 5 publications (6 citation statements)
References 44 publications (28 reference statements)
“…We used cross-validation as a comparative procedure to compare the performance of the CATS and Morgan fingerprints, both of which incurred the same amount of data loss, and we did not attempt to estimate the true prospective performance rate. Furthermore, optimal performance is reached when subsampling the negative class (as we did) rather than using as much negative data as available. 29,30 …”
Section: Discussion
confidence: 92%
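The negative-class subsampling this statement refers to is, in its general form, random undersampling of the majority class. A minimal sketch under that assumption (the function and array names are illustrative, not from the cited work):

```python
import numpy as np

def undersample_negatives(X, y, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives so that
    n_neg is about ratio * n_pos, rather than using every available
    negative. X and y are NumPy arrays; y holds 0/1 labels."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```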
“…Furthermore, optimal performance is reached when subsampling the negative class (as we did) rather than using as much negative data as available. 29,30 Monte Carlo cross-validation differs from k-fold crossvalidation in that the test set in each case is sampled at random (without replacement) from the entire data set, as opposed to being randomly partitioned once. In the limit of infinite evaluations, this guarantees that all instances will eventually be evaluated, and with different combinations of the test data set, which reduces variance compared with k-fold cross-validation.…”
Section: Discussion
confidence: 99%
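Monte Carlo cross-validation as characterized here (a test set drawn at random for each evaluation, rather than one fixed partition) matches the behavior of scikit-learn's ShuffleSplit. A brief sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Monte Carlo CV: each split draws a fresh random test set, so an
# instance can appear in several test sets across splits, unlike the
# single disjoint partition used by k-fold CV. More splits reduce the
# variance of the performance estimate.
mc_cv = ShuffleSplit(n_splits=25, test_size=0.2, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=mc_cv)
print(scores.mean(), scores.std())
```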
“…In contrast, long FPs cause compute performance and storage issues and may produce overfitted ML models, a problem known as the curse of dimensionality. 84 Several authors explored different MFP and IFP lengths, 7, 59, 60, 71, 85–95 however there is no single consensus on length, as results vary with the problem and data sets.…”
Section: Discussion
confidence: 99%
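The fingerprint-length trade-off this statement summarizes can be made concrete with RDKit's Morgan bit vectors, whose length is set by nBits. The molecule and lengths below are arbitrary illustrative choices, not values from the cited studies:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, arbitrary example

# Same radius, different lengths: longer vectors reduce bit collisions
# but cost more storage and can encourage overfitting on small datasets.
for n_bits in (512, 1024, 2048, 4096):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    print(n_bits, fp.GetNumOnBits())
```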
“…88%) of the feasible chemical space is explored in 33% of the computational time (Table 1). However, the overhead of up-front data generation is overshadowed by the concern of transferability.…”
confidence: 99%