2021
DOI: 10.21203/rs.3.rs-745668/v1
Preprint

Extracting Predictive Representations from Hundreds of Millions of Molecules

Abstract: Although deep learning can automatically extract features in relatively simple tasks such as image analysis, the construction of appropriate representations remains essential for molecular prediction because of intricate molecular complexity. Additionally, generating labeled data for supervised learning in the molecular sciences is often expensive, time-consuming, and ethically constrained, leading to small, diverse, and therefore challenging datasets. In this work, we develop a self-supervised learning approach via a masking…
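The abstract describes self-supervised pre-training via masking. A minimal sketch of the general idea, BERT-style masked-token recovery on SMILES strings, is shown below. The character-level tokenizer, the `[MASK]` symbol, and the mask rate are illustrative assumptions, not details from the paper:

```python
import random

MASK = "[MASK]"

def tokenize_smiles(smiles):
    """Naive character-level tokenization of a SMILES string.
    (Real pipelines use regex tokenizers that keep multi-character
    atoms such as 'Cl' and 'Br' together.)"""
    return list(smiles)

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Hide a random fraction of tokens and record the originals as
    prediction targets for the self-supervised objective."""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must recover this token
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

# Aspirin's SMILES as a toy input; a model would be trained to
# predict the hidden tokens from the surrounding context.
tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
masked, targets = mask_tokens(tokens, mask_rate=0.3)
```

Because the objective needs no labels, it can be run over very large unlabeled molecule collections before fine-tuning on a small labeled task.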

Cited by 7 publications (11 citation statements)
References 23 publications
“…The reason behind this phenomenon is that some atoms of the interacting system (the parts that are in contact with or close to each other) play a central role in the properties concerned, and the attention mechanism of the transformer is likely to help capture these important features. Moreover, owing to the good parallelism of transformer models, the self-supervised learning strategy has been widely used to pre-train them, as in work 46 in biochemistry and work 47 in biology. Third, in our proposed pressure-adaptive mechanism, we introduce an adapted tensor to balance the contribution weights of global and local features at different pressures.…”
Section: Discussion
confidence: 99%
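The quoted passage credits the transformer's attention mechanism with weighting the important (e.g. interface) atoms. A minimal NumPy sketch of scaled dot-product attention, a toy illustration rather than the cited model's architecture, shows how each token's output becomes a convex combination over all tokens, so a contact atom can attend strongly to its partner:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: rows of the weight matrix are softmax
    distributions, so each output is a weighted average of all values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 toy "atom" embeddings of width 8
out, w = scaled_dot_product_attention(X, X, X)
```

Because every row of `w` computes independently, the whole matrix product parallelizes well, which is the property the quoted passage calls out as enabling large-scale pre-training.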
“…Inspired by Chen et al, an experiment is designed to explore the relationship between drug features learned for a specific target (such as DDIE) and molecular properties with general semantics. It is known that 21 targets of the Directory of Useful Decoys (DUD) 54,55 and 17 targets of the Maximum Unbiased Validation (MUV) datasets 55,56 are commonly used in the drug discovery community. Therefore, we generate a new dataset, namely DUD&MUV1847, by matching the targets and ligands between DUD&MUV and the D571M dataset used in our paper.…”
Section: Case Study on Multiclassification Tasks
confidence: 99%
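The matching step described above can be sketched as an intersection over molecule identifiers. The entries below are hypothetical toy SMILES, not records from the actual DUD, MUV, or D571M data; in practice one would canonicalize each SMILES first (e.g. with RDKit's `Chem.MolToSmiles`) so that different spellings of the same molecule still match:

```python
def match_ligands(dataset_a, dataset_b):
    """Return the ligands shared between two datasets, comparing by
    identifier string. Assumes both datasets use the same canonical
    SMILES form; real matching would canonicalize first."""
    shared = set(dataset_a) & set(dataset_b)
    return sorted(shared)

# Hypothetical toy entries for illustration only:
a = ["CCO", "c1ccccc1", "CC(=O)O"]
b = ["CC(=O)O", "CCN", "c1ccccc1"]
print(match_ligands(a, b))  # ['CC(=O)O', 'c1ccccc1']
```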
“…14,21−24 The FreeSolv database 7 has been benchmarked against several ML models. 17,25,26 However, the size of this data set is several orders of magnitude smaller than many of the training data sets used in deep learning: a minimum of several thousand training points is recommended for models such as GNNs to avoid overfitting. 27 As a test set, the small size of FreeSolv may make it difficult to distinguish between the predictive accuracy of differing models, and several models have now approached test errors similar to the experimental uncertainty of 0.6 kcal mol−1.…”
Section: Introduction
confidence: 99%
“…27 As a test set, the small size of FreeSolv may make it difficult to distinguish between the predictive accuracy of differing models, and several models have now approached test errors similar to the experimental uncertainty of 0.6 kcal mol−1. 22,25,26 Ward et al 28 circumvented the problem of data scarcity by calculating B3LYP/6-31G(2df,p) solvation Gibbs free energies using the SMD continuum model 29 for molecules in the QM9 data set and trained a GNN to predict these values with an average error of 0.5 kcal mol−1 on the test data. Similarly, Zhang et al 22 computed aqueous solvation Gibbs free energies for over 100,000 organic compounds using SMD and trained a GNN to predict these values with an error of 0.4 kcal mol−1.…”
Section: Introduction
confidence: 99%
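The point made in the quoted passages, that once a model's test error approaches the ~0.6 kcal mol−1 experimental uncertainty the test set can no longer discriminate between models, can be illustrated with a toy error calculation. The free-energy values below are hypothetical, not FreeSolv data:

```python
import math

def rmse(pred, true):
    """Root-mean-square error between predictions and reference values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# Hypothetical hydration free energies in kcal/mol (not real FreeSolv data):
true = [-3.8, -5.1, -0.9, -7.2]
pred = [-3.5, -5.6, -1.1, -6.8]
err = rmse(pred, true)

EXP_UNCERTAINTY = 0.6  # kcal/mol: the experimental floor quoted in the text
# Below this floor, differences between models are within the noise of the
# reference measurements themselves.
print(err < EXP_UNCERTAINTY)  # True
```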