2022
DOI: 10.26434/chemrxiv-2022-0hl5p-v2
Preprint
SolvBERT for solvation free energy and solubility prediction: a demonstration of an NLP model for predicting the properties of molecular complexes

Abstract: Deep learning models based on NLP, mainly the Transformer family, have been successfully applied to solve many chemistry-related problems, but their applications are mostly limited to chemical reactions. Meanwhile, solvation is an important concept in physical and organic chemistry, describing the interaction of solutes and solvents. This interaction leads to a solvation complex, a molecular complex similar to a reactant-reagent complex. In this study, we introduced the SolvBERT model, which reads the solute a…
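The abstract describes the model's input as a solvation complex, i.e., a solute-solvent pair read as text. A minimal sketch of that representation, assuming the complex is written as a single dot-separated SMILES string (consistent with the "Figure 1 SMILES Representation of Solvation Complexes" section cited below); the helper name is an illustrative assumption, not taken from the SolvBERT code:

```python
# Sketch of the input format implied by the abstract: solute and solvent
# SMILES joined with ".", the SMILES separator for disconnected components,
# so the solvation complex becomes a single string a text model can read.
def solvation_complex_smiles(solute_smiles: str, solvent_smiles: str) -> str:
    """Encode a solute-solvent pair as one dot-separated SMILES string."""
    return f"{solute_smiles}.{solvent_smiles}"

# Ethanol (solute) in water (solvent):
print(solvation_complex_smiles("CCO", "O"))  # -> "CCO.O"
```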

Cited by 5 publications (11 citation statements) | References 27 publications
“…The code supporting the findings of this study has been deposited at figshare 43 (DOI: 10.6084/m9.figshare.21269853) and will be made publicly accessible once the paper is accepted. All code required for SolvBERT and TMAP, as well as for repeating the data preprocessing, is included in the "solv-bert" folder.…”
Section: Figure 1 SMILES Representation of Solvation Complexes
Citation type: mentioning (confidence: 97%)
“…The authors declare that the main data supporting the findings of this study are available within the article. All the supporting data have been deposited at figshare 43 (DOI: 10.6084/m9.figshare.21269853) and will be made publicly accessible once the paper is accepted. The supporting data are in the "data" folder, while the data used for training are placed in the "training_files" folder.…”
Section: Figure 1 SMILES Representation of Solvation Complexes
Citation type: mentioning (confidence: 99%)
“…A new masked language model (MLM) objective is employed so that deep bidirectional language representations can be learned. In this work, we use the BERT model architecture rxnfp built by Schwaller et al., 45 which has been adapted for chemical reaction yield prediction 46 and molecular property prediction 32 .…”
Section: Bidirectional Encoder Representations from Transformers (BERT)
Citation type: mentioning (confidence: 99%)
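To make the masking objective in the statement above concrete, here is a toy sketch of BERT-style MLM data preparation on SMILES: roughly 15% of tokens are hidden behind [MASK] and the model is trained to recover them. The tokenizer is a simplified variant of the regex commonly used for SMILES (e.g., in Schwaller et al.'s rxnfp work); the mask_smiles helper is an illustrative assumption, not the actual rxnfp code.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# stereo markers, ring-bond labels, organic-subset atoms, then bonds,
# branches, charges, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@?|%\d{2}|[BCNOPSFIbcnops]|[=#\\/().+\-:~\d])"
)

def mask_smiles(smiles: str, mask_rate: float = 0.15, seed: int = 0):
    """Return (masked tokens, labels); labels hold only the hidden tokens."""
    rng = random.Random(seed)
    tokens = SMILES_TOKEN.findall(smiles)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")  # model must predict this position
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)  # position is not predicted
    return masked, labels

tokens, labels = mask_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)
```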
“…First, although recognized public small-molecule databases such as PubChem 26 , ChEMBL 27,28 , DSSTox 29 , MoleculeNet 30 , and ZINC 31 have adopted canonical SMILES as one of the representations of molecular 2D structure, there is no such general molecular representation for metalloporphyrins, which makes the data more difficult to index, merge, read, and process. Furthermore, although one of our previous works showed the possibility of using deep learning models to predict the properties of molecular complexes, such as solute-solvent pairs 32 , metalloporphyrins have significantly larger structures than "drug-like" small-molecule complexes and contain additional inorganic components (i.e., central metal ions), which may increase the difficulty for deep learning.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
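As an aside on the canonical SMILES point in the statement above: canonicalization is what lets heterogeneous database records be indexed and merged on a single key. A minimal sketch using RDKit (an assumption here; each cited database runs its own canonicalization pipeline):

```python
# Different author-written SMILES for the same molecule collapse to one
# canonical string, so records from different sources can be matched on it.
from rdkit import Chem

variants = ["OCC", "C(O)C", "CCO"]  # three spellings of ethanol
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)  # {'CCO'} -- one key per molecule, usable as a merge index
```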
“…[36][37][38] For more rapid and accurate solubility predictions, various predictive models have been actively developed by analyzing quantitative structure-property relationships (QSPR) 34,[39][40][41][42][43][44] or adopting machine learning (ML) techniques. 34,42,[45][46][47][48][49][50][51][52][53][54][55] In particular, current advanced ML models have used graph neural networks (GNNs) combined with interaction layers, 47,53 recurrent neural networks with attention layers, 45 and natural language processing-based transformers. 54 These models achieved accuracies close to experimental uncertainties.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)