“…For all models, we optimized the learning rate (lr), lr = [5×10⁻⁴, 5×10⁻⁵, or 5×10⁻⁶]. The following hyperparameters were optimized: (a) GCN: hidden atom features (h_a), number of convolutional layers (n_c), hidden multiset transformer nodes (h_t), hidden predictor features (h_p), with h_a = [32, 64, 128, 256, 512], n_c = [1, 2, 3, 4, 5], [32, 64, 128, 256]. All models were trained for 300 epochs, using early stopping with a patience of 10 epochs.…”
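The search procedure described above can be sketched as a plain grid search with patience-based early stopping. This is a minimal illustration, not the authors' code: the helper names (`early_stopping_train`) and the synthetic validation-loss input are hypothetical, and only the learning-rate, h_a, and n_c grids are taken from the text.

```python
import itertools

def early_stopping_train(val_losses, max_epochs=300, patience=10):
    """Simulate training with early stopping: stop once the validation
    loss has not improved for `patience` consecutive epochs.
    Returns (best_val_loss, epoch_at_which_training_stopped)."""
    best = float("inf")
    wait = 0  # epochs since last improvement
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return best, epoch
    return best, min(len(val_losses), max_epochs) - 1

# Grids from the text (the h_t / h_p values are omitted here, since the
# excerpt does not state which parameter the remaining list belongs to).
learning_rates = [5e-4, 5e-5, 5e-6]
hidden_atom_features = [32, 64, 128, 256, 512]
num_conv_layers = [1, 2, 3, 4, 5]

grid = list(itertools.product(learning_rates,
                              hidden_atom_features,
                              num_conv_layers))
print(len(grid))  # number of GCN configurations over these three grids
```

In a real run, `val_losses` would be produced epoch by epoch by the model's training loop, and the configuration with the lowest best validation loss across the grid would be selected.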