2022
DOI: 10.26434/chemrxiv-2022-mfq52-v3
Preprint

Exposing the limitations of molecular machine learning with activity cliffs.

Abstract: Machine learning has become a crucial tool in drug discovery and chemistry at large, e.g., to predict molecular properties, such as bioactivity, with high accuracy. However, activity cliffs – pairs of molecules that are highly similar in their structure but exhibit large differences in potency – have been underinvestigated for their effect on model performance. Not only are these edge cases informative for molecule discovery and optimization, but models that are well-equipped to accurately predict the potency …
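
The definition of an activity cliff given in the abstract (high structural similarity, large potency difference) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the preprint: the use of Tanimoto similarity over feature sets, the 0.9 similarity threshold, the 10-fold potency threshold, and the hypothetical toy molecules.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(molecules, sim_threshold=0.9, fold_threshold=10.0):
    """Return name pairs that are structurally similar but differ in potency.

    `molecules` is a list of (name, feature_set, potency) tuples, where
    potency is a positive concentration-like value (e.g. Ki in nM).
    Both thresholds are illustrative assumptions.
    """
    cliffs = []
    for (n1, f1, p1), (n2, f2, p2) in combinations(molecules, 2):
        similar = tanimoto(f1, f2) >= sim_threshold
        fold_difference = max(p1, p2) / min(p1, p2)
        if similar and fold_difference >= fold_threshold:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical toy data: integer IDs stand in for substructure features.
toy_molecules = [
    ("mol_a", frozenset(range(20)), 5.0),
    ("mol_b", frozenset(range(19)) | {99}, 800.0),   # similar to mol_a, far less potent
    ("mol_c", frozenset(range(20)) | {50}, 6.0),     # similar to mol_a, similar potency
    ("mol_d", frozenset(range(100, 120)), 9000.0),   # structurally unrelated
]

cliff_pairs = find_activity_cliffs(toy_molecules)
print(cliff_pairs)
```

In this toy set only `mol_a` and `mol_b` qualify: they share 19 of 21 features and differ 160-fold in potency. In practice the feature sets would come from a cheminformatics toolkit rather than hand-written integers.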

Cited by 9 publications (13 citation statements)
References 71 publications
“…TDC refers to 12 data sets used as regression benchmarks and provided by the Therapeutic Data Commons. 47 ChEMBL refers to 30 data sets curated from the ChEMBL database 62 by van Tilborg et al. 11 The relative RMSE is the average, normalized RMSE obtained from 5-fold cross validation. Details of the fingerprints and descriptors used as molecular representations can be found in Methods.…”
Section: Results (mentioning)
confidence: 99%
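
The "average, normalized RMSE obtained from 5-fold cross validation" mentioned in the statement above can be sketched as follows. The quoted text does not say how the RMSE is normalized, so dividing by the range of the target values is purely an assumption here, as are the least-squares line used as a stand-in regressor and the synthetic data.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for any regressor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, mean_y - a * mean_x


def mean_normalized_rmse(xs, ys, k=5, seed=0):
    """Average over k folds of RMSE divided by the target range (assumed normalizer)."""
    spread = max(ys) - min(ys)
    if spread == 0:
        raise ValueError("targets are constant; range normalization is undefined")
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint held-out sets
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        y_test = [ys[i] for i in test_idx]
        y_pred = [a * xs[i] + b for i in test_idx]
        scores.append(rmse(y_test, y_pred) / spread)
    return sum(scores) / k


# Synthetic data: an exact line, and the same line with alternating +/-1 offsets.
xs = [float(x) for x in range(20)]
ys_exact = [2.0 * x + 1.0 for x in xs]
ys_noisy = [y + (1.0 if i % 2 else -1.0) for i, y in enumerate(ys_exact)]

score_exact = mean_normalized_rmse(xs, ys_exact)
score_noisy = mean_normalized_rmse(xs, ys_noisy)
print(score_exact, score_noisy)
```

Normalizing makes scores comparable across data sets whose targets live on different scales, which is presumably why a relative rather than absolute RMSE is reported across the 12 and 30 data sets.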
“…Three sets of regression tasks were used in this work. Structure−property landscapes related to regression tasks were retrieved from the Therapeutic Data Commons (TDC) 47 using the Python library PyTDC (v. 0.3.6) and from the previous work of van Tilborg et al. 11 A total of 55 regression data sets, split across three groups, were considered.…”
Section: Introduction (mentioning)
confidence: 99%
“…8,9 ACs were rst accurately predicted using support vector machine (SVM) modeling on the basis of special kernel functions enabling compound pair predictions. 9 These ndings have also catalyzed further AC predictions using SVR variants [10][11][12] and other methods, [13][14][15][16][17][18] as discussed below. Recently, various deep neural network architectures have been used to predict ACs from images 14,15 and molecular graphs using representation learning 16 or derive regression models for potency prediction of AC compounds.…”
Section: Introductionmentioning
confidence: 99%
“…Recently, various deep neural network architectures have been used to predict ACs from images 14,15 and molecular graphs using representation learning 16 or derive regression models for potency prediction of AC compounds. 17,18 In this work, we further extend this methodological spectrum by introducing chemical language models for combined AC prediction and generative compound design. Compared to earlier studies predicting ACs using classification models, the approach presented herein was designed to extend AC predictions with the capacity to produce new AC compounds, thus integrating predictive and generative modeling in the context of AC analysis and AC-based compound design.…”
Section: Introduction (mentioning)
confidence: 99%