2018
DOI: 10.1101/292177
Preprint

Can AI reproduce observed chemical diversity?

Abstract: Generating diverse molecules with desired chemical properties is important for drug discovery. The use of generative neural networks is promising for this task. To facilitate evaluation of generative models, this paper introduces a metric of internal chemical diversity, and raises the following challenge: can a nontrivial AI model reproduce observed internal diversity for desired molecules? To illustrate this metric, a mini-benchmark is performed with two generative models: a Reinforcement Learning model and t…

Cited by 58 publications (28 citation statements)
References 23 publications
“…Both platforms occasionally generated very small, non-specific structure predictions (for example, a single unspecified amino acid or a single malonyl unit) that did not provide actionable information about the chemical structure of the encoded product; to remove these from consideration, we applied a molecular weight filter to remove structures under 100 Da output by either platform. To evaluate the internal structural diversity of each set of predicted structures, we computed the distribution of pairwise Tcs for each set [45], taking the median pairwise Tc instead of the mean as a summary statistic to ensure robustness against outliers. Structural similarity to known natural products was assessed using the RDKit implementation of the 'natural product-likeness' score [22], and by the median Tc between predicted structures and the known secondary metabolite structures deposited in the NP Atlas database [46].…”
Section: Methods (mentioning)
confidence: 99%
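
The median pairwise Tc summary described in this statement can be sketched in a few lines. This is a minimal illustration and not the citing authors' code: it assumes each fingerprint is represented as a Python set of on-bit indices, and the toy fingerprints and the helper names `tanimoto` and `pairwise_tanimoto` are hypothetical. In practice the fingerprints would come from a cheminformatics toolkit such as RDKit.

```python
from itertools import combinations
from statistics import mean, median


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen here (an assumption): two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def pairwise_tanimoto(fingerprints):
    """Tanimoto coefficient for every unordered pair of fingerprints."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Hypothetical toy fingerprints: three similar structures and one outlier.
fps = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8}),
]

tcs = pairwise_tanimoto(fps)
median_tc = median(tcs)  # summary statistic named in the statement
mean_tc = mean(tcs)      # shown only for comparison with the median
```

On a skewed distribution of pairwise values the two summaries can differ noticeably, which is the stated reason for preferring the median.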
“…• The internal diversity [37], defined as the mean Tanimoto coefficient between all pairs of molecules generated by the model. Extended connectivity fingerprints [64] with a diameter of 3 and a length of 1,024 bits were used as input to the calculation of the Tanimoto coefficient.…”
Section: Methods (mentioning)
confidence: 99%
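
The definition quoted in this statement can be written out as follows. The symbols are assumptions introduced here for illustration: $A$ and $B$ are the sets of on bits of two binary fingerprints, $f_1, \dots, f_n$ are the fingerprints of the $n$ generated molecules, and $\bar{T}$ is the mean over all unordered pairs.

```latex
% Tanimoto coefficient of two binary fingerprints with on-bit sets A and B
T(A, B) = \frac{|A \cap B|}{|A \cup B|}

% Mean Tanimoto coefficient over all unordered pairs of n generated molecules
\bar{T} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} T(f_i, f_j)
```

As written, $\bar{T}$ is a mean similarity, so lower values indicate a more diverse set. Some formulations instead report the complement, $1 - \bar{T}$, so that larger values mean more diversity; which convention a given paper uses should be checked against its own text.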
“…To accomplish this goal, we calculated a suite of 23 different metrics that have previously been proposed for the evaluation of generative models of molecules [18, 23, 24, 34, 37-39]. In addition to the proportion of valid SMILES strings, we also computed the proportions of unique and novel molecules generated by the model (Supplementary Fig.…”
Section: Deep Generative Models Learn From Limited Training Data (mentioning)
confidence: 99%
“…We then quantified the similarity of the distributions observed for generated molecules and the training set using the Jensen-Shannon divergence. To specifically assess the diversity of the generated molecules, we calculated the mean Tanimoto coefficient between random pairs of generated molecules, or random pairs of generated and training set molecules, to obtain the internal and external diversities, respectively [37]. Finally, we computed the Fréchet ChemNet distance [38], a metric based on the predicted biological activities of the generated molecules that was developed specifically for the evaluation of chemical generative models.…”
Section: Deep Generative Models Learn From Limited Training Data (mentioning)
confidence: 99%
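
The Jensen-Shannon divergence mentioned in this statement compares two probability distributions, for example histograms of a molecular property for generated and training molecules. The sketch below is a minimal illustration under stated assumptions: the statement does not say how the distributions were binned or which logarithm base was used, so the base-2 logarithm, the helper names, and the toy histograms are all hypothetical choices made here.

```python
from math import log2


def normalize(counts):
    """Turn non-negative histogram counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    # The mixture m is positive wherever p or q is, so the KL terms are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical binned property counts for generated and training molecules.
generated = normalize([10, 30, 40, 20])
training = normalize([15, 25, 35, 25])

jsd = jensen_shannon_divergence(generated, training)
```

With base-2 logarithms, identical distributions give 0 and distributions with no overlapping bins give 1. Note that some libraries return the square root of this quantity (the Jensen-Shannon distance) instead of the divergence itself.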