Time-split cross-validation is broadly recognized as the gold standard for validating predictive models intended for use in medicinal chemistry projects. Unfortunately, this type of data is not broadly available outside of large pharmaceutical research organizations. Here we introduce the SIMPD (simulated medicinal chemistry project data) algorithm to split public data sets into training and test sets that mimic the differences observed in real-world medicinal chemistry project data sets. SIMPD uses a multi-objective genetic algorithm with objectives derived from an extensive analysis of the differences between early and late compounds in more than 130 lead-optimization projects run within the Novartis Institutes for BioMedical Research. Applying SIMPD to the real-world data sets produced training/test splits that reflect the differences in properties and machine-learning performance observed for temporal splits more accurately than other standard approaches such as random or neighbor splits. We applied the SIMPD algorithm to bioactivity data extracted from ChEMBL and created 56 public data sets that can be used for validating machine-learning models intended for use in the setting of a medicinal chemistry project. The SIMPD code and simulated data sets are available under open-source/open-data licenses at github.com/rinikerlab/molecular_time_series.
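To make the contrast between a temporal split and a random split concrete, the following is a minimal sketch in plain Python. It is not the SIMPD algorithm, which uses a multi-objective genetic algorithm; it only illustrates the time-split idea that SIMPD aims to mimic. The function names `time_split` and `random_split`, the 20% test fraction, and the example dates are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def time_split(dates: Sequence[int], test_fraction: float = 0.2) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with the most recent compounds held out.

    `dates` is assumed to hold one sortable time stamp per compound,
    for example days since the start of the project.
    """
    n_test = max(1, round(len(dates) * test_fraction))
    # Order compound indices from earliest to latest.
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    return order[:-n_test], order[-n_test:]


def random_split(n: int, test_fraction: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) chosen uniformly at random, ignoring time."""
    n_test = max(1, round(n * test_fraction))
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:-n_test], idx[-n_test:]


# Hypothetical time stamps for ten compounds (illustrative values only).
dates = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]

train_idx, test_idx = time_split(dates)
rand_train_idx, rand_test_idx = random_split(len(dates))

print("time split test set:", test_idx)
print("random split test set:", rand_test_idx)
```

In the time split every test compound is later than every training compound, so a model is always evaluated on compounds made after the ones it was trained on. A random split gives no such guarantee, which is one reason it tends to yield more optimistic performance estimates.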
Machine-learning and deep-learning models have been extensively
used in cheminformatics to predict molecular properties, to reduce
the need for direct measurements, and to accelerate compound prioritization.
However, differing setups and frameworks, together with the large number of molecular
representations, make it difficult to properly evaluate, reproduce,
and compare these models. Here we present a new PREdictive modeling FramEwoRk
for molecular discovery (PREFER), written in Python (version 3.7.7)
and based on AutoSklearn (version 0.14.7), that allows comparison
between different molecular representations and common machine-learning
models. We provide an overview of the design of our framework and
show exemplary use cases and results of several representation–model
combinations on diverse data sets, both public and in-house. Finally,
we discuss the use of PREFER on small data sets. The code of the framework
is freely available on GitHub.