Machine learning can predict the major regio-, site-, and diastereoselective outcomes of Diels-Alder reactions better than standard quantum-mechanical methods and with accuracies exceeding 90%, provided that i) the diene/dienophile substrates are represented by "physical-organic" descriptors reflecting the electronic and steric characteristics of their substituents and ii) the positions of such substituents relative to the reaction core are encoded ("vectorized") in an informative way.
Applications of machine
learning (ML) to synthetic chemistry rely
on the assumption that large numbers of literature-reported examples
should enable construction of accurate and predictive models of chemical
reactivity. This paper demonstrates that abundance of carefully curated
literature data may be insufficient for this purpose. Using an example
of Suzuki–Miyaura coupling with heterocyclic building blocks—and
a carefully selected database of >10,000 literature examples—we
show that ML models cannot offer any meaningful predictions of optimum
reaction conditions, even if the search space is restricted to only
solvents and bases. This result holds irrespective of the ML model
applied (from simple feed-forward to state-of-the-art graph-convolution
neural networks) or the representation to describe the reaction partners
(various fingerprints, chemical descriptors, latent representations,
etc.). In all cases, the ML methods fail to perform significantly
better than naive assignments based on the sheer frequency of certain
reaction conditions reported in the literature. These unsatisfactory
results likely reflect subjective preferences of various chemists
to use certain protocols, other biasing factors as mundane as availability
of certain solvents/reagents, and/or a lack of negative data. These
findings highlight the likely importance of systematically generating
reliable and standardized data sets for algorithm training.
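
The "naive assignments based on sheer frequency" mentioned above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the (solvent, base) labels, and the toy data are all hypothetical and chosen only to show how such a popularity baseline could be scored with top-k accuracy.

```python
from collections import Counter


def frequency_baseline(train_conditions, k=1):
    """Return the k most frequently reported conditions in the training set."""
    counts = Counter(train_conditions)
    return [cond for cond, _ in counts.most_common(k)]


def top_k_accuracy(predictions, test_conditions):
    """Fraction of test reactions whose reported conditions are among the predictions."""
    if not test_conditions:
        return 0.0
    hits = sum(1 for cond in test_conditions if cond in predictions)
    return hits / len(test_conditions)


# Made-up (solvent, base) labels for illustration only; not data from the study.
train = (
    [("dioxane", "K2CO3")] * 5
    + [("toluene", "K3PO4")] * 3
    + [("DMF", "Cs2CO3")] * 2
)
test = [
    ("dioxane", "K2CO3"),
    ("toluene", "K3PO4"),
    ("dioxane", "K2CO3"),
    ("DMF", "Cs2CO3"),
]

# The baseline ignores the substrates entirely and always proposes
# the most popular conditions seen in training.
top1 = frequency_baseline(train, k=1)
top2 = frequency_baseline(train, k=2)
baseline_top1 = top_k_accuracy(top1, test)
baseline_top2 = top_k_accuracy(top2, test)
```

In a comparison of this kind, a trained model would need to beat `baseline_top1` and `baseline_top2` on the same held-out reactions by a clear margin before its predictions could be called meaningful.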