Conspectus
Machine-readable chemical structure representations are foundational
in all attempts to harness machine learning for the prediction of
reactivities, selectivities, and chemical properties directly from
molecular structure. The featurization of discrete chemical structures
into a continuous vector space is a critical phase undertaken before
model selection, and the development of new ways to quantitatively
encode molecules is an active area of research. In this Account, we
highlight the application and suitability of different representations,
from expert-guided “engineered” descriptors to automatically
“learned” features, in different prediction tasks relevant
to organic and organometallic chemistry, where differing amounts of
training data are available. These tasks include statistical models
of stereo- and enantioselectivity, thermochemistry, and kinetics developed
using experimental and quantum chemical data.
The use of expert-guided molecular descriptors provides an opportunity
to incorporate chemical knowledge, domain expertise, and physical
constraints into statistical modeling. In applications to stereoselective
organic and organometallic catalysis, where data sets may be relatively
small and 3D geometries and conformations play an important role,
mechanistically informed features can be used successfully to obtain
predictive statistical models that are also chemically interpretable.
We provide an overview of several recent applications of this approach
to obtain quantitative models for reactivity and selectivity, where
topological descriptors, quantum mechanical calculations of electronic
and steric properties, and conformational ensembles all feature
as essential ingredients of the molecular representations used.
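One way a conformational ensemble can enter a descriptor is through population weighting. The sketch below is illustrative only: it assumes Boltzmann averaging at 298.15 K, and the function names, the three-conformer energies, and the steric values are hypothetical rather than taken from the work summarized here.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_weights(rel_energies_kcal, temperature=298.15):
    """Return normalized Boltzmann populations for conformer energies in kcal/mol."""
    e_min = min(rel_energies_kcal)  # shift so the lowest conformer has energy zero
    factors = [
        math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in rel_energies_kcal
    ]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_descriptor(values, rel_energies_kcal, temperature=298.15):
    """Population-weighted average of a per-conformer descriptor."""
    weights = boltzmann_weights(rel_energies_kcal, temperature)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical three-conformer ensemble (made-up numbers for illustration)
energies = [0.0, 0.5, 2.0]   # relative energies, kcal/mol
steric = [6.1, 6.8, 7.9]     # a per-conformer steric descriptor, arbitrary units
weights = boltzmann_weights(energies)
weighted = ensemble_descriptor(steric, energies)
```

The weighted value always lies between the smallest and largest per-conformer values; other choices, such as taking the minimum, the maximum, or the value of the lowest-energy conformer, are equally simple to substitute.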
Alternatively, more flexible, general-purpose molecular representations
such as attributed molecular graphs can be used with machine learning
approaches to learn the complex relationship between a structure and
its prediction target. This approach has the potential to outperform
more traditional representation methods such as “hand-crafted”
molecular descriptors, particularly as data set sizes grow. One area
where this is particularly relevant is in the use of large sets of
quantum mechanical data to train quantitative structure–property
relationships. A general approach toward curating useful data sets
and training highly accurate graph neural network models is discussed
in the context of organic bond dissociation enthalpies, where this
strategy outperforms regression using precomputed descriptors.
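To make the idea of an attributed molecular graph concrete, the following minimal sketch stores per-atom and per-bond feature vectors and performs one round of unweighted sum aggregation over neighbors. It is an assumption-laden toy: the `MolGraph` class, the feature choices, and the `message_pass` function are hypothetical, and a real graph neural network would use learned weights and would also feed the bond features into the messages.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    atom_features: list                        # one feature vector per atom
    bonds: dict = field(default_factory=dict)  # (i, j) -> bond feature vector

    def neighbors(self, i):
        """Indices of atoms bonded to atom i."""
        out = []
        for a, b in self.bonds:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return out


def message_pass(graph):
    """One round of sum aggregation: each atom adds its neighbors' vectors to its own."""
    updated = []
    for i, own in enumerate(graph.atom_features):
        agg = list(own)
        for j in graph.neighbors(i):
            agg = [x + y for x, y in zip(agg, graph.atom_features[j])]
        updated.append(agg)
    return updated


# Ethanol heavy atoms C-C-O; atom features = [atomic number, attached hydrogens]
ethanol = MolGraph(
    atom_features=[[6, 3], [6, 2], [8, 1]],
    bonds={(0, 1): [1.0], (1, 2): [1.0]},  # bond order stored, unused in this sketch
)
hidden = message_pass(ethanol)
```

After one round, each atom's vector already reflects its immediate bonding environment; stacking further rounds extends that receptive field by one bond each time.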
Finally, we describe how graph neural network predictions can be
incorporated into mechanistically informed statistical models of chemical
reactivity and selectivity. Once trained, these models avoid the
expensive computational overhead associated with quantum mechanical
calculations, while maintaining chemical interpretability. We illustrate
examples for which fast predictions of bond dissociation enthalpy
and of the identities of radicals formed through cleavage of a molecule’s
we...