Conspectus
Catalyst design in enantioselective catalysis has historically
been driven by empiricism. In this endeavor, experimentalists attempt
to qualitatively identify trends in structure that lead to a desired
catalyst function. In this body of work, we lay the groundwork for
an improved, alternative workflow that uses quantitative methods to
inform decision making at every step of the process. At the outset,
we define a library of synthetically accessible permutations of a
catalyst scaffold with the philosophy that the library contains every
potential catalyst we are willing to make. To represent these chiral
molecules, we have developed general 3D representations, which can
be calculated for tens of thousands of structures. This defines the
total chemical space of a given catalyst scaffold; it is constructed
on the basis of catalyst structure only without regard to a specific
reaction or mechanism. As such, any unsupervised algorithmic subset selection
method (i.e., one that considers only catalyst structure) should
provide an ideal initial screening set for any new reaction that can
be catalyzed by that scaffold. Notably, because of this design strategy,
the same set of catalysts can be used for any reaction that can be
catalyzed with that parent catalyst scaffold. These catalysts are tested experimentally,
and statistical learning tools can be used to create a model relating
catalyst structure to catalyst function. Further, this model can be
used to predict the performance of each catalyst candidate in the
greater database of virtual catalyst candidates. In this way, it is
possible to estimate the performance of tens of thousands of catalysts
by experimentally testing a smaller subset. Using error assessment
metrics, it is possible to understand the confidence in new predictions.
An experimentalist using this tool can balance the predicted results
(reward) with the prediction confidence (risk) when deciding which
catalysts to synthesize next in an optimization campaign. These catalysts
are synthesized and tested experimentally. At this stage, either the
optimization is a success or the predicted values were incorrect and
further optimization is required. In the latter case, the new results
can be fed back into the statistical learning model to refine
it, and this iterative process can be used to determine the optimal
catalyst. In this body of work, we not only establish this workflow
but quantitatively establish how best to execute each step. Herein,
we evaluate several 3D molecular representations to determine how
best to represent molecules. Several selection protocols are examined
to best decide which set of molecules can be used to represent the
library of interest. In addition, the number of reactions needed to
make accurate statistical learning models is evaluated. Taken together,
these components establish a tool ready to progress from the development
stage to the utility stage. As such, current research endeavors focus
on applying these tools to optimize new reactions.
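
The loop described above (unsupervised subset selection, model fitting on the tested subset, then ranking untested candidates by predicted reward against prediction risk) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the method used in this work: farthest-point (max-min) sampling stands in for the selection protocol, a k-nearest-neighbor average stands in for the statistical learning model, distance to the nearest tested catalyst stands in for an error assessment metric, and the names `farthest_point_subset`, `knn_predict`, `rank_candidates`, `risk_weight`, and `toy_selectivity` are hypothetical. The two-component descriptor vectors and the response function are synthetic placeholders, not real catalyst data.

```python
import math
import random


def farthest_point_subset(descriptors, k, start=0):
    """Unsupervised max-min selection: choose k indices spread over descriptor space.

    Only catalyst descriptors are used; no reaction data enters the selection.
    """
    chosen = [start]
    # Distance from every library member to its nearest already-chosen member.
    min_dist = [math.dist(d, descriptors[start]) for d in descriptors]
    while len(chosen) < k:
        nxt = max(range(len(descriptors)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, d in enumerate(descriptors):
            min_dist[i] = min(min_dist[i], math.dist(d, descriptors[nxt]))
    return chosen


def knn_predict(x, train_x, train_y, k=3):
    """Return (prediction, risk) for one candidate.

    prediction: mean measured value of the k nearest tested catalysts.
    risk: distance to the nearest tested catalyst (a crude confidence proxy).
    """
    ranked = sorted((math.dist(x, tx), ty) for tx, ty in zip(train_x, train_y))
    nearest = ranked[:k]
    prediction = sum(y for _, y in nearest) / len(nearest)
    risk = nearest[0][0]
    return prediction, risk


def rank_candidates(library, tested_idx, measured, risk_weight=1.0, k=3):
    """Rank untested library members by predicted reward minus weighted risk.

    risk_weight is a hypothetical trade-off parameter chosen by the user.
    Returns a list of (score, index, prediction, risk), best score first.
    """
    tested = set(tested_idx)
    train_x = [library[i] for i in tested_idx]
    scored = []
    for i, x in enumerate(library):
        if i in tested:
            continue
        pred, risk = knn_predict(x, train_x, measured, k)
        scored.append((pred - risk_weight * risk, i, pred, risk))
    scored.sort(reverse=True)
    return scored


def toy_selectivity(x):
    """Synthetic stand-in for an experimental measurement (not real data)."""
    return -((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)


# One pass of the loop on a synthetic library of 200 two-component descriptors.
random.seed(0)
library = [(random.random(), random.random()) for _ in range(200)]
screen = farthest_point_subset(library, k=10)          # initial screening set
measured = [toy_selectivity(library[i]) for i in screen]  # "experiments"
for score, idx, pred, risk in rank_candidates(library, screen, measured, 0.5)[:5]:
    print(f"candidate {idx}: predicted {pred:.3f}, risk {risk:.3f}, score {score:.3f}")
```

In an iterative campaign, the top-ranked candidates would be synthesized and tested, their indices and results appended to `screen` and `measured`, and `rank_candidates` called again. A larger `risk_weight` favors candidates close to catalysts that have already been tested, whereas a smaller value favors candidates with high predicted performance even when they lie far from the tested set.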