First-principles
prediction of nuclear magnetic resonance chemical
shifts plays an increasingly important role in the interpretation
of experimental spectra, but the required density functional theory
(DFT) calculations can be computationally expensive. Promising machine
learning models for predicting chemical shieldings in general organic
molecules have been developed previously, though the accuracy of those
models remains below that of DFT. The present study demonstrates that
much more accurate chemical shieldings can be obtained via the Δ-machine
learning approach, with the result that the errors introduced by the
machine learning model are only one-half to one-third the errors expected
for DFT chemical shifts relative to experiment. Specifically, an ensemble
of neural networks is trained to correct PBE0/6-31G chemical shieldings
up to the target level of PBE0/6-311+G(2d,p). The ensemble predicts ¹H, ¹³C, ¹⁵N, and ¹⁷O chemical
shieldings with root-mean-square errors of 0.11, 0.70, 1.69, and 2.47
ppm, respectively. At the same time, the Δ-machine learning
approach is 1–2 orders of magnitude faster than the target
large-basis calculations. It is also demonstrated that the machine
learning model predicts experimental solution-phase NMR chemical shifts
in drug molecules with only modestly worse accuracy than the target
DFT model. Finally, the ability to estimate the uncertainty in the
predicted shieldings from variations within the ensemble of neural
network models is assessed.
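The Δ-learning idea summarized above can be illustrated with a minimal sketch. It rests on loudly stated assumptions: a single scalar descriptor per atom and bootstrap-resampled linear fits stand in for the neural networks and atomic-environment features of the actual study, and all function names and the synthetic data are hypothetical. Each ensemble member learns the correction (large-basis shielding minus small-basis shielding); the prediction is the small-basis value plus the mean correction, and the spread across members serves as a rough uncertainty estimate.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ a*x + b for one scalar descriptor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    return a, my - a * mx


def train_delta_ensemble(descriptors, sigma_small, sigma_large, n_models=8, seed=0):
    """Fit each ensemble member to the correction sigma_large - sigma_small.

    Members differ only by bootstrap resampling of the training set
    (an assumption of this sketch, not necessarily the published protocol).
    """
    rng = random.Random(seed)
    deltas = [hi - lo for lo, hi in zip(sigma_small, sigma_large)]
    n = len(descriptors)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([descriptors[i] for i in idx],
                               [deltas[i] for i in idx]))
    return models


def predict_shielding(models, descriptor, sigma_small):
    """Return (corrected shielding, ensemble spread) for one atom."""
    corrections = [a * descriptor + b for a, b in models]
    return (sigma_small + statistics.fmean(corrections),
            statistics.pstdev(corrections))


# Synthetic data: the "true" correction is 1.5*x - 0.4 ppm plus small noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
small = [30.0 + 5.0 * x for x in xs]
large = [s + 1.5 * x - 0.4 + data_rng.gauss(0.0, 0.01) for s, x in zip(small, xs)]

models = train_delta_ensemble(xs, small, large)
pred, spread = predict_shielding(models, descriptor=0.5, sigma_small=32.5)
```

Because the model only has to learn the small difference between the two basis sets rather than the full shielding, its errors are added to an already reasonable baseline, which is the motivation for the Δ approach.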
Ab initio nuclear magnetic resonance chemical shift prediction provides an important tool for interpreting and assigning experimental spectra, but it becomes computationally prohibitive in large systems. The computational costs can be reduced considerably by fragmentation of the large system into a series of contributions from many smaller subsystems. However, the presence of charged functional groups and the need to partition the system across covalent bonds create complications in biomolecules that typically require the use of large fragments and careful descriptions of the electrostatic environment. The present work shows how a model that combines chemical shielding contributions from non-overlapping monomer and dimer fragments embedded in a polarizable continuum model provides a simple, easy-to-implement, and computationally inexpensive approach for predicting chemical shifts in complex systems. The model's performance proves rather insensitive to the continuum dielectric constant, making the …
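The monomer-plus-dimer combination described above can be sketched as a generic two-body expansion. This is an assumption-laden illustration: the abstract does not spell out the exact formula, so the code uses the standard form in which the shielding of an atom in monomer i is its monomer value plus the pairwise change induced by each neighboring monomer j. The function name and the numerical values are hypothetical, and every fragment calculation is presumed to have been run in the continuum beforehand.

```python
def two_body_shielding(sigma_monomer, sigma_in_dimers):
    """Two-body estimate of one atom's isotropic shielding (ppm).

    sigma_monomer   : shielding of the atom computed in its own monomer i
    sigma_in_dimers : shieldings of the same atom computed in each dimer (i, j)

    Each dimer contributes only its change relative to the monomer,
    so the monomer value is not double counted.
    """
    pair_corrections = (s_ij - sigma_monomer for s_ij in sigma_in_dimers)
    return sigma_monomer + sum(pair_corrections)


# Hypothetical values: one atom, two neighboring fragments.
sigma_total = two_body_shielding(30.0, [30.5, 29.8])
```

With no neighboring fragments the function simply returns the monomer shielding, which is a quick sanity check on the bookkeeping.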
Machine
learning (ML) offers an attractive method for making predictions
about molecular systems while circumventing the need to run expensive
electronic structure calculations. Once trained on ab initio data,
ML models promise to deliver accurate predictions of molecular
properties that were previously computationally infeasible. In this
work, we develop and train a graph neural network model to correct
the basis set incompleteness error (BSIE) between a small and a large
basis set at the RHF and B3LYP levels of theory. Our results show
that, when compared to fitting to the total potential, an ML model
fitted to correct the BSIE is better at generalizing to systems not
seen during training. We test this ability by training on single molecules
while evaluating on molecular complexes. We also show that ensemble
models yield better-behaved potentials in situations where the training
data are insufficient. However, even when fitting only to the BSIE,
acceptable performance is achieved only when the training data sufficiently
resemble the systems to be predicted. The test error
of the final model trained to predict the difference between the cc-pVDZ
and cc-pV5Z potentials is 0.184 kcal/mol for the B3LYP density functional,
and the ensemble model accurately reproduces the large basis set interaction
energy curves on the S66x8 dataset.
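How a BSIE correction feeds into an interaction energy can be shown with a short sketch. Everything named here is hypothetical: `bsie_model` stands in for the trained graph neural network and is replaced by a lookup table of made-up corrections, the energies are illustrative numbers in kcal/mol, and the interaction energy is taken as the plain difference between the complex and its monomers, with no counterpoise treatment.

```python
def corrected_energy(e_small, system, bsie_model):
    """Small-basis energy plus the ML-predicted basis set correction."""
    return e_small + bsie_model(system)


def interaction_energy(e_complex, e_monomer_a, e_monomer_b):
    """Supermolecular interaction energy: E(AB) - E(A) - E(B)."""
    return e_complex - e_monomer_a - e_monomer_b


# Hypothetical small-basis energies and ML corrections (kcal/mol).
small_basis = {"AB": -100.0, "A": -60.0, "B": -39.0}
toy_corrections = {"AB": -0.50, "A": -0.30, "B": -0.15}


def bsie_model(system):
    """Stand-in for a trained model: returns a tabulated correction."""
    return toy_corrections[system]


large_basis_estimate = {
    name: corrected_energy(e, name, bsie_model) for name, e in small_basis.items()
}
e_int_small = interaction_energy(
    small_basis["AB"], small_basis["A"], small_basis["B"]
)
e_int_corrected = interaction_energy(
    large_basis_estimate["AB"], large_basis_estimate["A"], large_basis_estimate["B"]
)
```

In this toy case the correction shifts the interaction energy from -1.00 to -1.05 kcal/mol. The complex's correction is not the sum of the monomer corrections, which is why a model trained only on single molecules must generalize to complexes for the interaction energy to come out right.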