We introduce a machine learning model to predict atomization energies of a diverse set of organic molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular Schrödinger equation is mapped onto a non-linear statistical regression problem of reduced complexity. Regression models are trained on and compared to atomization energies computed with hybrid density-functional theory. Cross-validation over more than seven thousand small organic molecules yields a mean absolute error of ∼10 kcal/mol. Applicability is demonstrated for the prediction of molecular atomization potential energy curves.
Solving the Schrödinger equation (SE), $H\Psi = E\Psi$, for assemblies of atoms is a fundamental problem in quantum mechanics. Alas, solutions that are exact up to numerical precision are intractable for all but the smallest systems with very few atoms. Hierarchies of approximations have evolved, usually trading accuracy for computational efficiency [1]. Conventionally, the external potential, defined by a set of nuclear charges $\{Z_I\}$ and atomic positions $\{R_I\}$, uniquely determines the Hamiltonian $H$ of any system, and thereby the potential energy by optimizing $\Psi$. For a diverse set of organic molecules, we show that one can use machine learning (ML) instead, $\{Z_I, R_I\} \xrightarrow{\mathrm{ML}} E$. Thus, we circumvent the task of explicitly solving the SE by training a machine once on a finite subset of known solutions. Since many interesting questions in physics require repeatedly solving the SE, the highly competitive performance of our ML approach may pave the way to large-scale exploration of molecular energies in chemical compound space [3,4]. ML techniques have recently been used with success to map the problem of solving complex physical differential equations to statistical models.
Successful attempts include solving Fokker-Planck stochastic differential equations [5], parameterizing interatomic force fields for fixed chemical composition [6,7], and the discovery of novel ternary oxides for batteries [8]. Motivated by these and other related efforts [9-12], we develop a non-linear regression ML model for computing molecular atomization energies in chemical compound space [3]. Our model is based on a measure of distance in compound space that accounts for both stoichiometry and configurational variation. After training, energies are predicted for new (out-of-sample) molecular systems, differing in composition and geometry, at negligible computational cost, i.e., milliseconds instead of hours on a conventional CPU. While the model is trained and tested using atomization energies calculated at the hybrid density-functional theory (DFT) level [2,13,14], any other training set or level of theory could be used as a starting point for subsequent ML training. Cross-validation on 7165 molecules yields a mean absolute error of 9.9 kcal/mol, which is an order of magnitude more accurate than counting bonds or semi-empirical quantum chemistry.
We use the GDB database, a library of nearly one billion organic molecules that ...
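The regression setup described above (a non-linear model built on a distance between molecules and evaluated by cross-validation) can be illustrated with a minimal sketch. The excerpt does not spell out the descriptor, kernel, or hyperparameters, so everything concrete here is an assumption: a Coulomb-matrix-style descriptor built from $\{Z_I, R_I\}$, a Gaussian kernel, kernel ridge regression, and the function names `coulomb_matrix`, `descriptor`, `krr_fit`, `krr_predict`, and `cross_val_mae`. The toy geometries and energies are placeholders, not data from the paper.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Assumed descriptor: Coulomb-matrix-style encoding of charges Z and
    positions R (atomic units assumed)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def descriptor(Z, R, size):
    """Eigenvalues sorted by decreasing magnitude, zero-padded to `size`."""
    eig = np.linalg.eigvalsh(coulomb_matrix(Z, R))
    eig = eig[np.argsort(-np.abs(eig))]
    out = np.zeros(size)
    out[:len(eig)] = eig
    return out

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel on the Euclidean distance between descriptor rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    """Predict energies as a kernel-weighted sum over training molecules."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def cross_val_mae(X, y, sigma, lam, k=5, seed=0):
    """k-fold cross-validated mean absolute error."""
    idx = np.random.default_rng(seed).permutation(len(X))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        alpha = krr_fit(X[train], y[train], sigma, lam)
        pred = krr_predict(X[train], alpha, X[fold], sigma)
        errs.append(np.abs(pred - y[fold]))
    return float(np.concatenate(errs).mean())

# Toy usage: stretched H2-like diatomics with placeholder energies (not real data).
molecules = [([1, 1], [[0, 0, 0], [0, 0, d]]) for d in (1.0, 1.2, 1.4, 1.6, 1.8, 2.0)]
X = np.array([descriptor(Z, R, size=2) for Z, R in molecules])
y = np.array([-0.10, -0.16, -0.17, -0.16, -0.15, -0.13])
print(cross_val_mae(X, y, sigma=0.5, lam=1e-6, k=3))
```

In practice `sigma` and `lam` would be chosen by an inner cross-validation loop rather than fixed by hand, and the descriptor length `size` would be set to the largest molecule in the data set.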
Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C7H10O2, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.
Chemically accurate and comprehensive studies of the virtual space of all possible molecules are severely limited by the computational cost of quantum chemistry. We introduce a composite strategy that adds machine learning corrections to computationally inexpensive approximate legacy quantum methods. After training, highly accurate predictions of enthalpies, free energies, entropies, and electron correlation energies are possible, for significantly larger molecular sets than used for training. For thermochemical properties of up to 16k isomers of C7H10O2 we present numerical evidence that chemical accuracy can be reached. We also predict electron correlation energy in post Hartree-Fock methods, at the computational cost of Hartree-Fock, and we establish a qualitative relationship between molecular entropy and electron correlation. The transferability of our approach is demonstrated, using semiempirical quantum chemistry and machine learning models trained on 1 and 10% of 134k organic molecules, to reproduce enthalpies of all remaining molecules at density functional theory level of accuracy.
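The composite strategy in this abstract, an ML correction added on top of an inexpensive baseline method, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Laplacian kernel, the kernel ridge regression solver, the hyperparameter defaults, and the names `fit_delta_model` and `predict_with_correction` are all hypothetical choices made for the example.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """Assumed kernel: exp(-L1 distance / sigma) between descriptor rows."""
    d1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-d1 / sigma)

def fit_delta_model(X, y_baseline, y_target, sigma=1.0, lam=1e-8):
    """Fit kernel ridge regression to the difference target - baseline."""
    K = laplacian_kernel(X, X, sigma)
    delta = np.asarray(y_target, dtype=float) - np.asarray(y_baseline, dtype=float)
    return np.linalg.solve(K + lam * np.eye(len(X)), delta)

def predict_with_correction(X_train, alpha, X_new, y_baseline_new, sigma=1.0):
    """Target-level estimate = cheap baseline value + learned correction."""
    correction = laplacian_kernel(X_new, X_train, sigma) @ alpha
    return np.asarray(y_baseline_new, dtype=float) + correction
```

The design point is that the model only has to learn the (typically smoother and smaller) difference between the two levels of theory, while the baseline calculation still has to be run for every new molecule.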