Simultaneously accurate and efficient prediction of molecular properties throughout chemical compound space is a critical ingredient toward rational compound design in chemical and pharmaceutical industries. Aiming toward this goal, we develop and apply a systematic hierarchy of efficient empirical methods to estimate atomization and total energies of molecules. These methods range from a simple sum over atoms, to addition of bond energies, to pairwise interatomic force fields, reaching to the more sophisticated machine learning approaches that are capable of describing collective interactions between many atoms or bonds. In the case of equilibrium molecular geometries, even simple pairwise force fields demonstrate prediction accuracy comparable to benchmark energies calculated using density functional theory with hybrid exchange-correlation functionals; however, accounting for the collective many-body interactions proves to be essential for approaching the “holy grail” of chemical accuracy of 1 kcal/mol for both equilibrium and out-of-equilibrium geometries. This remarkable accuracy is achieved by a vectorized representation of molecules (so-called Bag of Bonds model) that exhibits strong nonlocality in chemical space. In addition, the same representation allows us to predict accurate electronic properties of molecules, such as their polarizability and molecular frontier orbital energies.
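The vectorized representation mentioned in this abstract can be illustrated with a simplified sketch. The abstract gives no construction details, so everything here is an assumption: each pair entry is taken to be the Coulomb-matrix-style term Z_i Z_j / r_ij, pairs are grouped into "bags" by element pair, and each bag is sorted and zero-padded to a fixed length. The function name `bag_of_bonds` and the `bag_sizes` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def bag_of_bonds(charges, coords, bag_sizes):
    """Build a fixed-length vector from pairwise terms Z_i * Z_j / r_ij.

    charges   : sequence of nuclear charges, one per atom
    coords    : (n_atoms, 3) array of positions
    bag_sizes : dict mapping a sorted charge pair, e.g. (1, 8), to the
                padded length of that bag (hypothetical parameter)
    """
    coords = np.asarray(coords, dtype=float)
    bags = defaultdict(list)
    for i, j in combinations(range(len(charges)), 2):
        key = tuple(sorted((charges[i], charges[j])))
        dist = np.linalg.norm(coords[i] - coords[j])
        bags[key].append(charges[i] * charges[j] / dist)

    parts = []
    for key in sorted(bag_sizes):
        # Sorting inside each bag makes the vector independent of atom order.
        values = sorted(bags.get(key, []), reverse=True)
        if len(values) > bag_sizes[key]:
            raise ValueError(f"bag {key} is too small for this molecule")
        padded = np.zeros(bag_sizes[key])
        padded[: len(values)] = values
        parts.append(padded)
    return np.concatenate(parts)


# Toy three-atom geometry in arbitrary units (not a real structure).
sizes = {(1, 1): 1, (1, 8): 3}
vec = bag_of_bonds([8, 1, 1], [[0, 0, 0], [1, 0, 0], [0, 1, 0]], sizes)
```

Because every bag is sorted and padded, molecules of different size map to vectors of the same length, which is what lets a standard regression model compare them.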
The accurate and reliable prediction of properties of molecules typically requires computationally intensive quantum-chemical calculations. Recently, machine learning techniques applied to ab initio calculations have been proposed as an efficient approach for describing the energies of molecules in their given ground-state structure throughout chemical compound space (Rupp et al. Phys. Rev. Lett. 2012, 108, 058301). In this paper we outline a number of established machine learning techniques and investigate the influence of the molecular representation on the methods' performance. The best methods achieve prediction errors of 3 kcal/mol for the atomization energies of a wide variety of molecules. Rationales for this performance improvement are given together with pitfalls and challenges when applying machine learning approaches to the prediction of quantum-mechanical observables.
The combination of modern scientific computing with electronic structure theory can lead to an unprecedented amount of data amenable to intelligent data analysis for the identification of meaningful, novel and predictive structure-property relationships. Such relationships enable high-throughput screening for relevant properties in an exponentially growing pool of virtual compounds that are synthetically accessible. Here, we present a machine learning model, trained on a database of ab initio calculation results for thousands of organic molecules, that simultaneously predicts multiple electronic ...
Machine learning is used to approximate density functionals. For the model problem of the kinetic energy of non-interacting fermions in 1d, mean absolute errors below 1 kcal/mol on test densities similar to the training set are reached with fewer than 100 training densities. A predictor identifies if a test density is within the interpolation region. Via principal component analysis, a projected functional derivative finds highly accurate self-consistent densities. Challenges for application of our method to real electronic structure problems are discussed.
PACS numbers: 31.15. 02.60.Gf, 89.20.Ff
Each year, more than 10,000 papers report solutions to electronic structure problems using Kohn-Sham (KS) density functional theory (DFT) [1,2]. All approximate the exchange-correlation (XC) energy as a functional of the electronic spin densities. The quality of the results depends crucially on these density functional approximations. For example, present approximations often fail for strongly correlated systems, rendering the methodology useless for some of the most interesting problems.
Thus, there is a never-ending search for improved XC approximations. The original local density approximation (LDA) of Kohn and Sham [2] is uniquely defined by the properties of the uniform gas, and has been argued to be a universal limit of all systems [3]. But the refinements that have proven useful in chemistry [4] and materials [5] are not, and differ both in their derivations and details. Traditionally, physicists favor a non-empirical approach, deriving approximations from quantum mechanics and avoiding fitting to specific finite systems [6]. Such non-empirical functionals can be considered controlled extrapolations that work well across a broad range of systems and properties, bridging the divide between molecules and solids. Chemists typically use a few [7,8] or several dozen [9] parameters to improve accuracy on a limited class of molecules.
Empirical functionals are limited interpolations that are more accurate for the molecular systems they are fitted to, but often fail for solids. Passionate debates are fueled by this cultural divide.
Machine learning (ML) is a powerful tool for finding patterns in high-dimensional data. ML employs algorithms by which the computer learns from empirical data via induction, and has been very successful in many applications [10][11][12]. In ML, intuition is used to choose the basic mechanism and representation of the data, but not directly applied to the details of the model. Mean errors can be systematically decreased with increasing number of inputs. In contrast, human-designed empirical approximations employ standard forms derived from general principles, fitting the parameters to training sets. These explore only an infinitesimal fraction of all possible functionals and use relatively few data points.
DFT works for electronic structure because the underlying many-body Hamiltonian is simple, while accurate solution of the Schrödinger equation is very demanding. All electrons Coulomb repel ...
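The idea of learning a functional from example densities can be sketched with kernel ridge regression on densities sampled on a grid. The excerpt above does not name the regression method, so the Gaussian kernel, the ridge parameter `lam`, the kernel width `sigma`, and all function names are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel between rows of a and rows of b (densities on a grid)."""
    a = np.atleast_2d(np.asarray(a, dtype=float))
    b = np.atleast_2d(np.asarray(b, dtype=float))
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def fit_krr(train_densities, train_energies, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(train_densities, train_densities, sigma)
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), train_energies)


def predict_krr(train_densities, alpha, test_densities, sigma):
    """Predict energies as a kernel-weighted sum over the training densities."""
    return gaussian_kernel(test_densities, train_densities, sigma) @ alpha


# Toy data only: random vectors stand in for discretized densities, and the
# "energy" is an arbitrary smooth function of them, not a physical functional.
rng = np.random.default_rng(0)
densities = rng.random((5, 3))
energies = (densities**2).sum(axis=1)
alpha = fit_krr(densities, energies, sigma=1.0, lam=1e-10)
fitted = predict_krr(densities, alpha, densities, sigma=1.0)
```

With a very small `lam` the model nearly interpolates the training points; a larger value trades training accuracy for smoothness, and both hyperparameters would normally be chosen by cross-validation.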
The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1.
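A DM based on the standard deviation within an ensemble can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a hypothetical array of per-model predictions, the function names are invented here, and the selection rule (keep the fraction of compounds with the smallest DM) is one plausible way to pick the most reliable predictions, not necessarily the study's procedure.

```python
import numpy as np


def ensemble_std_dm(predictions):
    """Distance to model as the sample standard deviation across the ensemble.

    predictions : (n_models, n_compounds) array, one row per model
    """
    predictions = np.asarray(predictions, dtype=float)
    return predictions.std(axis=0, ddof=1)


def most_reliable(predictions, fraction):
    """Indices of the compounds with the lowest DM, keeping the given fraction."""
    dm = ensemble_std_dm(predictions)
    n_keep = max(1, int(round(fraction * dm.size)))
    return np.argsort(dm, kind="stable")[:n_keep]


# Made-up scores from three models for three compounds.
scores = [
    [0.9, 0.2, 0.5],
    [0.9, 0.8, 0.4],
    [0.9, 0.5, 0.6],
]
dm = ensemble_std_dm(scores)
kept = most_reliable(scores, fraction=0.67)
```

Compounds on which the models agree get a small DM and are treated as lying inside the applicability domain; strong disagreement flags predictions that should be checked experimentally.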