The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of generating data to train such ML potentials is neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement among an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub), which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original randomly sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training on only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
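The Query by Committee step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy energies, and the threshold are hypothetical, and dividing the ensemble standard deviation by the square root of the atom count is an assumed normalization chosen so that molecules of different sizes can be compared.

```python
import numpy as np


def committee_disagreement(energies, n_atoms):
    """Per-structure disagreement of an ensemble of potentials.

    energies: array of shape (n_models, n_structures), one row per
              committee member (hypothetical predicted energies).
    n_atoms:  array of shape (n_structures,), atoms in each structure.

    Assumption: disagreement is the standard deviation across models,
    scaled by 1/sqrt(n_atoms).
    """
    energies = np.asarray(energies, dtype=float)
    n_atoms = np.asarray(n_atoms, dtype=float)
    sigma = energies.std(axis=0)  # spread across committee members
    return sigma / np.sqrt(n_atoms)


def select_for_labeling(energies, n_atoms, threshold):
    """Indices of structures whose disagreement exceeds the threshold.

    These are the candidates that would be sent for new reference
    calculations and added to the training set.
    """
    rho = committee_disagreement(energies, n_atoms)
    return np.flatnonzero(rho > threshold)


# Toy data: three committee members, three candidate structures.
energies = np.array([
    [-10.00, -20.00, -30.0],
    [-10.01, -20.00, -31.0],
    [-9.99, -20.00, -29.0],
])
n_atoms = np.array([4, 4, 4])

rho = committee_disagreement(energies, n_atoms)
selected = select_for_labeling(energies, n_atoms, threshold=0.1)
```

In this toy case only the third structure, where the committee members disagree strongly, is selected. In an active learning loop the selected structures would be labeled with the reference method, the ensemble retrained, and the query repeated.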
Computational modeling of chemical and biological systems at atomic resolution is a crucial tool in the chemist’s toolset. The use of computer simulations requires a balance between cost and accuracy: quantum-mechanical methods provide high accuracy but are computationally expensive and scale poorly to large systems, while classical force fields are cheap and scalable, but lack transferability to new systems. Machine learning can be used to achieve the best of both approaches. Here we train a general-purpose neural network potential (ANI-1ccx) that approaches CCSD(T)/CBS accuracy on benchmarks for reaction thermochemistry, isomerization, and drug-like molecular torsions. This is achieved by training a network on DFT data and then using transfer learning techniques to retrain it on a dataset of gold-standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.
Optically active molecular materials, such as organic conjugated polymers and biological systems, are characterized by strong coupling between electronic and vibrational degrees of freedom. Typically, simulations must go beyond the Born-Oppenheimer approximation to account for non-adiabatic coupling between excited states. Indeed, non-adiabatic dynamics is commonly associated with exciton dynamics and photophysics involving charge and energy transfer, as well as exciton dissociation and charge recombination. Understanding the photoinduced dynamics in such materials is vital to providing an accurate description of exciton formation, evolution, and decay. This interdisciplinary field has matured significantly over the past decades. Formulation of new theoretical frameworks, development of more efficient and accurate computational algorithms, and evolution of high-performance computer hardware have extended these simulations to very large molecular systems with hundreds of atoms, including numerous studies of organic semiconductors and biomolecules. In this Review, we will describe recent theoretical advances including treatment of electronic decoherence in surface-hopping methods, the role of solvent effects, trivial unavoided crossings, analysis of data based on transition densities, and efficient computational implementations of these numerical methods. We also emphasize newly developed semiclassical approaches, based on the Gaussian approximation, which retain phase and width information to account for significant decoherence and interference effects while maintaining the high efficiency of surface-hopping approaches. The above developments have been employed to successfully describe photophysics in a variety of molecular materials.
Myoglobin (Mb) double mutant T67R/S92D displays peroxidase enzymatic activity, in contrast to the wild-type protein. The CO adduct of T67R/S92D shows two CO absorption bands corresponding to the A1 and A3 substates. The equilibrium protein dynamics for the two distinct substates of the Mb double mutant are investigated using two-dimensional infrared (2D IR) vibrational echo spectroscopy and molecular dynamics (MD) simulations. The time-dependent changes in the 2D IR vibrational echo line shapes for both substates are analyzed using the center line slope (CLS) method to obtain the frequency-frequency correlation function (FFCF). The results for the double mutant are compared to those from the wild-type Mb. The experimentally determined FFCF is compared to the FFCF obtained from molecular dynamics simulations, thereby testing the capacity of a force field to determine the amplitudes and time scales of protein structural fluctuations on fast timescales. The results provide insights into the nature of the energy landscape around the free energy minimum of the folded protein structure.
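As a rough illustration of the simulation side of that comparison, the sketch below computes an FFCF as the time autocorrelation of frequency fluctuations, C(t) = <δω(0) δω(t)>. It assumes a trajectory of instantaneous CO stretch frequencies is already available; how those frequencies are derived from the MD snapshots is not stated in the abstract and is not modeled here. The function name and the toy trajectory are hypothetical.

```python
import numpy as np


def ffcf(frequencies, max_lag):
    """Frequency-frequency correlation function from a frequency trajectory.

    frequencies: 1D sequence of instantaneous frequencies sampled at
                 equal time steps (assumed input, e.g. in cm^-1).
    max_lag:     number of lag steps to evaluate (lags 0 .. max_lag - 1).

    Returns C[lag] = mean over time origins of dw(t) * dw(t + lag),
    where dw is the deviation from the trajectory-average frequency.
    """
    w = np.asarray(frequencies, dtype=float)
    dw = w - w.mean()  # fluctuation about the mean frequency
    n = dw.size
    return np.array(
        [np.mean(dw[: n - lag] * dw[lag:]) for lag in range(max_lag)]
    )


# Toy trajectory: frequency alternating by +/- 1 cm^-1 around 1945 cm^-1.
trajectory = [1946.0, 1944.0, 1946.0, 1944.0]
c = ffcf(trajectory, max_lag=3)
```

The value at zero lag is the variance of the frequency fluctuations, and the decay of C(t) with lag gives the time scales that would be compared against the experimentally determined FFCF.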