Modern machine learning provides promising methods for accelerating the discovery and characterization of novel chemical species. However, in many areas experimental data remain costly and scarce, and computational models are unavailable for targeted figures of merit. Here we report a pathway to address this challenge using chemical latent space enrichment, whereby disparate data sources are combined in joint prediction tasks to improve prediction in data-scarce applications. The approach is demonstrated for pKa prediction of moderately sized molecular species using a combination of experimentally available pKa data and density functional theory-based characterizations of the (de)protonation free energy. A novel autoencoder framework is used to create a continuous chemical latent space that is then used in single and joint training tasks for property prediction. By combining these two data sets in a jointly trained autoencoder framework, we observe mutual improvement in property prediction tasks in the scarce-data limit. We also demonstrate an enrichment mechanism that is unique to latent space training, whereby training on excess computational data can mitigate the prediction losses associated with scarce experimental data and advantageously organize the latent space. These results demonstrate that disparate chemical data sources can be advantageously combined in an autoencoder framework, with potential general application to data-scarce chemical learning tasks.
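The joint training idea described above can be sketched as a weighted multi-task objective in which a task term is simply dropped when its label is missing, so abundant computational labels can still shape the shared latent space when experimental labels are scarce. This is a minimal illustrative sketch, not the paper's implementation; the function name, weights, and loss terms are assumptions.

```python
# Hypothetical sketch of a joint training objective combining autoencoder
# reconstruction loss with two property-prediction losses (experimental pKa
# and a DFT (de)protonation free energy). Weights and names are illustrative.
def joint_loss(recon_loss, exp_loss, dft_loss,
               w_recon=1.0, w_exp=1.0, w_dft=1.0,
               have_exp=True, have_dft=True):
    """Combine per-sample losses. Missing labels (the data-scarce case)
    drop that task's term, so excess DFT data can still organize the
    shared latent space when experimental labels are absent."""
    total = w_recon * recon_loss
    if have_exp:
        total += w_exp * exp_loss
    if have_dft:
        total += w_dft * dft_loss
    return total
```

In practice each term would come from the autoencoder's decoder and two property heads attached to the same latent vector; the masking pattern is what allows the two disparate data sets to be mixed in one training loop.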
Transfer learning is a subfield of machine learning that leverages proficiency in one or more prediction tasks to improve proficiency in a related task. For chemical property prediction, transfer learning models represent a promising approach for addressing the data scarcity limitations of many properties by utilizing potentially abundant data from one or more adjacent applications. Transfer learning models typically utilize a latent variable that is common to several prediction tasks and provides a mechanism for information exchange between tasks. For chemical applications, it is still largely unknown how correlation between the prediction tasks affects performance, how many tasks can be simultaneously trained in these models before incurring performance degradation, and whether transfer learning positively or negatively affects ancillary model properties. Here we investigate these questions using an autoencoder latent space as a latent variable for transfer learning models for predicting properties from the QM9 data set that have been supplemented with semiempirical quantum chemistry calculations. We demonstrate that property prediction can be counterintuitively improved by utilizing a simpler linear predictor model, which has the effect of forcing the latent space to organize linearly with respect to each property. In data-scarce prediction tasks, the transfer learning improvement is dramatic, whereas in data-rich prediction tasks, there appears to be little adverse impact of transfer learning on prediction performance. The transfer learning approach demonstrated here thus represents a highly advantageous supplement to property prediction models with no downside in implementation.
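The "simpler linear predictor" effect can be made concrete: if the readout head is constrained to be linear, the encoder must arrange the latent space so that each property varies linearly along some direction. As a toy sketch (assumed setup, not the paper's code), a closed-form least-squares fit of a property against a single latent coordinate plays the role of that linear head.

```python
# Sketch: a linear readout of a property y from one latent coordinate z.
# Using such a head during training pressures the encoder to organize the
# latent space linearly with respect to y. Closed-form least squares:
def linear_fit(z, y):
    """Return (slope, intercept) minimizing sum((y - slope*z - intercept)^2)."""
    n = len(z)
    mz = sum(z) / n
    my = sum(y) / n
    cov = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    var = sum((zi - mz) ** 2 for zi in z)
    slope = cov / var
    return slope, my - slope * mz
```

A more expressive nonlinear head could fit the same data with a disorganized latent space, which is the intuition for why the linear head counterintuitively helps downstream tasks.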
Gelatin is a popular material for the creation of tissue phantoms due to its ease of use, safety, low relative cost, and amenability to tuning physical properties through the use of additives. One difficulty that arises when using gelatin, especially in low concentrations, is the brittleness of the material. In this paper, we show that small additions of another common biological polymer, sodium alginate, significantly increase the toughness of gelatin without changing the Young’s modulus or other low-strain stress relaxation properties of the material. Samples were characterized using ramp-hold stress relaxation tests. The experimental data from these tests were then fit to the Generalized Maxwell (GM) model, as well as two models based on a fractional calculus approach: the Kelvin–Voigt Fractional Derivative (KVFD) and Fractional Maxwell (FM) models. We found that for our samples, the fractional models provided better fits with fewer parameters, and at strains within the linear elastic region, the linear viscoelastic parameters of the alginate/gelatin and pure gelatin samples were essentially indistinguishable. When the same ramp-hold stress relaxation experiments were run at high strains outside of the linear elastic region, we observed a shift in stress relaxation to shorter time scales with increasing sodium alginate addition, which may be associated with an increase in fluidity within the gelatin matrix. This leads us to believe that sodium alginate acts to enhance the viscosity within the fluidic region of the gelatin matrix, providing additional energy dissipation without raising the modulus of the material. These results are applicable to anyone desiring independent control of the Young’s modulus and toughness in preparing tissue phantoms, and suggest that sodium alginate should be added to low-modulus gelatin for use in biological and medical testing applications.
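The contrast between the integer-order and fractional models above comes down to the assumed form of the relaxation modulus: the Generalized Maxwell model is a sum of exponentials (a Prony series, one parameter pair per branch), while fractional elements produce power-law relaxation from far fewer parameters. The sketch below shows the two standard forms; the parameter values in the test are hypothetical, not fitted values from the paper.

```python
import math

# Assumed standard relaxation-modulus forms (step-strain response):
#   Generalized Maxwell (Prony series): G(t) = G_inf + sum_i G_i * exp(-t / tau_i)
#   Springpot (basic fractional element): G(t) = E * (t / tau)**(-alpha) / Gamma(1 - alpha)
# The springpot reduces to a spring at alpha = 0 and a dashpot-like
# element as alpha -> 1, which is why fractional models (KVFD, FM)
# can cover broad relaxation spectra with few parameters.

def gm_relaxation(t, g_inf, branches):
    """branches: list of (G_i, tau_i) pairs for each Maxwell arm."""
    return g_inf + sum(g * math.exp(-t / tau) for g, tau in branches)

def springpot_relaxation(t, e, tau, alpha):
    """Power-law relaxation of a single fractional element, 0 <= alpha < 1."""
    return e * (t / tau) ** (-alpha) / math.gamma(1.0 - alpha)
```

Fitting either form to ramp-hold data is then an ordinary nonlinear least-squares problem; the abstract's observation is that the power-law forms matched the gel data with fewer free parameters than the exponential sum.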
Computational predictions of the thermodynamic properties of molecules and materials play a central role in contemporary reaction prediction and kinetic modeling. Due to the lack of experimental data and computational cost of high-level quantum chemistry methods, approximate methods based on additivity schemes and more recently machine learning are currently the only approaches capable of supplying the chemical coverage and throughput necessary for such applications. For both approaches, ring-containing molecules pose a challenge to transferability due to the nonlocal interactions associated with conjugation and strain that significantly impact thermodynamic properties. Here, we report the development of a self-consistent approach for parameterizing transferable ring corrections based on high-level quantum chemistry. The method is benchmarked against both the Pedley–Naylor–Kline experimental dataset for C-, H-, O-, N-, S-, and halogen-containing cyclic molecules and a dataset of Gaussian-4 quantum chemistry calculations. The prescribed approach is demonstrated to be superior to existing ring corrections while maintaining extensibility to arbitrary chemistries. We have also compared this ring-correction scheme against a novel machine learning approach and demonstrate that the latter is capable of exceeding the performance of physics-based ring corrections.
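The additivity-scheme idea referenced above can be illustrated with a minimal group-additivity sketch: a molecular property is estimated as a sum of tabulated group contributions, and ring-containing molecules receive an extra transferable ring correction to capture the nonlocal strain the groups miss. All numerical values and group labels below are placeholders for illustration, not parameters from the paper.

```python
# Hypothetical Benson-style group-additivity sketch. The group value and
# ring correction below are illustrative placeholders (kJ/mol), not
# fitted parameters from the reported scheme.
GROUP_VALUES = {"C-(C)2(H)2": -20.6}        # contribution per CH2 group
RING_CORRECTIONS = {"cyclopropane": 115.0}  # strain correction per ring

def enthalpy_of_formation(groups, rings=()):
    """Sum group contributions, then add a correction per ring present."""
    h = sum(GROUP_VALUES[g] for g in groups)
    h += sum(RING_CORRECTIONS[r] for r in rings)
    return h
```

The self-consistent parameterization described in the abstract amounts to choosing the `RING_CORRECTIONS` entries so that predictions for cyclic molecules match high-level quantum chemistry, while the acyclic group values are left untouched.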
Generative models are a sub-class of machine learning models that are capable of generating new samples with a target set of properties. In chemical and materials applications, these new samples might be drug targets, novel semiconductors, or catalysts constrained to exhibit an application-specific set of properties. Given their potential to yield high-value targets from otherwise intractable design spaces, generative models are currently under intense study with respect to how predictions can be improved through changes in model architecture and data representation. Here we explore the potential of multi-task transfer learning as a complementary approach to improving the validity and property specificity of molecules generated by such models. We have compared baseline generative models trained on a single property prediction task against models trained on additional ancillary prediction tasks and observe a generic positive impact on the validity and specificity of the multi-task models. In particular, we observe that the validity of generated structures is strongly affected by whether or not the models have chemical property data, as opposed to only syntactic structural data, supplied during learning. We demonstrate this effect in both interpolative and extrapolative scenarios (the latter being where the generative targets are poorly represented in training data) for models trained to generate high energy structures and models trained to generate structures with bandgaps in targeted ranges. In both instances, the inclusion of additional chemical property data improves the ability of models to generate valid, unique structures with increased property specificity. This approach requires only minor alterations to existing generative models, in many cases leveraging prediction frameworks already native to these models. Additionally, the transfer learning strategy is complementary to ongoing efforts to improve model architectures and data representation and can foreseeably be stacked on top of these developments.
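The validity and uniqueness metrics used to compare the baseline and multi-task generators can be computed as simple fractions over a batch of generated structures. This sketch assumes a caller-supplied validity check (in practice a chemistry parser such as an RDKit SMILES sanity check would play that role); the function name is illustrative.

```python
# Sketch of the evaluation metrics discussed above: the fraction of
# generated structures that are chemically valid, and the fraction of
# the valid ones that are unique. `is_valid` is a placeholder predicate
# standing in for a real structure parser.
def generation_metrics(samples, is_valid):
    valid = [s for s in samples if is_valid(s)]
    validity = len(valid) / len(samples)
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness
```

The abstract's claim is then that adding ancillary property-prediction tasks to the generator's training raises both fractions, especially in the extrapolative setting.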