We present three deep learning sequence-based prediction models for peptide properties (hemolysis, solubility, and resistance to nonspecific interactions) that achieve results comparable to state-of-the-art models. Our sequence-based solubility predictor, MahLooL, outperforms the current state-of-the-art methods for short peptides. These models are implemented as a static website without the use of a dedicated server or cloud computing. Web-based models like this make results accessible and easy to reproduce. Most existing approaches rely on third-party servers that typically require upkeep and maintenance. Our predictive models do not require servers, require no installation of dependencies, and work across a range of devices. All three models use bidirectional recurrent neural networks. This serverless approach is a demonstration of edge machine learning that removes the dependence on cloud providers. The code and models are accessible at https://github.com/urwhitelab/peptide-dashboard.
We present three deep learning sequence prediction models for hemolysis, solubility, and resistance to non-specific interactions of peptides that achieve results comparable to state-of-the-art models. These predictive models share a common architecture of bidirectional recurrent neural networks (LSTM). The models are implemented in JavaScript so that they can run on a static website without a dedicated server. This removes the cost and long-term management of a server while still enabling open and free access to the models. This "serverless" prediction model is a demonstration of edge computing in bioinformatics and removes the dependence on cloud providers or on self-hosting by resource-rich academic institutions. This is feasible because of the continued progress of Moore's law and the ubiquitous hardware acceleration of deep learning computations on new phones and desktops.
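To make the "bidirectional" idea concrete, here is a minimal sketch in Python rather than the JavaScript the abstract mentions. It is not the authors' model. It assumes a simple integer encoding of the 20 standard amino acids, uses a plain tanh recurrent cell in place of an LSTM, and has random, untrained weights. All names (`encode`, `make_cell`, `run_cell`, `predict`) are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(peptide):
    """Map a peptide string to a list of integer residue indices."""
    return [INDEX[aa] for aa in peptide.upper()]


def make_cell(hidden, rng):
    """Random (untrained) weights for one tanh recurrent cell."""
    return {
        "embed": [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                  for _ in AMINO_ACIDS],
        "recur": [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                  for _ in range(hidden)],
    }


def run_cell(cell, tokens):
    """Scan the tokens in the given order and return the final hidden state."""
    hidden = len(cell["recur"])
    h = [0.0] * hidden
    for t in tokens:
        h = [
            math.tanh(cell["embed"][t][j]
                      + sum(cell["recur"][i][j] * h[i] for i in range(hidden)))
            for j in range(hidden)
        ]
    return h


def predict(peptide, fwd, bwd, out_weights):
    """Read the sequence in both directions, concatenate, and squash to (0, 1)."""
    tokens = encode(peptide)
    state = run_cell(fwd, tokens) + run_cell(bwd, tokens[::-1])
    logit = sum(w * s for w, s in zip(out_weights, state))
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative setup with a fixed seed; the weights carry no learned meaning.
rng = random.Random(0)
fwd = make_cell(4, rng)
bwd = make_cell(4, rng)
out_weights = [rng.uniform(-1.0, 1.0) for _ in range(8)]
score = predict("GLFDIVKKV", fwd, bwd, out_weights)
```

The only point of the sketch is the data flow: one cell reads the peptide from the N-terminus, the other from the C-terminus, and the classifier sees both final states. A trained model would learn the weights from labeled peptides.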
Computational fluid dynamics (CFD) analysis is widely used in chemical engineering. Although CFD calculations are accurate, the computational cost associated with complex systems makes it difficult to obtain empirical equations between system variables. Here, we combine active learning (AL) and symbolic regression (SR) to get a symbolic equation for system variables from CFD simulations. Gaussian process regression-based AL automates the choice of simulation inputs by selecting the most informative points from the available range of possible parameters. The results from these simulations are then passed to SR to find empirical symbolic equations for CFD models. This approach is scalable and applicable to any desired number of CFD design parameters. To demonstrate its effectiveness, we use this method with two model systems. We recover an empirical equation for the pressure drop in a bent pipe and a new equation for predicting backflow in a heart valve under aortic insufficiency.
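The active learning step can be sketched as follows, under several assumptions that the abstract does not state: a single scalar design parameter, a squared-exponential kernel with a fixed length scale, and "largest predictive variance" as the rule for picking the next point. `toy_simulation` is a hypothetical stand-in for an expensive CFD run, and the symbolic regression stage is not shown.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel for scalar inputs (length scale is assumed)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def posterior_variance(x_train, x_new, noise=1e-6):
    """GP predictive variance at x_new; with fixed hyperparameters it
    depends only on where we have sampled, not on the observed values."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(a, x_new) for a in x_train]
    v = solve(K, k_star)
    return rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))


def active_learning(simulate, candidates, n_queries=4):
    """Start from the two end points, then repeatedly simulate the
    candidate where the GP is least certain."""
    x_train = [candidates[0], candidates[-1]]
    y_train = [simulate(x) for x in x_train]
    for _ in range(n_queries):
        pool = [c for c in candidates if c not in x_train]
        x_next = max(pool, key=lambda c: posterior_variance(x_train, c))
        x_train.append(x_next)
        y_train.append(simulate(x_next))
    return x_train, y_train


def toy_simulation(x):
    """Hypothetical stand-in for one CFD run."""
    return x ** 2


xs, ys = active_learning(toy_simulation, [i / 20 for i in range(21)])
```

The resulting `(xs, ys)` pairs are what would be handed to a symbolic regression tool. A fuller version would also refit the kernel hyperparameters to the observed values after each query, so the choice of points would then depend on the simulation outputs as well.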
In this work, we investigate the question: do code-generating large language models know chemistry? Our results indicate: mostly yes. To evaluate this, we produce a benchmark set of problems and evaluate these models on the correctness of their code, using both automated testing and evaluation by experts. We find that recent LLMs are able to write correct code across a variety of topics in chemistry and that their accuracy can be increased by 30 percentage points via prompt engineering strategies, like putting copyright notices at the top of files. The dataset and evaluation tools are open source, so future researchers can contribute to or build upon them, and they will serve as a community resource for evaluating the performance of new models as they emerge. We also describe some good practices for employing LLMs in chemistry. The general success of these models demonstrates that their impact on chemistry teaching and research is poised to be enormous.
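A minimal sketch of what automated testing of generated code can look like, assuming each benchmark problem pairs a prompt with a unit test. This is not the authors' evaluation tool. The header text, the example task, and all function names are hypothetical, and the two "completions" are hard-coded strings standing in for model output.

```python
# Hypothetical header text; the abstract only says that copyright
# notices at the top of files can raise accuracy.
COPYRIGHT_HEADER = "# Copyright (c) Example Lab. MIT License.\n"


def build_prompt(task_prompt, add_header=False):
    """Optionally prepend a copyright notice to the task prompt."""
    return (COPYRIGHT_HEADER if add_header else "") + task_prompt


def passes(completion, test_code):
    """Run a completion and its unit test in a fresh namespace.
    Returns True only if neither raises an exception."""
    namespace = {}
    try:
        exec(completion, namespace)
        exec(test_code, namespace)
    except Exception:
        return False
    return True


def accuracy(completions, test_code):
    """Fraction of completions that pass the unit test."""
    if not completions:
        return 0.0
    return sum(passes(c, test_code) for c in completions) / len(completions)


# Hypothetical task: molar mass of water (g/mol) from atomic masses.
TEST = "assert abs(water_molar_mass() - 18.015) < 0.01"
GOOD = "def water_molar_mass():\n    return 2 * 1.008 + 15.999\n"
BAD = "def water_molar_mass():\n    return 1.008 + 15.999\n"

rate = accuracy([GOOD, BAD], TEST)
```

Because `exec` runs arbitrary code, a real harness would execute completions in a sandboxed process with a timeout. Comparing `accuracy` with and without `add_header=True` in the prompt is one way to measure the effect of a prompt engineering strategy.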