We use prompt engineering to guide ChatGPT in automating the
text mining of metal–organic framework (MOF) synthesis conditions
from diverse formats and styles of the scientific literature. This
effectively mitigates ChatGPT’s tendency to hallucinate information,
an issue that previously made the use of large language models (LLMs)
in scientific fields challenging. Our approach involves the development
of a workflow implementing three different processes for text mining,
programmed by ChatGPT itself. All of them enable parsing, searching,
filtering, classification, summarization, and data unification with
different trade-offs among labor, speed, and accuracy. We deploy this
system to extract 26 257 distinct synthesis parameters pertaining
to approximately 800 MOFs sourced from peer-reviewed research articles.
This process incorporates our ChemPrompt Engineering strategy to instruct
ChatGPT in text mining, resulting in impressive precision, recall,
and F1 scores of 90–99%. Furthermore, with the data set built
by text mining, we constructed a machine-learning model that predicts
MOF experimental crystallization outcomes with over 87% accuracy
and gives a preliminary indication of the important factors in MOF crystallization.
We also developed a reliable data-grounded MOF chatbot to answer questions
about chemical reactions and synthesis procedures. Given that the
process of using ChatGPT reliably mines and tabulates diverse MOF
synthesis information in a unified format while using only narrative
language requiring no coding expertise, we anticipate that our ChatGPT
Chemistry Assistant will be very useful across various other chemistry
subdisciplines.
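
The precision, recall, and F1 scores quoted above can be made concrete with a short sketch. It assumes that each extracted synthesis parameter is compared against a hand-annotated reference by exact match; that criterion, the function name `precision_recall_f1`, and the example records are all hypothetical and are not taken from the paper's evaluation protocol.

```python
def precision_recall_f1(extracted, reference):
    """Score extracted items against a reference set by exact match (an assumption)."""
    extracted, reference = set(extracted), set(reference)
    true_pos = len(extracted & reference)    # extracted and correct
    false_pos = len(extracted - reference)   # extracted but not in the reference
    false_neg = len(reference - extracted)   # in the reference but missed

    precision = true_pos / (true_pos + false_pos) if extracted else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (compound, parameter, value) records, for illustration only.
reference = {
    ("MOF-A", "temperature", "120 C"),
    ("MOF-A", "solvent", "DMF"),
    ("MOF-A", "time", "24 h"),
    ("MOF-B", "temperature", "85 C"),
}
extracted = {
    ("MOF-A", "temperature", "120 C"),
    ("MOF-A", "solvent", "DMF"),
    ("MOF-A", "time", "48 h"),   # wrong value: counts as one miss and one spurious hit
    ("MOF-B", "temperature", "85 C"),
}

precision, recall, f1 = precision_recall_f1(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice; a real evaluation would likely need to normalize units and synonyms before comparing records.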
We report a new deep learning message passing network that takes inspiration from Newton's equations of motion to learn interatomic potentials and forces. With the advantage of directional information from...
The power of structural information for informing biological mechanisms
is clear for stable folded macromolecules, but similar structure–function
insight is more difficult to obtain for highly dynamic systems such
as intrinsically disordered proteins (IDPs), which must be described
as structural ensembles. Here, we present IDPConformerGenerator, a
flexible, modular open-source software platform that generates large
and diverse ensembles of disordered protein states by building conformers
that obey geometric, steric, and other physical restraints on the
input sequence. IDPConformerGenerator samples backbone phi (φ),
psi (ψ), and omega (ω) torsion angles of relevant sequence
fragments from loops and secondary structure elements extracted from
folded protein structures in the RCSB Protein Data Bank and builds
side chains from robust Monte Carlo algorithms using expanded rotamer
libraries. IDPConformerGenerator has many user-defined options enabling
variable fractional sampling of secondary structures, supports Bayesian
models for assessing the consistency of IDP ensembles
with experimental data, and introduces a machine learning approach
to transform between internal and Cartesian coordinates with reduced
error. IDPConformerGenerator will facilitate the characterization
of disordered proteins to ultimately provide structural insights into
these states that have key biological functions.
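
To illustrate what transforming between internal and Cartesian coordinates involves, the sketch below places one atom from a bond length, a bond angle, and a torsion angle using the classical Natural Extension Reference Frame (NeRF) construction. This is the textbook geometric method, not the machine learning approach the abstract mentions, and it is not IDPConformerGenerator's code. The function names and the numeric geometry values are illustrative assumptions.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    norm = math.sqrt(dot(u, u))
    return tuple(a / norm for a in u)


def place_atom(a, b, c, bond, angle, torsion):
    """Return d with |c-d| = bond, angle(b, c, d) = angle, dihedral(a, b, c, d) = torsion.

    Angles are in radians; a, b, c must not be collinear.
    """
    bc = unit(sub(c, b))                  # local x axis, along the b -> c bond
    n = unit(cross(sub(b, a), bc))        # normal to the a-b-c plane
    m = cross(n, bc)                      # completes the right-handed frame
    dx = -bond * math.cos(angle)
    dy = bond * math.sin(angle) * math.cos(torsion)
    dz = bond * math.sin(angle) * math.sin(torsion)
    return tuple(c[i] + dx * bc[i] + dy * m[i] + dz * n[i] for i in range(3))


def dihedral(a, b, c, d):
    """Signed dihedral angle a-b-c-d in radians (inverse check for place_atom)."""
    b1, b2, b3 = sub(b, a), sub(c, b), sub(d, c)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.atan2(dot(cross(n1, n2), unit(b2)), dot(n1, n2))


# Illustrative values only: three arbitrary atoms, then a fourth placed with
# a 1.33 Å bond, a 116.2° bond angle, and a -60° torsion.
a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0)
d = place_atom(a, b, c, 1.33, math.radians(116.2), math.radians(-60.0))
print(d, math.degrees(dihedral(a, b, c, d)))
```

Applying `place_atom` repeatedly along a chain rebuilds Cartesian coordinates from sampled torsions. Floating-point error accumulates over long chains, which may be one reason a lower-error transformation is of interest.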