Knowledge mining from synthetic biology journal articles for machine learning (ML) applications is a labor-intensive process. Natural language processing (NLP) tools, such as GPT-4, can greatly accelerate the extraction of data for ML models that predict microbial performance under complex strain engineering and bioreactor conditions. As a proof of concept, GPT-4 was used to extract knowledge from 176 publications, resulting in 2037 data instances that were uploaded to a crowdsourcing online database. The centralized datasets and feature selection enabled a random forest model to predict fermentation titers of an industrially important yeast (Yarrowia lipolytica) with high accuracy (R² of 0.86 on unseen test data). Via transfer learning, the trained model could assess the production capability of nonconventional yeasts (e.g., Rhodosporidium toruloides). This work shows the potential of generative AI to automate information extraction from research articles, and of advanced AI applications to facilitate design-build-test-learn (DBTL) cycles for biomanufacturing as well as commercial decisions in biotechnology.