Gene synthesis enables creation and modification of genetic sequences at an unprecedented pace, offering enormous potential for new biological functionality but also increasing the need for biosurveillance. In this paper, we introduce a bioinformatics technique for determining whether a gene is natural or synthetic based solely on nucleotide sequence. This technique, grounded in codon theory and machine learning, can correctly classify genes with 97.7% accuracy on a novel data set. We then classify ∼19,000 unique genes from the Addgene non-profit plasmid repository to investigate whether natural and synthetic genes have differential use in heterologous expression. Phylogenetic analysis of distance between source and expression organisms reveals that researchers are using synthesis to source genes from more genetically distant organisms, particularly for longer genes. We provide empirical evidence that gene synthesis is leading biologists to sample more broadly across the diversity of life, and we provide a foundational tool for the biosurveillance community.
Abstract: Gene synthesis allows biologists and bioengineers to create novel genetic sequences and codon-optimize transgenes for heterologous expression. Because codon choice is key to gene expression and therefore cellular outcomes, it has been argued that gene synthesis will allow researchers to source genes from organisms that would otherwise have been incompatible, opening up more distant parts of the tree of life as sources for transgenes. We test whether this hypothesis holds for academic biological research using a 10-year data set from Addgene, the non-profit plasmid repository. We observe ~19,000 unique genes deposited to Addgene and classify them by whether they are natural or synthetic using a nucleotide-only technique that we develop here. We find that synthetic genes are an increasing share of Addgene deposits. Most importantly, we find direct evidence that researchers are using gene synthesis to source genes from more genetically distant organisms, particularly for genes that are longer and thus might otherwise be particularly challenging to express. Thus, we provide the first empirical evidence that gene synthesis is leading biologists and bioengineers to sample more broadly across the rich genetic diversity of life, increasingly making that functionality available for industrial or biomedical advances.
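The abstracts state only that the classifier works from nucleotide sequence alone and is grounded in codon theory and machine learning; they do not specify its features. As a purely illustrative sketch, assuming that per-gene codon usage frequencies are one kind of nucleotide-only signal such a classifier could take as input, the snippet below turns an in-frame coding sequence into a fixed-length vector. The function name `codon_frequencies` and the 64-dimensional representation are hypothetical choices made here for illustration, not the authors' method.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every gene maps to a vector of equal length.
# (Illustrative assumption: the source does not describe the actual feature set.)
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(seq):
    """Return relative frequencies of the 64 codons in an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    # Split into in-frame triplets, dropping any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Ignore triplets containing ambiguous bases such as N.
    counts = Counter(t for t in triplets if t in CODON_SET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


if __name__ == "__main__":
    vec = codon_frequencies("ATGGCCGCCTAA")
    print(vec[CODONS.index("GCC")])
```

A vector like this could be passed to any standard classifier. Whether codon frequencies alone separate natural from synthetic genes is not something this sketch demonstrates.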