Background: Since their discovery, recombinant proteins have become increasingly important in many areas of life science. The global pharmaceutical market was valued at $87 billion in 2008, and sales of industrial enzymes exceeded $4 billion in 2012, which is strong evidence of the potential of recombinant proteins. However, a native gene introduced into a host can be incompatible with the host's expression system in its codon usage bias, GC content, repeat regions, and Shine-Dalgarno sequence, so yields can drop significantly. Hence, we propose novel methods for gene optimization based on a neural network, Bayesian theory, and Euclidean distance.

Results: The correlation coefficients of our neural network are 0.86, 0.73, and 0.90 for training, validation, and testing, respectively. In addition, genes optimized by our methods appear to be associated with highly expressed genes and give reasonable codon adaptation index values. Furthermore, genes optimized by the proposed methods closely match previous experimental data.

Conclusion: The proposed methods have high potential for gene optimization and for further research in gene expression. We built a demonstration program using Matlab R2014a under Mac OS X. The program is published both as a standalone executable and as Matlab function files, and can be accessed from http://www.math.hcmus.edu.vn/~ptbao/paper_soft/GeneOptProg/.
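The codon adaptation index used above to assess optimized genes can be illustrated with a minimal sketch. It assumes the commonly used definition (the geometric mean of relative adaptiveness weights, where each weight is a codon's count in a reference set of highly expressed genes divided by the count of the most frequent synonymous codon) and is not the authors' Matlab implementation. The function names, the 0.5 pseudocount for unseen codons, and the toy reference sequences are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon relative to its most used synonym in the reference set."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TABLE)
    weights = {}
    for aa in set(AMINO):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid never seen in the reference set
        for c in family:
            # Assumption: unseen codons get a pseudocount of 0.5 to avoid log(0).
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of the codons in seq that have a weight."""
    logs = [math.log(weights[c]) for c in codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Toy reference set (hypothetical, not real highly expressed genes).
reference = ["GCTGCTGCTGCC", "AAAAAAAAG"]
w = relative_adaptiveness(reference)
print(cai("GCTAAA", w), cai("GCCAAG", w))
```

A gene that uses only the most frequent codon of each family scores 1, and the score falls as rarer synonymous codons are used, which is why the index is a convenient sanity check on an optimized sequence.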
Species identification is beneficial for many aspects of life and scientific research, but experimental methods based on biochemistry can be subjective and inaccurate in some cases. To address this problem, searching for genes in a database is one of the most effective and accurate methods for identifying Bacillus. However, when the database is incomplete, the search algorithm cannot identify genes that are not in the database. Thus, in this research, we propose a novel feature for identifying Bacillus based on codon usage bias, called relative synonymous codon pair usage (RSCPU). We extracted this feature from genes collected from the National Center for Biotechnology Information (NCBI) website; then K-means clustering and a Support Vector Machine were applied to classify the resulting gene feature vectors. Finally, we used this method for Bacillus identification and obtained an accuracy about 3 times (2.93 times) higher than that of previous research [1].
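The abstract does not give the formula for RSCPU, so the following is only a sketch under a stated assumption: by analogy with relative synonymous codon usage, the value for a codon pair is taken to be its observed count divided by the mean count over all synonymous codon pairs encoding the same amino acid pair. The function names and the toy sequence are hypothetical, and the clustering and classification steps are not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
FAMILY_SIZE = Counter(CODON_TABLE.values())  # synonymous codons per amino acid


def codon_pairs(seq):
    """Adjacent in-frame codon pairs, skipping unknown codons and stop codons."""
    seq = seq.upper().replace("U", "T")
    cods = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    return [
        (a, b)
        for a, b in zip(cods, cods[1:])
        if CODON_TABLE.get(a, "*") != "*" and CODON_TABLE.get(b, "*") != "*"
    ]


def rscpu(seq):
    """Assumed RSCPU: pair count divided by the mean count of its synonymous pairs."""
    counts = Counter(codon_pairs(seq))
    totals = Counter()  # total count per amino acid pair
    for (a, b), n in counts.items():
        totals[(CODON_TABLE[a], CODON_TABLE[b])] += n
    values = {}
    for (a, b), n in counts.items():
        aa_pair = (CODON_TABLE[a], CODON_TABLE[b])
        n_synonymous = FAMILY_SIZE[aa_pair[0]] * FAMILY_SIZE[aa_pair[1]]
        values[(a, b)] = n * n_synonymous / totals[aa_pair]
    return values  # pairs absent from seq are implicitly 0


# Toy sequence (hypothetical): Lys-Asp-Lys-Asp-Lys-Asp.
features = rscpu("AAAGATAAAGATAAAGAC")
print(features)
```

To feed such values to K-means or a Support Vector Machine, each gene would need a fixed-length vector, for example one entry per possible sense codon pair in a fixed order with zeros for unobserved pairs.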