BackgroundRecombinant protein production is a useful biotechnology to produce a large quantity of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial-and-error is still the common practice to find out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression.ResultsIn this study, we applied machine learning to train prediction models to predict whether a pairing of vector-protein will express or not express in E. coli. For expressed cases, the models further predict whether the expressed proteins would be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on the support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as the features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slight worse but comparably to the best model. Meanwhile, compared to previous works, we show that the prediction accuracy of our best method still performs the best. Lastly, we briefly present two methods to use the trained model in the design of the recombinant protein production systems to improve the chance of high soluble protein production.ConclusionIn this paper, we show that a machine learning approach to the prediction of the efficacy of a vector for a target protein in a recombinant protein production system is promising and may compliment traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.
BackgroundUn-MAppable Reads Solution (UMARS) is a user-friendly web service focusing on retrieving valuable information from sequence reads that cannot be mapped back to reference genomes. Recently, next-generation sequencing (NGS) technology has emerged as a powerful tool for generating high-throughput sequencing data and has been applied to many kinds of biological research. In a typical analysis, adaptor-trimmed NGS reads were first mapped back to reference sequences, including genomes or transcripts. However, a fraction of NGS reads failed to be mapped back to the reference sequences. Such un-mappable reads are usually imputed to sequencing errors and discarded without further consideration.MethodsWe are investigating possible biological relevance and possible sources of un-mappable reads. Therefore, we developed UMARS to scan for virus genomic fragments or exon-exon junctions of novel alternative splicing isoforms from un-mappable reads. For mapping un-mappable reads, we first collected viral genomes and sequences of exon-exon junctions. Then, we constructed UMARS pipeline as an automatic alignment interface.ResultsBy demonstrating the results of two UMARS alignment cases, we show the applicability of UMARS. We first showed that the expected EBV genomic fragments can be detected by UMARS. Second, we also detected exon-exon junctions from un-mappable reads. Further experimental validation also ensured the authenticity of the UMARS pipeline. The UMARS service is freely available to the academic community and can be accessed via http://musk.ibms.sinica.edu.tw/UMARS/.ConclusionsIn this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.