2021
DOI: 10.1093/bioinformatics/btab801
NetSolP: predicting protein solubility in Escherichia coli using language models

Abstract: Motivation: Solubility and expression levels of proteins can be a limiting factor for large-scale studies and industrial production. By determining the solubility and expression directly from the protein sequence, the success rate of wet-lab experiments can be increased. Results: In this study, we focus on predicting the solubility and usability for purification of proteins expressed in Escherichia coli directly from the sequen…

Cited by 33 publications (36 citation statements)
References 39 publications
“…Self-supervised training endows the latent variables of the model with highly informative features, known as learned representations, which can then be leveraged in downstream tasks where limited training data is available. Learned protein representations are currently central to the state-of-the-art tools for predicting variant fitness [3–6], protein function [7,8], subcellular localisation [9], solubility [10], binding sites [11], signal peptides [12], post-translational modifications [13], intrinsic disorder [14], and others [15,16], and they have shown promise in the path towards accurate alignment-free protein structure prediction [17–21]. Improving learned representations is therefore a potential path to delivering consistent, substantial improvements across computational protein engineering.…”
Section: Introduction
confidence: 99%
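The pattern this excerpt describes (frozen learned representations fed into a small downstream model) can be sketched in a few lines. The per-residue embedding below is a deterministic stand-in for a pretrained protein language model, not the actual model used by NetSolP, and the nearest-centroid classifier is a hypothetical downstream solubility predictor trained on a handful of labeled sequences:

```python
import random

EMB_DIM = 8  # toy embedding size; real protein LMs use hundreds of dimensions

def embed_residue(aa):
    # Stand-in for a pretrained language model: a fixed pseudo-random vector
    # per amino acid. A real pipeline would call the frozen LM here.
    random.seed(ord(aa))
    return [random.gauss(0, 1) for _ in range(EMB_DIM)]

def embed_sequence(seq):
    # Mean-pool per-residue embeddings into one fixed-length representation.
    vecs = [embed_residue(aa) for aa in seq]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def train_centroids(data):
    # data: list of (sequence, label); label 1 = soluble, 0 = insoluble.
    centroids = {}
    for label in (0, 1):
        feats = [embed_sequence(s) for s, y in data if y == label]
        centroids[label] = [sum(c) / len(feats) for c in zip(*feats)]
    return centroids

def predict(centroids, seq):
    f = embed_sequence(seq)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

# Tiny invented training set, purely for illustration.
train = [("MKKLLPTA", 1), ("MKKILPSA", 1), ("WWFYWWFY", 0), ("WFYWWYFF", 0)]
centroids = train_centroids(train)
print(predict(centroids, "MKKLLPSA"))  # resembles the soluble examples
```

The point of the sketch is the division of labor: the embedding stays frozen, and only the tiny downstream model depends on the scarce labeled data.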
“…Therefore, the strategy of taking the difference between the predicted labels for a reference protein and its variant typically fails to produce reliable predictors for mutational effects [130]. A more promising route is to use labeled mutational data sets for training. This strategy has its own limitations, since such data sets are not only scarce but also sparse in terms of the extent of the mutational landscape that is probed (the sequence space grows exponentially with the number of mutated residues) and biased toward several overrepresented proteins.…”
Section: Supervised Learning to Predict the Effects of Mutations
confidence: 99%
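The failure mode described above can be made concrete with a toy sketch. The predictor here is hypothetical: it scores a sequence by squashing the mean of made-up hydropathy-like residue values through a sigmoid, so a single substitution barely moves the pooled score and the reference-minus-variant difference stays near zero even when the true mutational effect is large:

```python
# Made-up hydropathy-like residue scores (illustrative, not a published scale).
HYDRO = {"A": 1.8, "K": -3.9, "L": 3.8, "M": 1.9, "W": -0.9}

def predicted_solubility(seq):
    # Hypothetical sequence-level predictor: sigmoid of mean residue score.
    mean_h = sum(HYDRO[aa] for aa in seq) / len(seq)
    return 1.0 / (1.0 + 2.718281828 ** (-mean_h))

ref = "MKALLAMKAL" * 5   # 50-residue reference sequence
var = ref[:-1] + "W"     # single L -> W point mutation

delta = predicted_solubility(ref) - predicted_solubility(var)
print(round(delta, 4))  # small: one residue barely shifts the pooled score
```

Because the score pools over the whole sequence, the mutation's contribution is diluted by a factor of the sequence length, which is exactly why label differences make poor mutational-effect predictors.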
“…, solubility scores), in contrast to dramatic changes often observed in experiments. Therefore, the strategy of taking the difference between the predicted labels for a reference protein and its variant typically fails to produce reliable predictors for mutational effects.…”
Section: Protein Engineering Tasks Solved by Machine Learning
confidence: 99%
“…End-to-end models based on deep learning that predict various structural features [e.g., secondary structure type and content (28–33), binding sites (34), and surfaces (35)] and properties [e.g., solubility (16,36,37), melting temperature (38), natural vibrational frequencies (39,40), and strength (41)] for given sequences have also been reported. At the same time, the inverse design of de novo proteins that meet desired structural or property features presents a more challenging task.…”
Section: Introduction
confidence: 99%
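As a minimal sketch of the end-to-end idea in this excerpt (a model that maps a raw sequence directly to a predicted property), the snippet below one-hot encodes a fixed-length sequence and applies a linear readout with a sigmoid. The weights are randomly initialized stand-ins for what a trained model would learn, and the property name is illustrative:

```python
import random

random.seed(1)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # standard 20 amino acids
SEQ_LEN = 6                        # fixed input length for this toy model

def one_hot(seq):
    # Encode a protein sequence as a flat one-hot vector
    # (padding/truncation for variable lengths is omitted for brevity).
    vec = []
    for aa in seq:
        row = [0.0] * len(ALPHABET)
        row[ALPHABET.index(aa)] = 1.0
        vec.extend(row)
    return vec

# Stand-in for trained parameters: an end-to-end model would learn these.
weights = [random.gauss(0, 0.1) for _ in range(SEQ_LEN * len(ALPHABET))]
bias = 0.0

def predict_property(seq):
    # Linear readout plus sigmoid, e.g. a solubility-like probability.
    x = one_hot(seq)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + 2.718281828 ** (-score))

print(predict_property("MKALLW"))
```

Real end-to-end models replace the linear readout with deep layers, but the input/output contract (sequence in, property out) is the same.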