Ten quick tips for sequence-based prediction of protein properties using machine learning

Hou, Qingzhen; Waury, Katharina; Gogishvili, Dea; Feenstra, K. Anton

doi:10.1371/journal.pcbi.1010669

Cited by 13 publications

(4 citation statements)

References 54 publications


“…To see how the performances of each of StrucTFactor and DeepTFactor vary with different sequence redundancy, we compare them on datasets with sequence non-redundant (pairwise sequence similarity of < 30%) proteins, i.e., D1, D2, or D3, as well as on datasets with sequence redundant (pairwise sequence similarity of ≥ 30%) proteins, i.e., D4, D5, or D6. In general, because high sequence similarities in a dataset can artificially boost the performances of trained machine learning models (Hou et al, 2022), we expect to see an increase in the performance for both StrucTFactor and DeepTFactor with an increase in sequence redundancy, which is exactly what we find (Table 1). For example, with respect to MCC, the performance for DeepTFactor is ∼60% for D1 (a sequence non-redundant dataset), while it is ∼96.5% for the corresponding sequence redundant dataset D4.…”

Section: Results (supporting)

confidence: 76%

Transcription factor prediction using protein 3D secondary structures

Liebold, Neuhaus, Geiser, et al. 2024. Preprint


Motivation: Transcription factors (TFs) are DNA-binding proteins that regulate expressions of genes in an organism. Hence, it is important to identify novel TFs. Traditionally, novel TFs have been identified by their sequence similarity to the DNA-binding domains (DBDs) of known TFs. However, this approach can miss to identify a novel TF that is not sequence similar to any of the known DBDs. Hence, computational methods have been developed for the TF prediction task that, instead of relying on known DBDs, use sequence features of proteins to train a machine learning model, in order to capture sequence patterns that distinguish TFs from other proteins. Because 3-dimensional (3D) structure of a protein captures more information than its sequence, using 3D protein structures can more correctly predict novel TFs. Results: We propose the first deep learning-based TF prediction method (named StrucTFactor) based on 3D protein structures. We compare StrucTFactor with a recent state-of-the-art TF prediction method that relies only on protein sequences. We evaluate the considered methods on ~550,000 proteins across 12 datasets, capturing different aspects of data bias (including sequence redundancy and 3D protein structural quality) that can influence a method's performance. We find that StrucTFactor significantly (p-value < 0.001) outperforms the existing state-of-the-art TF prediction method, improving performance by up to 23% based on Matthews correlation coefficient. Our results show the importance of using 3D protein structures to predict novel TFs. We provide StrucTFactor as a computational pipeline.
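The citation statement and abstract above both report performance as the Matthews correlation coefficient (MCC). As a reminder of what that metric measures, here is a minimal sketch that computes MCC from confusion-matrix counts. The function name `mcc` and the example counts are hypothetical and are not taken from the cited paper; returning 0.0 when the denominator is zero is a common convention, assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC (an empty row or column
        # in the confusion matrix) is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"MCC = {mcc(tp=90, tn=85, fp=15, fn=10):.3f}")
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred for imbalanced classification tasks such as TF prediction.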



“…In the absence of such studies, in silico PPI analyses are difficult to reconcile, the development of new models is inefficient, follow-up mechanisms studies are likely undermined and, ultimately, there are different versions of the underlying molecular networks that describe protein function. A range of publications have investigated best practices for machine learning in biology (Chicco 2017, Greener et al 2022, Hou et al 2022, Lee et al 2022) and highlighted that replicable, trustworthy, and generalizable high-performing models can capture more causal biology and enhance many aspects of biological research such as experimental designs and drug development.…”

Section: Introduction (mentioning)

confidence: 99%

Pitfalls of machine learning models for protein–protein interaction networks

Lannelongue, Inouye. 2024. Bioinformatics


Motivation: Protein-protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools, based on classic machine learning, have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare and discrepancies between algorithms that remain unexplained. Results: To better understand the underlying inference mechanisms that underpin these models, we designed an open-source framework for benchmarking that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use it to shed light on the impact of network topology and how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models on human PPIs, we show their complementarity as the former performs best on lone proteins while the latter specialises in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomic data. We replicate our results between both human and S. cerevisiae data and demonstrate that models using functional genomics are better suited to PPI prediction across species. With rapidly increasing amounts of sequence and functional genomics data, our study provides a principled foundation for future construction, comparison and application of PPI networks. Availability: The code and data are available on GitHub: https://github.com/Llannelongue/B4PPI Contact: LL (LL582@medschl.cam.ac.uk) and MI (mi336@medschl.cam.ac.uk; minouye@baker.edu.au). Supplementary information: Supplementary data are available at Bioinformatics online.


“…Deep learning-based methods are effective for classification problems where such knowledge is not available in advance. Transformer, a deep-learning-based method that embeds natural language into vectors, was recently adopted for various classification problems in biology and has shown promising performances (Vaswani et al 2017, Hou et al 2022). Such language model-based tools include ProtTrans and ESM-2, which represent amino-acid sequences by vectors that can be used as inputs for various machine-learning methods (Elnaggar et al 2022, Lin et al 2023).…”

Section: Introduction (mentioning)

confidence: 99%

Seq2Phase: language model-based accurate prediction of client proteins in liquid–liquid phase separation

Miyata, Iwasaki. 2023. Bioinformatics Advances


Motivation: Liquid–liquid phase separation (LLPS) enables compartmentalization in cells without biological membranes. LLPS plays essential roles in membraneless organelles such as nucleoli and p-bodies, helps regulate cellular physiology, and is linked to amyloid formation. Two types of proteins, scaffolds and clients, are involved in LLPS. However, computational methods for predicting LLPS client proteins from amino-acid sequences remain underdeveloped. Results: Here, we present Seq2Phase, an accurate predictor of LLPS client proteins. Information-rich features are extracted from amino-acid sequences by a deep-learning technique, Transformer, and fed into supervised machine learning. Predicted client proteins contained known LLPS regulators and showed localization enrichment into membraneless organelles, confirming the validity of the prediction. Feature analysis revealed that scaffolds and clients have different sequence properties and that textbook knowledge of LLPS-related proteins is biased and incomplete. Seq2Phase achieved high accuracies across human, mouse, yeast, and plant, showing that the method is not overfitted to specific species and has broad applicability. We predict that more than hundreds or thousands of LLPS client proteins remain undiscovered in each species and that Seq2Phase will advance our understanding of still enigmatic molecular and physiological bases of LLPS as well as its roles in disease. Availability: The software codes in Python underlying this article are available at https://github.com/IwasakiLab/Seq2Phase. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
