2022
DOI: 10.1371/journal.pcbi.1010669

Ten quick tips for sequence-based prediction of protein properties using machine learning

Abstract: The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine…

Cited by 13 publications (4 citation statements)
References 54 publications
“…To see how the performance of each of StrucTFactor and DeepTFactor varies with sequence redundancy, we compare them on datasets with sequence non-redundant (pairwise sequence similarity < 30%) proteins, i.e., D1, D2, or D3, as well as on datasets with sequence redundant (pairwise sequence similarity ≥ 30%) proteins, i.e., D4, D5, or D6. In general, because high sequence similarity in a dataset can artificially boost the performance of trained machine learning models (Hou et al, 2022), we expect an increase in performance for both StrucTFactor and DeepTFactor with increasing sequence redundancy, which is exactly what we find (Table 1). For example, with respect to MCC, the performance of DeepTFactor is ∼60% for D1 (a sequence non-redundant dataset), while it is ∼96.5% for the corresponding sequence redundant dataset D4.…”
Section: Results (supporting)
confidence: 76%
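The <30% similarity filtering described in this excerpt is typically done by greedy redundancy reduction. A minimal Python sketch of the idea, under the simplifying assumption of equal-length, ungapped sequences (real pipelines use aligned identity via tools such as CD-HIT or MMseqs2):

```python
def identity(a, b):
    """Crude ungapped pairwise identity: fraction of matching positions.
    A stand-in for proper alignment-based identity."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_nonredundant(seqs, threshold=0.30):
    """Keep a sequence only if it is < threshold identical to every
    representative kept so far -- the kind of redundancy reduction that
    separates datasets like D1-D3 (<30%) from D4-D6 (>=30%)."""
    reps = []
    for s in seqs:
        if all(identity(s, r) < threshold for r in reps):
            reps.append(s)
    return reps

# Toy example: the second sequence is 75% identical to the first,
# so it is dropped at a 30% threshold.
print(greedy_nonredundant(["AAAA", "AAAT", "TTTT"]))  # ['AAAA', 'TTTT']
```

Evaluating on a set reduced this way prevents near-duplicates of training sequences from leaking into the test set and inflating metrics such as MCC.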
“…In the absence of such studies, in silico PPI analyses are difficult to reconcile, the development of new models is inefficient, follow-up mechanistic studies are likely undermined and, ultimately, there are different versions of the underlying molecular networks that describe protein function. A range of publications have investigated best practices for machine learning in biology (Chicco 2017, Greener et al 2022, Hou et al 2022, Lee et al 2022) and highlighted that replicable, trustworthy, and generalizable high-performing models can capture more causal biology and enhance many aspects of biological research such as experimental design and drug development.…”
Section: Introduction (mentioning)
confidence: 99%
“…Deep learning-based methods are effective for classification problems where such knowledge is not available in advance. The Transformer, a deep learning architecture that embeds natural language into vectors, was recently adopted for various classification problems in biology and has shown promising performance (Vaswani et al 2017, Hou et al 2022). Such language model-based tools include ProtTrans and ESM-2, which represent amino acid sequences by vectors that can be used as inputs for various machine learning methods (Elnaggar et al 2022, Lin et al 2023).…”
Section: Introduction (mentioning)
confidence: 99%
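The workflow this excerpt describes — embed each sequence as a fixed-length vector, then feed the vectors to a downstream classifier — can be sketched without the heavy models. In this illustration, `embed` is a hypothetical stand-in (amino acid group composition) for a learned embedding from a model such as ProtTrans or ESM-2, and the classifier is a deliberately minimal nearest-centroid rule:

```python
from collections import defaultdict

def embed(seq):
    """Toy stand-in for a protein language model embedding:
    fraction of hydrophobic / polar / charged residues.
    Real embeddings are learned, high-dimensional vectors."""
    groups = ("AVLIMFWY", "STNQCG", "DEKRH")
    n = max(len(seq), 1)
    return [sum(seq.count(a) for a in g) / n for g in groups]

def train_centroids(examples):
    """Mean embedding per class -- a minimal downstream classifier."""
    sums = defaultdict(lambda: [0.0] * 3)
    counts = defaultdict(int)
    for seq, label in examples:
        for i, v in enumerate(embed(seq)):
            sums[label][i] += v
        counts[label] += 1
    return {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}

def predict(centroids, seq):
    """Assign the class whose centroid is nearest in embedding space."""
    q = embed(seq)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(q, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Hypothetical labels for illustration only.
model = train_centroids([("AVLIAVLI", "membrane"), ("DEKRDEKR", "soluble")])
print(predict(model, "AVAVLLII"))  # membrane
```

The point is the interface, not the model: any embedding that maps a sequence to a fixed-length vector can be swapped in, and any standard classifier can replace the centroid rule.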