Multi-task learning to leverage partially annotated data for PPI interface prediction

Capel, Henriette; Feenstra, K. Anton; Abeln, Sanne

doi:10.1038/s41598-022-13951-2

Cited by 9 publications

(9 citation statements)

References 62 publications


“…Firstly, there isn’t always a clear reason for a head-on comparison with other methods. You may, for example, be setting out to find the added value (or not) of specific parts of your training procedure (e.g., [15]) or of the architecture (e.g., [16, 17]). First point of business will be to identify the current state of the art, which you can usually find in a recent benchmarking review.…”

Section: Introduction (mentioning)

confidence: 99%

“…We also introduced a broader benchmark set ProteinGLUE including multiple prediction tasks: secondary structure, solvent accessibility, PPI, epitopes, and hydrophobic patch prediction [16]. Many method papers will also include an update of latest developments (e.g., [15, 44]).…”

Section: Introduction (mentioning)

confidence: 99%

“…Predicting protein functional properties is still one of the most important tasks for bioinformaticians (e.g., [10][11][12][13][14][15][16][17]). Here, we collect 10 useful tips or guidelines representing best practices specifically for methods that generate predictions of protein functional structural properties using protein sequence data as input; Fig 1 illustrates several examples.…”

Section: Introduction (mentioning)

confidence: 99%


Ten quick tips for sequence-based prediction of protein properties using machine learning

et al. 2022

Self Cite


The ubiquitous availability of genome sequencing data explains the popularity of machine learning-based methods for the prediction of protein properties from their amino acid sequences. Over the years, while revising our own work, reading submitted manuscripts as well as published papers, we have noticed several recurring issues, which make some reported findings hard to understand and replicate. We suspect this may be due to biologists being unfamiliar with machine learning methodology, or conversely, machine learning experts may miss some of the knowledge needed to correctly apply their methods to proteins. Here, we aim to bridge this gap for developers of such methods. The most striking issues are linked to a lack of clarity: how were annotations of interest obtained; which benchmark metrics were used; how are positives and negatives defined. Others relate to a lack of rigor: If you sneak in structural information, your method is not sequence-based; if you compare your own model to “state-of-the-art,” take the best methods; if you want to conclude that some method is better than another, obtain a significance estimate to support this claim. These, and other issues, we will cover in detail. These points may have seemed obvious to the authors during writing; however, they are not always clear-cut to the readers. We also expect many of these tips to hold for other machine learning-based applications in biology. Therefore, many computational biologists who develop methods in this particular subject will benefit from a concise overview of what to avoid and what to do instead.



“…Generally, an MTL model can be trained by linearly combining loss functions from different tasks into a single total loss function [15]. In this way, the model can learn a shared representation for all tasks by stochastic gradient descent (SGD) with back-propagation [15, 43]. Ordinarily, assuming that there are $M$ tasks in all, the global loss function can be defined as $L_{\text{total}} = \sum_{i=1}^{M} w_i L_i$, where $L_i$ represents task-specific loss function, and $w_i$ denotes weights assigned for each $L_i$.…”

Section: Details of MTL Architecture (mentioning)

confidence: 99%

Collectively encoding protein properties enriches protein language models

Weng

2022

BMC Bioinformatics


Pre-trained natural language processing models on a large natural language corpus can naturally transfer learned knowledge to protein domains by fine-tuning specific in-domain tasks. However, few studies focused on enriching such protein language models by jointly learning protein properties from strongly-correlated protein tasks. Here we elaborately designed a multi-task learning (MTL) architecture, aiming to decipher implicit structural and evolutionary information from three sequence-level classification tasks for protein family, superfamily and fold. Considering the co-existing contextual relevance between human words and protein language, we employed BERT, pre-trained on a large natural language corpus, as our backbone to handle protein sequences. More importantly, the encoded knowledge obtained in the MTL stage can be well transferred to more fine-grained downstream tasks of TAPE. Experiments on structure- or evolution-related applications demonstrate that our approach outperforms many state-of-the-art Transformer-based protein models, especially in remote homology detection.
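The weighted multi-task loss described in the citation statement for this paper (a linear combination of M task-specific losses L_i with weights w_i) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `total_mtl_loss`, the example loss values, and the example weights are hypothetical and are not taken from the cited paper. In practice each L_i would be a differentiable tensor computed by a deep learning framework rather than a plain float.

```python
from typing import Sequence


def total_mtl_loss(task_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Return the weighted sum of per-task losses: sum_i w_i * L_i.

    Hypothetical helper for illustration; not the cited authors' code.
    """
    if len(task_losses) != len(weights):
        raise ValueError("need exactly one weight per task loss")
    return sum(w * loss for w, loss in zip(weights, task_losses))


# Made-up values for M = 3 tasks (assumed, for illustration only).
losses = [0.9, 0.4, 0.2]    # L_1, L_2, L_3
weights = [1.0, 0.5, 0.5]   # w_1, w_2, w_3

total = total_mtl_loss(losses, weights)
print(total)
```

Because the total is a single scalar, one backward pass through it would propagate gradients from every task into the shared layers, which is what lets the model learn a shared representation.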


“…Many techniques for protein structure prediction have been intensively studied in recent years, and scientists have developed an increasing number of creative models to boost prediction performance. Database annotation and sequence-based approaches are the two main approaches used in this area [2]. In order to make predictions, sequence-based attempts to extract unique features from protein sequences.…”

Section: Introduction (mentioning)

confidence: 99%

Comparative Study on Feature Selection Methods for Protein

Alkady

El-Bahnasy

Gad

2022

IJICIS


The automated and high-throughput identification of protein function is one of the main issues in computational biology. Predicting the protein's structure is a crucial step in this procedure. In recent years, a wide range of approaches for predicting protein structure has been put forth. They can be divided into two groups: database-based and sequence-based. The first is to identify the principles behind protein structure and attempts to extract valuable characteristics from amino acid sequences. The second one uses pre-existing public annotation databases for data mining. This study emphasizes the sequence-based method and makes use of the ability of amino acid sequences to predict protein activity. The amino acid composition approach, the amino acid tuple approach, and several optimization algorithms were compared. Different protein sequence data sets were used in our experiments. Five classifiers were tested in this research. The best accuracy is 98% using 10-fold cross-validation. This represents the highest performance in the Human dataset.
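As a rough illustration of the amino acid composition approach named in this abstract, the sketch below computes the fraction of each of the 20 standard amino acids in a sequence. The abstract does not define the feature, so this common definition is an assumption, and the function name `amino_acid_composition` and the example sequence are hypothetical.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in the sequence.

    Hypothetical helper for illustration. Characters outside the
    20 standard residues (e.g. 'X') are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Made-up toy sequence (assumed, for illustration only).
composition = amino_acid_composition("ACCA")
print(composition["A"], composition["C"])
```

The result is a fixed-length 20-dimensional feature vector regardless of sequence length, which is what makes it usable as input to standard classifiers.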

