2020
DOI: 10.1093/bioinformatics/btaa003

UDSMProt: universal deep sequence models for protein classification

Abstract: Motivation: Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language mo…
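The abstract's core idea, learning a task-agnostic representation from unlabeled sequences via self-supervised language modeling, can be illustrated with a deliberately simplified sketch (a bigram next-residue model standing in for the paper's recurrent language model; the corpus and all names here are hypothetical):

```python
# Hypothetical simplification of self-supervised pretraining on protein
# sequences: the "labels" are just the next residues, so no annotation,
# handcrafted features, or database searches are needed.
from collections import Counter, defaultdict

def train_bigram_lm(sequences):
    """Estimate P(next residue | current residue) from unlabeled sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[cur] = {aa: c / total for aa, c in nxt_counts.items()}
    return probs

# Toy corpus of unlabeled amino-acid sequences.
corpus = ["MKTAYIAKQR", "MKTLLLTLVV", "MKVKVLSLLV"]
lm = train_bigram_lm(corpus)
print(lm["M"]["K"])  # 'K' always follows 'M' in this toy corpus -> 1.0
```

In UDSMProt itself the representation learned this way is reused across several classification tasks; the bigram counts above merely show the shape of the self-supervised objective.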


Cited by 147 publications (91 citation statements) · References 31 publications
“…This observation is interesting considering the fact that for alleles with fewer than 1000 training measurements, MHCFlurry was pretrained on an augmented training set with measurements from BLOSUM similar alleles, USMPep_LM_ens was pretrained on a large corpus of unlabeled peptides and USMPep_FS_ens in contrast only saw the training sequences corresponding to one MHC molecule. These results stress that further efforts might be required to truly leverage the potential of unlabeled peptide data in order to observe similar improvements as seen for proteins [15] in particular for small datasets.…”
Section: Kim14 Dataset (confidence: 59%)
“…The approach builds on the UDSMProt-framework [15] and related work in natural language processing [16]. We distinguish two variants of our approach, either train the regression from scratch or employ language model pretraining.…”
Section: USMPep: Universal Sequence Models for Peptide Binding Prediction (confidence: 99%)
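The two variants distinguished in the statement above can be sketched as follows (a hypothetical minimal setup, not the authors' implementation): both variants share the same model; they differ only in whether the token embedding is initialized randomly or reused from a pretrained language model.

```python
# Sketch of "from scratch" vs. "language model pretraining" (names and
# dimensions hypothetical): the variants differ only in embedding init.
import numpy as np

VOCAB = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
EMB_DIM = 8
rng = np.random.default_rng(0)

def make_embedding(pretrained=None):
    """From scratch: random init. Pretrained: reuse language-model weights."""
    if pretrained is not None:
        return pretrained.copy()
    return rng.normal(0, 0.1, size=(len(VOCAB), EMB_DIM))

def encode(seq, emb):
    """Mean-pooled embedding as a fixed-size peptide representation."""
    idx = [VOCAB.index(aa) for aa in seq]
    return emb[idx].mean(axis=0)

# Variant 1: train the regression from scratch on random embeddings.
emb_scratch = make_embedding()
# Variant 2: start from weights learned by a pretrained language model
# (stood in here by a fixed matrix; in practice loaded from a checkpoint).
pretrained_weights = rng.normal(0, 0.1, size=(len(VOCAB), EMB_DIM))
emb_pretrained = make_embedding(pretrained=pretrained_weights)

x = encode("MKTAYI", emb_pretrained)
print(x.shape)  # (8,)
```

A regression head trained on top of either `encode` output would complete the pipeline; the transfer benefit comes entirely from the initialization.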
“…[14] This type of encoding method has been demonstrated to be extremely useful in certain tasks [12,14-16]. Before encoding a sequence as dense numeric vectors, the sequence is typically represented as an integer vector in which each token is represented by a unique integer. The final method is to design handcrafted features and then take these features as input for modeling.…”
Section: Basic Concepts in Deep Learning (confidence: 99%)
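The integer-then-dense encoding pipeline described in the quoted statement can be illustrated in a few lines (the embedding values below are random placeholders, not trained weights):

```python
# Map each residue token to a unique integer, then look those integers up
# in a dense embedding matrix -- the standard first layer of a sequence model.
import numpy as np

TOKENS = "ACDEFGHIKLMNPQRSTVWY"
token_to_int = {aa: i for i, aa in enumerate(TOKENS)}  # 'A'->0, 'C'->1, ...

def to_integers(seq):
    """Integer-encode an amino-acid sequence, one unique id per token."""
    return [token_to_int[aa] for aa in seq]

rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(TOKENS), 4))  # 4-dim dense vector per token

ids = to_integers("MKT")
dense = embedding[ids]  # shape (3, 4): one dense vector per residue
print(ids)  # [10, 8, 16]
```

In a trained model the embedding matrix is learned jointly with the rest of the network, so the dense vectors end up encoding residue properties relevant to the task.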