2022
DOI: 10.1038/s41587-021-01179-w

Using deep learning to annotate the protein universe

Abstract: Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate 1/3 of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. In this paper, we explore an alternative methodology based on deep learning that learns the relationship between unaligned amino acid sequences and their …

Cited by 200 publications (191 citation statements)
References 49 publications
“…The larger the family, the more likely it is to contain sequences that bridge together otherwise dissimilar clusters, so the procedure fails more often on alignments with many sequences. This is a concern because we and others are exploring increasingly complex and parameter-rich models for remote sequence homology recognition that can require thousands of sequences for training [8][9][10][11][12][13]. A phylogenetic approach that attempts to identify an out-group would face this same "bridge" issue.…”
Section: Introduction (mentioning)
confidence: 99%
“…Using the train-test split in [4] for the sequences belonging to the 10000 lens families, we train a lens architecture to directly perform family classification for these families. We measure the accuracy of this trained lens architecture on the test subset of the lens families, which we refer to as Lens Accuracy.…”
Section: Lens Training (mentioning)
confidence: 99%
“…We thus take the lens architecture, trained using lens training, and use it to compute fixed-length embeddings for the sequences belonging to the evaluation families. Again using the train-test split in [4], but this time for the sequences belonging to the 1000 evaluation families, we fit a 1-NN family classifier using the sequence embeddings. We fit using at most n samples per family from the train subset of the evaluation families and consider n ∈ {1, 5, 10, 50} to see how well our embeddings perform with limited data.…”
Section: Evaluating Protein Embeddings For Nearest-neighbor Retrieval (mentioning)
confidence: 99%
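
The evaluation described in the statement above (fit a 1-NN family classifier on fixed-length embeddings, using at most n training samples per family) can be illustrated with a minimal sketch. This is not the citing authors' code: the function names are hypothetical, and the squared Euclidean distance and the random per-family subsampling are assumptions, since the quoted text specifies neither.

```python
import random
from collections import defaultdict


def subsample_per_family(embeddings, labels, n, seed=0):
    """Keep at most n training embeddings per family (random choice is an assumption)."""
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for emb, fam in zip(embeddings, labels):
        by_family[fam].append(emb)
    kept_emb, kept_lab = [], []
    for fam, embs in by_family.items():
        chosen = embs if len(embs) <= n else rng.sample(embs, n)
        kept_emb.extend(chosen)
        kept_lab.extend([fam] * len(chosen))
    return kept_emb, kept_lab


def squared_distance(a, b):
    """Squared Euclidean distance (the metric is an assumption, not from the source)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def one_nn_predict(train_emb, train_lab, query):
    """Return the family label of the training embedding nearest to the query."""
    best = min(range(len(train_emb)),
               key=lambda i: squared_distance(train_emb[i], query))
    return train_lab[best]


def one_nn_accuracy(train_emb, train_lab, test_emb, test_lab, n, seed=0):
    """Fraction of test embeddings whose nearest retained neighbor shares their family."""
    kept_emb, kept_lab = subsample_per_family(train_emb, train_lab, n, seed)
    correct = sum(one_nn_predict(kept_emb, kept_lab, q) == y
                  for q, y in zip(test_emb, test_lab))
    return correct / len(test_lab)


# Toy 2-D "embeddings" for two made-up families, for illustration only.
toy_train_emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
toy_train_lab = ["famA", "famA", "famA", "famB", "famB"]
toy_test_emb = [(0.2, 0.2), (4.9, 5.2)]
toy_test_lab = ["famA", "famB"]
```

Calling `one_nn_accuracy` once for each n in {1, 5, 10, 50} would reproduce the shape of the limited-data comparison the statement describes, with real embeddings substituted for the toy data.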