CATH functional families predict protein functional sites

Das, Sayoni; Scholes, Harry M.; Orengo, Christine

doi:10.1101/2020.03.23.003012

Cited by 2 publications

(3 citation statements)

References 57 publications


“…FunFams can be used to predict function on a per-protein level as described through Gene Ontology (GO) terms [6,7], to predict functional sites [8], or to improve binding residue predictions by combining predictions of one FunFam in a consensus prediction [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…While proteins in one superfamily can still be functionally and structurally diverse, Functional Families (FunFams) further subclassify superfamilies into coherent subsets of proteins with the same function. FunFams can be used to predict function on a per-protein level as described through Gene Ontology (GO) terms [6,7], to predict functional sites [8], or to improve binding residue predictions through consensus [9].…”

Section: Introduction (mentioning)

confidence: 99%


Clustering FunFams using sequence embeddings improves EC purity

Littmann, Bordin, Heinzinger et al. 2021

Preprint

Self Cite


Motivation: Classifying proteins into functional families can improve our understanding of a protein’s function and can allow transferring annotations within the same family. Toward this end, functional families need to be “pure”, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function, based on differentially conserved residues. 11% of all FunFams (22,830 of 203,639) also contain EC annotations and of those, 7% (1,526 of 22,830) have at least two different EC annotations, i.e., inconsistent functional annotations.

Results: We propose an approach to further cluster FunFams into smaller and functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from deep learned language models (LMs) transferring the knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between sequences in embedding space and DBSCAN to cluster FunFams, as well as to identify outlier sequences, resulted in twice as many pure clusters per FunFam as random clustering. 52% of the impure FunFams were split into pure clusters, four times more than for random clustering. While functional consistency was mainly measured using EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other definitions of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency can be used to infer annotations more reliably. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

Availability: The source code and PB-Tucker embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering
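The purity criterion in the abstract above (a family is inconsistent when its members carry at least two different EC annotations) can be illustrated with a minimal sketch. The function names, the toy family data, and the choice to compare full EC numbers as plain strings are assumptions made for illustration; they are not taken from the authors' code.

```python
def distinct_ecs(ec_annotations):
    """Return the set of distinct EC numbers, ignoring unannotated members (None)."""
    return {ec for ec in ec_annotations if ec is not None}


def is_impure(ec_annotations):
    """A family counts as impure if its members carry at least two different EC numbers."""
    return len(distinct_ecs(ec_annotations)) >= 2


def purity_summary(families):
    """Return (all families, families with any EC annotation, impure families)."""
    annotated = [name for name, ecs in families.items() if distinct_ecs(ecs)]
    impure = [name for name in annotated if is_impure(families[name])]
    return len(families), len(annotated), len(impure)


# Hypothetical toy data: family name -> EC number per member (None = no annotation).
toy_families = {
    "family_a": ["1.1.1.1", "1.1.1.1", None],   # annotated, pure
    "family_b": ["2.7.11.1", "3.4.21.4"],       # annotated, impure
    "family_c": [None, None],                   # no EC annotation
}
print(purity_summary(toy_families))

# Percentages recomputed from the counts quoted in the abstract.
print(round(100 * 22830 / 203639), round(100 * 1526 / 22830))
```

Recomputing the quoted counts this way gives 11% and 7%, matching the rounded figures stated in the abstract.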


“…While proteins in one superfamily can still be functionally and structurally diverse, Functional Families (FunFams) further sub-classify superfamilies into coherent subsets of proteins with the same function. FunFams can be used to predict function on a per-protein level as described through Gene Ontology (GO) terms ( Das et al , 2015 ; Zhou et al , 2019 ), to predict functional sites ( Das et al , 2020 ), or to improve binding residue predictions through consensus ( Scheibenreif et al , 2019 ).…”

Section: Introduction (mentioning)

confidence: 99%

Clustering FunFams using sequence embeddings improves EC purity

Littmann, Bordin, Heinzinger et al. 2021

Bioinformatics


Motivation: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be “pure”, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22,830 of 203,639) contain EC annotations and of those, 7% (1,526 of 22,830) have inconsistent functional annotations.

Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

Availability: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering

Supplementary information: Supplementary data are available online.
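The abstract above names DBSCAN over embedding distances as the way clusters and outliers are found. The following is a minimal plain-Python sketch of the textbook DBSCAN procedure, not the authors' implementation: the Euclidean metric, the `eps` and `min_samples` values, and the two-dimensional toy vectors standing in for sequence embeddings are all assumptions chosen for illustration.

```python
import math

NOISE = -1  # label for outliers that belong to no cluster


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    # Each neighborhood includes the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical 2-D vectors standing in for sequence embeddings of one family.
toy_embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(toy_embeddings, eps=1.5, min_samples=2)
print(labels)  # → [0, 0, 0, 1, 1, 1, -1]
```

In this toy run the two tight groups become separate clusters and the isolated point is labeled as an outlier, which mirrors the split-plus-outlier behavior the abstract describes at a much smaller scale.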
