Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-1037-6) contains supplementary material, which is available to authorized users.
ProtoNet 6.0 (http://www.protonet.cs.huji.ac.il) is a data structure of protein families that cover the protein sequence space. These families are generated through an unsupervised bottom-up clustering algorithm. This algorithm organizes large sets of proteins in a hierarchical tree that yields high-quality protein families. The 2012 ProtoNet (Version 6.0) tree includes over 9 million proteins, of which 5.5% come from UniProtKB/SwissProt and the rest from UniProtKB/TrEMBL. The hierarchical tree structure is based on an all-against-all comparison of 2.5 million representatives of UniRef50. Rigorous annotation-based quality tests prune the tree to the 162 088 most informative clusters. Every high-quality cluster is assigned a ProtoName that reflects the most significant annotations of its proteins. These annotations are dominated by GO terms, UniProt/Swiss-Prot keywords and InterPro. ProtoNet 6.0 operates in a default mode. When used in the advanced mode, this data structure offers the user a view of the family tree at any desired level of resolution. Systematic comparisons with previous versions of ProtoNet are carried out. They show how our view of protein families evolves as larger parts of the sequence space become known. ProtoNet 6.0 provides numerous tools to navigate the hierarchy of clusters.
Motivation: Modern protein sequencing techniques have led to the determination of >50 million protein sequences. ProtoNet is a clustering system that provides a continuous hierarchical agglomerative clustering tree for all proteins. While ProtoNet performs unsupervised classification of all included proteins, finding an optimal level of granularity for the purpose of focusing on protein functional groups remains elusive. Here, we ask whether knowledge-based annotations on protein families can support the automatic unsupervised methods for identifying high-quality protein families. We present a method that yields within the ProtoNet hierarchy an optimal partition of clusters, relative to manual annotation schemes. The method’s principle is to minimize the entropy-derived distance between annotation-based partitions and all available hierarchical partitions. We describe the best front (BF) partition of 2 478 328 proteins from UniRef50. Of 4 929 553 ProtoNet tree clusters, the BF based on Pfam annotations contains 26 891 clusters. The high quality of the partition is validated by the close correspondence with the set of clusters that best describe thousands of keywords of Pfam. The BF is shown to be superior to a naïve cut in the ProtoNet tree that yields a similar number of clusters. Finally, we used parameters intrinsic to the clustering process to enrich a priori the BF’s clusters. We present the entropy-based method’s benefit in overcoming the unavoidable limitations of nested clusters in ProtoNet. We suggest that this automatic information-based cluster selection can be useful for other large-scale annotation schemes, as well as for systematically testing and comparing putative families derived from alternative clustering methods. Availability and implementation: A catalog of BF clusters for thousands of Pfam keywords is provided at http://protonet.cs.huji.ac.il/bestFront/ Contact: michall@cc.huji.ac.il
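To make the idea of an entropy-derived distance between partitions concrete, the sketch below computes the variation of information, VI(A, B) = H(A|B) + H(B|A), between an annotation-based labeling and candidate cuts of a tree. This is an illustration only: the abstract does not specify the exact distance used for the BF, so the choice of VI is an assumption, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in bits for two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p(a,b)/p(a) equals n_ab/count_a[a]; likewise for b.
        vi -= p_ab * (log2(n_ab / count_a[a]) + log2(n_ab / count_b[b]))
    return vi


def closest_partition(annotation, candidates):
    """Return the name of the candidate partition with the smallest VI to the annotation."""
    return min(candidates, key=lambda name: variation_of_information(annotation, candidates[name]))


# Toy data (hypothetical): four proteins, one annotation scheme, two tree cuts.
annotation = ["kinase", "kinase", "protease", "protease"]
candidates = {
    "coarse_cut": [0, 0, 0, 0],  # everything in one cluster
    "fine_cut": [0, 0, 1, 1],    # matches the annotation exactly
}
best = closest_partition(annotation, candidates)
```

In this toy case the fine cut reproduces the annotation and has distance zero, while the coarse cut loses one bit of information, so the fine cut is selected.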
Hypothyroidism is a common disorder of the endocrine system in which the thyroid gland does not produce enough thyroid hormones. About 12% of the population in the USA will develop substantial thyroid deficiency over their lifetime, mostly as a result of iodine deficiency. The hypothyroidism phenotype also includes individuals that suffer from thyroid development abnormalities (congenital hypothyroidism, CH). Using a large population study, we aimed to identify the functional genes associated with an increased or decreased risk for hypothyroidism (ICD-10, E03). To this end, we used the gene-based proteome-wide association study (PWAS) method to detect associations mediated by the effects of variants on the protein function of all coding genes. The UK-Biobank (UKB) reports on 13,687 cases out of 274,824 participants of European ancestry, with a prevalence of 7.5% and 2.0% for females and males, respectively. The results from PWAS for ICD-10 E03 are a ranked list of 77 statistically significant genes (FDR-q-value <0.05) and an extended list of 95 genes with a weaker threshold (FDR-q-value <0.1). Validation was performed using the FinnGen Freeze 7 (Fz7) database across several GWAS with 33.5k to 44.5k cases. We validated 9 highly significant genes across the two independent cohorts. About 12% of the PWAS reported genes are strictly associated with a recessive inheritance model that is mostly overlooked by GWAS. Furthermore, PWAS performed by sex stratification identified 9 genes in males and 63 genes in females. However, resampling and statistical permutation tests confirmed that the genes involved in hypothyroidism are common to both sexes. Many of these genes function in the recognition and response of immune cells, with a strong signature of autoimmunity. Additional genetic association protocols, including PWAS, TWAS (transcriptional WAS), Open Targets (OT, unified GWAS) and coding-GWAS, revealed the complex etiology of hypothyroidism.
Each association method highlights a different facet of the disease, including the developmental program of CH, autoimmunity, gene dysregulation, and sex-related gene enrichment. We conclude that genome association methods are complementary while each one reveals different aspects of hypothyroidism. Applying a multiple-protocol approach to complex diseases is expected to improve interpretability and clinical utility.
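The FDR q-value thresholds quoted above (<0.05 and <0.1) can be illustrated with a minimal sketch of the Benjamini-Hochberg adjustment. The abstract does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the gene names and p-values are made up for illustration.

```python
def bh_qvalues(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical per-gene p-values, not taken from the study.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
pvalues = [0.001, 0.02, 0.03, 0.5]
qvalues = bh_qvalues(pvalues)

# Genes passing the stricter and the weaker threshold.
significant_strict = [g for g, q in zip(genes, qvalues) if q < 0.05]
significant_extended = [g for g, q in zip(genes, qvalues) if q < 0.1]
```

A weaker threshold can only keep or enlarge the gene list, which is why the extended list in the abstract contains the stricter one.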