Functions of proteins are annotated with Gene Ontology (GO) terms. Because new sequences are being collected faster than they can be annotated with GO terms, there have been efforts to develop better automatic annotation techniques. When annotating protein sequences with GO terms, one key auxiliary resource is the GO data itself. GO terms have definitions consisting of a few sentences describing a biological event, and they are arranged in a tree structure in which specific terms are child nodes of generic terms. The definitions and the positions of GO terms in the GO tree can be used to construct vector representations of the terms. These GO vectors can then be integrated into existing prediction models to improve classification accuracy. In this paper, we adapt Bidirectional Encoder Representations from Transformers (BERT) to encode GO definitions into vectors. We evaluate BERT against previous GO encoders on three tasks: (1) measuring similarity between GO terms, (2) asserting relationships between orthologs and interacting proteins based on their GO annotations, and (3) predicting GO terms for protein sequences. For task 3, we show that using GO vectors as additional prediction features increases accuracy, primarily for GO terms with few occurrences in the manually annotated dataset. On all three tasks, BERT often outperforms the previous GO encoders.
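For task (1), similarity between two GO terms can be scored as the cosine of the angle between their vectors. A minimal sketch with hypothetical 4-dimensional embeddings (actual BERT sentence embeddings of GO definitions would typically be 768-dimensional, and the abstract does not specify the similarity measure used):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings for two GO terms; in practice these would
# come from encoding each term's definition text with BERT.
go_vec_a = np.array([0.2, 0.8, 0.1, 0.4])
go_vec_b = np.array([0.1, 0.9, 0.0, 0.5])

sim = cosine_similarity(go_vec_a, go_vec_b)
```

Terms with semantically close definitions should receive similarity near 1, while unrelated terms should score near 0.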
Predicting missing facts in a knowledge graph (KG) is a crucial task in knowledge base construction and reasoning, and it has been the subject of much recent research using KG embeddings. While existing KG embedding approaches mainly learn and predict facts within a single KG, a more plausible solution would benefit from the knowledge in multiple language-specific KGs, considering that different KGs have their own strengths and limitations in data quality and coverage. This is quite challenging, since the transfer of knowledge among multiple independently maintained KGs is often hindered by the insufficiency of alignment information and the inconsistency of described facts. In this paper, we propose KEnS, a novel framework for embedding learning and ensemble knowledge transfer across a number of language-specific KGs. KEnS embeds all KGs in a shared embedding space, where the association of entities is captured based on self-learning. Then, KEnS performs ensemble inference to combine prediction results from embeddings of multiple language-specific KGs, for which multiple ensemble techniques are investigated. Experiments on five real-world language-specific KGs show that KEnS consistently improves state-of-the-art methods on KG completion by effectively identifying and leveraging complementary knowledge.
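The ensemble-inference idea can be illustrated with a toy weighted rank-combination scheme: each language-specific model contributes a ranked candidate list, and candidates are rescored by weighted Borda counting. This is a hypothetical sketch of the general technique, not KEnS's actual ensemble methods:

```python
from collections import defaultdict

def ensemble_rank(predictions: dict, weights: dict) -> list:
    """Combine per-KG ranked candidate lists via weighted Borda counting.

    predictions maps a KG name to its ranked list of candidate entities
    (best first); weights reflect how much each KG is trusted.
    """
    scores = defaultdict(float)
    for kg, ranked in predictions.items():
        for rank, entity in enumerate(ranked):
            # Higher-ranked candidates earn more points.
            scores[entity] += weights[kg] * (len(ranked) - rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical tail-entity candidates for one query from three KGs.
preds = {
    "en": ["Paris", "Lyon", "Berlin"],
    "fr": ["Paris", "Marseille", "Lyon"],
    "ja": ["Lyon", "Paris", "Tokyo"],
}
kg_weights = {"en": 1.0, "fr": 0.8, "ja": 0.6}
combined = ensemble_rank(preds, kg_weights)
```

A candidate ranked highly by several KGs outscores one favored by only a single KG, which is how complementary knowledge across sources gets leveraged.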
Predicting functions for novel amino acid sequences is a long-standing research problem. The Uniprot database, which contains protein sequences annotated with Gene Ontology (GO) terms, is one commonly used training dataset for this problem. Predicting protein functions can then be viewed as a multi-label classification problem where the input is an amino acid sequence and the output is a set of GO terms. Recently, deep convolutional neural network (CNN) models have been introduced to annotate GO terms for protein sequences. However, the CNN architecture can only model close-range interactions between amino acids in a sequence. In this paper, first, we build a novel GO annotation model based on the Transformer neural network. Unlike the CNN architecture, the Transformer models all pairwise interactions among the amino acids within a sequence, and so can capture more relevant information from the sequences. Indeed, we show that our adaptation of the Transformer yields higher classification accuracy than the recent CNN-based method DeepGO. Second, we modify our model to take motifs in the protein sequences found by BLAST as additional input features. Our strategy is different from other ensemble approaches that average the outcomes of BLAST-based and machine-learning predictors. Third, we integrate into our Transformer metadata about the protein sequences, such as 3D structure and protein-protein interaction (PPI) data. We show that such information can greatly improve prediction accuracy, especially for rare GO labels.

1 Introduction

Predicting protein functions is an important task in computational biology. With the cost of sequencing continuing to decrease, the gap between the numbers of labeled and unlabeled sequences continues to grow [18]. Protein functions are described by Gene Ontology (GO) terms [16].
Predicting protein functions is a multi-label classification problem where the input is an amino acid sequence and the output is a set of GO terms. GO terms are organized into a hierarchical tree, where generic terms (e.g. cellular anatomical entity) are parents of specific terms (e.g. perforation plate). Due to this tree structure, if a GO term is assigned to a protein, then all its ancestors are also assigned to this same protein. When analyzing only the amino acid sequence data to predict protein functions, there are two major trends. The first trend relies on string-matching models like the Basic Local Alignment Search Tool (BLAST) to match the unknown sequence with labeled proteins in the database [11]. Recently, Zhang et al. [18] combined BLAST with the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) to retrieve even more labeled proteins which are possibly related to the unknown sequence. The key idea behind BLAST methods is to retrieve proteins that resemble the unknown sequence; most likely, these retrieved proteins will contain similar evolutionarily co...
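The hierarchy constraint described above implies a simple post-processing step: any predicted GO term entails all of its ancestors. A minimal sketch with a hypothetical miniature hierarchy (real GO is in fact a directed acyclic graph with tens of thousands of terms, so each term may have several parents):

```python
def propagate_annotations(terms: set, parents: dict) -> set:
    """Close a set of GO terms under the ancestor relation.

    parents maps each GO term to its direct parent terms; assigning a
    term to a protein entails assigning every ancestor as well.
    """
    closed = set(terms)
    stack = list(terms)
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in closed:
                closed.add(parent)
                stack.append(parent)
    return closed

# Hypothetical toy hierarchy; real GO identifiers look like GO:0005575.
toy_parents = {
    "perforation plate": ["cell wall part"],
    "cell wall part": ["cellular anatomical entity"],
    "cellular anatomical entity": [],
}
labels = propagate_annotations({"perforation plate"}, toy_parents)
```

Applying this closure to both training labels and model outputs keeps predictions consistent with the ontology.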
Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general-language corpora transfer poorly due to domain gaps, especially when applied to gender-inclusive data about lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyze the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the U.S. Centers for Disease Control and Prevention's (CDC) National Violent Death Reporting System. We develop a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as on existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.
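Rule-based augmentation of this kind can be illustrated with a single hypothetical substitution rule that probabilistically rewrites gendered pronouns to neutral ones; the paper's actual rules and its data-programming framework are more involved than this sketch:

```python
import random

# Hypothetical substitution table; real augmentation rules must also
# distinguish object vs. possessive forms and fix verb agreement.
PRONOUN_MAP = {"he": "they", "she": "they", "him": "them",
               "her": "them", "his": "their", "hers": "theirs"}

def augment(sentence, p=0.5, rng=None):
    """Probabilistically replace gendered pronouns with neutral ones."""
    rng = rng or random.Random()
    out = []
    for token in sentence.split():
        bare = token.lower().strip(".,")
        if bare in PRONOUN_MAP and rng.random() < p:
            # Replace the pronoun while keeping trailing punctuation.
            out.append(token.lower().replace(bare, PRONOUN_MAP[bare]))
        else:
            out.append(token)
    return " ".join(out)
```

With probability p each pronoun is rewritten, so repeated sampling yields augmented copies of a narrative that expose the model to a wider range of pronoun usage during training.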