The scientific community is rapidly generating protein sequence information, but only a fraction of these proteins can be experimentally characterized. While promising deep learning approaches for protein prediction tasks have emerged, they have computational limitations or are designed to solve a specific task. We present a Transformer neural network that pre-trains task-agnostic sequence representations. This model is fine-tuned to solve two different protein prediction tasks: protein family classification and protein interaction prediction. Our method is comparable to existing state-of-the-art approaches for protein family classification while being much more general than other architectures, and it outperforms all other approaches for protein interaction prediction. These results offer a promising framework for fine-tuning the pre-trained sequence representations for other protein prediction tasks.

Characterizing protein sequences to identify functional characteristics is critical to understanding cellular functions as well as developing potential therapeutic applications [4]. Sequence-based computational methods have therefore been critical for inferring protein function and other characteristics [5]. Thus, the development of computational methods to infer protein characteristics (which we generally describe as "protein prediction tasks") has become paramount in bioinformatics and computational biology. Here, we develop a Transformer neural network to establish task-agnostic representations of protein sequences, and we use this network to solve two protein prediction tasks.

Background: Deep Learning

Deep learning, a class of machine learning based on artificial neural networks, has recently transformed computational biology and medicine through its application to long-standing problems such as image analysis, gene expression modeling, sequence variant calling, and putative drug discovery [6,7,8,9,10]. By leveraging deep learning, field specialists have been able to efficiently design and train models without the extensive feature engineering required by previous methods.

In applying deep learning to sequence-based protein characterization tasks, we first consider the field of natural language processing (NLP), which aims to analyze human language through computational techniques [11]. Deep learning has recently proven to be a critical tool for NLP, achieving state-of-the-art performance on benchmarks for named entity recognition, sentiment analysis, question answering, and text summarization, among others [12,13].

Neural networks are functions that map one vector space to another. Thus, in order to use them for NLP tasks, we first need to represent words as real-valued vectors. Often referred to as word embeddings, these vector representations are typically "pre-trained" on an auxiliary task for which we have (or can automatically generate) a large amount of training data. The goal of this pre-training is to learn generically useful representations that enc...
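The pre-train/fine-tune recipe described above can be made concrete with a short sketch. The PyTorch code below is a minimal illustration, not the authors' implementation: the amino-acid vocabulary, model sizes, masking rate, and the mean-pooled classification head are all assumptions chosen for brevity.

```python
# Minimal sketch of pre-training task-agnostic protein representations with a
# Transformer encoder, then fine-tuning them for family classification.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 0, 1                                  # special token ids
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB_SIZE = len(VOCAB) + 2

class ProteinTransformer(nn.Module):
    """Task-agnostic Transformer encoder over amino-acid tokens."""
    def __init__(self, d_model=128, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)   # pre-training head

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(pos)
        return self.encoder(h, src_key_padding_mask=tokens.eq(PAD))

def pretrain_step(model, tokens, optimizer, mask_prob=0.15):
    """Masked-token pre-training: hide random residues, predict them back."""
    mask = (torch.rand_like(tokens, dtype=torch.float) < mask_prob) & tokens.ne(PAD)
    logits = model.mlm_head(model(tokens.masked_fill(mask, MASK)))
    loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class FamilyClassifier(nn.Module):
    """Fine-tuning: mean-pool the pre-trained representations, then classify."""
    def __init__(self, pretrained, num_families):
        super().__init__()
        self.backbone = pretrained                # reuse pre-trained weights
        self.head = nn.Linear(pretrained.embed.embedding_dim, num_families)

    def forward(self, tokens):
        h = self.backbone(tokens)                 # (batch, len, d_model)
        return self.head(h.mean(dim=1))           # one logit per family
```

In practice the encoder would be pre-trained on a large corpus of unlabeled sequences before the task-specific head is attached, which is what makes the learned representations task-agnostic.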
We detect ongoing innovation in empirical data about human technological innovations. Ongoing technological innovation is a form of open-ended evolution, but it occurs in a nonbiological, cultural population that consists of actual technological innovations that exist in the real world. The change over time of this population of innovations seems to be quite open-ended. We take patented inventions as a proxy for technological innovations and mine public patent records for evidence of the ongoing emergence of technological innovations, and we compare two ways to detect it. One way detects the first instances of predefined patent pigeonholes, specifically the technology classes listed in the United States Patent Classification (USPC). The second way embeds patents in a high-dimensional semantic space and detects the emergence of new patent clusters. After analyzing hundreds of years of patent records, both methods detect the emergence of new kinds of technologies, but clusters are much better at detecting innovations that are unanticipated and undetected by USPC pigeonholes. Our clustering methods generalize to detect unanticipated innovations in other evolving populations that generate ongoing streams of digital data.
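As a rough illustration of the cluster-based detection idea, the sketch below embeds patent texts in a shared vector space and flags clusters in a newer time window that sit far from every cluster in an earlier window. The TF-IDF embedding, the use of k-means, and the novelty threshold are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: detect candidate "new technology" clusters by comparing
# cluster centroids across two time windows of patent texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

def detect_new_clusters(old_texts, new_texts, k=20, novelty_threshold=0.8):
    """Return centroids of new-window clusters far from all old-window clusters."""
    vec = TfidfVectorizer(max_features=5000)
    vec.fit(old_texts + new_texts)                      # shared semantic space
    old_km = KMeans(n_clusters=k, n_init=10).fit(vec.transform(old_texts))
    new_km = KMeans(n_clusters=k, n_init=10).fit(vec.transform(new_texts))
    novel = []
    for c in new_km.cluster_centers_:
        # distance from this new centroid to its nearest old centroid
        if np.min(np.linalg.norm(old_km.cluster_centers_ - c, axis=1)) > novelty_threshold:
            novel.append(c)                             # candidate new technology
    return novel
```

Unlike the pigeonhole approach, nothing here requires the new technology class to be anticipated in advance: a cluster counts as novel simply because it has no nearby counterpart in the earlier window.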
Despite their lack of a rigid structure, intrinsically disordered regions in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate disordered regions of proteins with high accuracy. Most popular tools use evolutionary or biophysical features to make predictions of disordered regions. In this study, we present DR-BERT, a compact protein language model that is first pretrained on a large number of unannotated proteins before being trained to predict disordered regions. Although it does not use any evolutionary or biophysical information, DR-BERT shows a statistically significant improvement when compared to several existing methods on a gold standard dataset. We show that this performance is due to the information learned during pre-training and DR-BERT's ability to use contextual information. A web application for using DR-BERT is available at https://huggingface.co/spaces/nambiar4/DR-BERT and the code to run the model can be found at https://github.com/maslov-group/DR-BERT.
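A hypothetical usage sketch for a per-residue disorder predictor such as DR-BERT is shown below, using the Hugging Face transformers library. The model id, the tokenization scheme, and the index of the "disordered" label are assumptions made for illustration; consult the linked repository and web application for the released checkpoint and exact interface.

```python
# Hypothetical usage sketch for a per-residue disorder predictor.
# The checkpoint id and label index below are assumptions, not confirmed values.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

MODEL_ID = "nambiar4/DR-BERT"   # assumed id; see the repo/space for the real one

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"      # example protein sequence
inputs = tokenizer(" ".join(sequence), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, seq_len, num_labels)
# Assumes label index 1 = "disordered" and that the first/last tokens are
# special tokens that should be dropped.
disorder_prob = torch.softmax(logits, dim=-1)[0, 1:-1, 1]
print(disorder_prob)                                # one disorder score per residue
```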