Determining intrinsically disordered regions of proteins is essential for elucidating protein biological functions and the mechanisms of their associated diseases. As the gap between the number of experimentally determined protein structures and the number of protein sequences continues to grow exponentially, there is a need for developing an accurate and computationally efficient disorder predictor. However, current single-sequence-based methods are of low accuracy, while evolutionary profile-based methods are computationally intensive. Here, we proposed a fast and accurate protein disorder predictor LMDisorder that employed embedding generated by unsupervised pretrained language models as features. We showed that LMDisorder performs best in all single-sequence-based methods and is comparable or better than another language-model-based technique in four independent test sets, respectively. Furthermore, LMDisorder showed equivalent or even better performance than the state-of-the-art profile-based technique SPOT-Disorder2. In addition, the high computation efficiency of LMDisorder enabled proteome-scale analysis of human, showing that proteins with high predicted disorder content were associated with specific biological functions. The datasets, the source codes, and the trained model are available at https://github.com/biomed-AI/LMDisorder.
The interactions between proteins and nucleic acids play crucial roles in various biological activities and the design of new drugs. How to identify the nucleic-acid-binding sites accurately remains a challenging task. Currently, the existing sequence-based methods have limited predictive performance due to only considering contextual features of the sequential neighbors, while structure-based methods are not suitable for proteins mostly without known tertiary structures. Though protein structures predicted by AlphaFold2 could be used, the extensive computing requirement of AlphaFold2 hinders its use for genome-wide applications. Based on the recent breakthrough of ESMFold for fast protein structure prediction, we propose GLMSite, an accurate predictor for identifying DNA and RNA-binding sites through geometric graph learning on ESMFold predicted structures. Here, the predicted protein structures are employed to construct protein structural graph with residues as nodes and spatially neighboring residue pairs for edges. The node representations are further enhanced through the pre-trained language model ProtTrans. The network was trained through the geometric vector perceptron, and the learned geometric embeddings are input into a common network to learn common binding characteristics, followed by two fully connected layers to learn specific binding patterns for DNA and RNA, respectively. Through comprehensive tests on DNA/RNA benchmark datasets, GLMSite was shown to surpass the latest sequence-based methods and be comparable with structure-based methods. Moreover, the prediction was shown useful for identifying nucleic-acid-binding proteins, demonstrating its potential for protein function discovery. The datasets, source codes, together with trained models are available at https://github.com/biomed-AI/nucleic-acid-binding.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.