StarFunc: fusing template-based and deep learning approaches for accurate protein function prediction

Zhang, Chengxin; Liu, Quancheng; Freddolino, Lydia

doi:10.1101/2024.05.15.594113

2024


Preprint


Abstract: Deep learning has significantly advanced the development of high-performance methods for protein function prediction. Nonetheless, even for state-of-the-art deep learning approaches, template information remains an indispensable component in most cases. While many function prediction methods use templates identified through sequence homology or protein-protein interactions, very few methods detect templates through structural similarity, even though protein structures are the basis of their functions. Here, we…


Cited by 1 publication

References 43 publications


Protein2Text: Providing Rich Descriptions for Protein Sequences

Dotan, Lyubman, Bacharach, et al. 2024

Preprint


Understanding the functionality of proteins has been a focal point of biological research due to their critical roles in various biological processes. Unraveling protein functions is essential for advancements in medicine, agriculture, and biotechnology, enabling the development of targeted therapies, engineered crops, and novel biomaterials. However, this endeavor is challenging due to the complex nature of proteins, requiring sophisticated experimental designs and extended timelines to uncover their specific functions. Public large language models (LLMs), though proficient in natural language processing, struggle with biological sequences due to the unique and intricate nature of biochemical data. These models often fail to accurately interpret and predict the functional and structural properties of proteins, limiting their utility in bioinformatics. To address this gap, we introduce BetaDescribe, a collection of models designed to generate detailed and rich textual descriptions of proteins, encompassing properties such as function, catalytic activity, involvement in specific metabolic pathways, subcellular localizations, and the presence of particular domains. The trained BetaDescribe model receives protein sequences as input and outputs a textual description of these properties. BetaDescribe's starting point was the LLAMA2 model, which was trained on trillions of tokens. Next, we trained our model on datasets containing both biological and English text, allowing biological knowledge to be incorporated. We demonstrate the utility of BetaDescribe by providing descriptions for proteins that share little to no sequence similarity to proteins with functional descriptions in public datasets. We also show that BetaDescribe can be harnessed to conduct in-silico mutagenesis procedures to identify regions important for protein functionality without needing homologous sequences for the inference. 
Altogether, BetaDescribe offers a powerful tool to explore protein functionality, augmenting existing approaches such as annotation transfer based on sequence or structure similarity.
