Motivation
Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into informative embeddings suitable for a range of applications. PLMs, as well as many other protein representation schemes, generate per-token (i.e., per-residue) representations, leading to variable-sized outputs that depend on protein length. This variability presents a challenge for protein-level prediction tasks, which require uniform-sized embeddings for consistent analysis across different proteins. Prior work has typically resorted to average pooling to summarize token-level PLM outputs. It is, however, unclear whether such an aggregation operation effectively prioritizes the relevant information across token-level representations.

Results
Addressing this, we introduce a novel method that utilizes sliced-Wasserstein embeddings to convert variable-length PLM outputs into fixed-length protein-level representations. Inspired by the success of optimal transport techniques in representation learning, we first conceptualize per-token PLM outputs as samples from a probabilistic distribution. We then employ sliced-Wasserstein distances to map these samples against a learnable reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. Across a range of state-of-the-art pre-trained ESM-2 PLMs of varying model sizes, we show the superiority of our method over average pooling for protein-drug and protein-protein interaction prediction. Our aggregation scheme is especially effective when model size is constrained, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Since smaller models reduce computational resource requirements, our approach not only promises more accurate inference but can also help democratize access to foundation models.

Availability and implementation
The implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
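To make the aggregation step concrete, the sketch below shows one way a sliced-Wasserstein embedding of per-token PLM outputs against a learnable reference set could look in PyTorch. It is a minimal illustration, not the authors' implementation (see the repository above for that): the class name, the parameters num_ref_points and num_slices, and the linear-interpolation alignment of sorted projections are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class SlicedWassersteinEmbedding(nn.Module):
    """Illustrative sketch: map a variable-length set of per-residue vectors to a
    fixed-length protein-level embedding via sliced-Wasserstein distances to a
    learnable reference set. Hypothetical layer, not the paper's released code."""

    def __init__(self, dim, num_ref_points=64, num_slices=128):
        super().__init__()
        # Learnable reference set (num_ref_points x dim) and slicing directions (dim x num_slices).
        self.reference = nn.Parameter(torch.randn(num_ref_points, dim))
        self.theta = nn.Parameter(torch.randn(dim, num_slices))

    def forward(self, tokens):
        # tokens: (N, dim) per-token PLM outputs; N varies from protein to protein.
        num_ref = self.reference.shape[0]
        theta = self.theta / self.theta.norm(dim=0, keepdim=True)  # unit-norm slice directions

        # Project the input set and the reference set onto each 1-D slice.
        proj_x = tokens @ theta          # (N, num_slices)
        proj_r = self.reference @ theta  # (num_ref, num_slices)

        # In 1-D, optimal transport reduces to matching sorted samples.
        sorted_x, _ = torch.sort(proj_x, dim=0)
        sorted_r, _ = torch.sort(proj_r, dim=0)

        # Interpolate the sorted input projections to num_ref quantile points so that
        # inputs of any length N align with the reference set.
        quantiles = torch.linspace(0, 1, num_ref, device=tokens.device)
        idx = quantiles * (tokens.shape[0] - 1)
        lo, hi = idx.floor().long(), idx.ceil().long()
        frac = (idx - lo.float()).unsqueeze(1)
        interp_x = (1 - frac) * sorted_x[lo] + frac * sorted_x[hi]  # (num_ref, num_slices)

        # The embedding is the per-slice displacement between input and reference quantiles;
        # its size (num_ref * num_slices) is independent of the protein length N.
        return (interp_x - sorted_r).flatten()


# Usage sketch: a 211-residue protein embedded by an ESM-2-style PLM with hidden size 1280
# is reduced to a single fixed-length vector regardless of sequence length.
swe = SlicedWassersteinEmbedding(dim=1280)
protein_embedding = swe(torch.randn(211, 1280))  # shape: (64 * 128,)
```

Unlike average pooling, the output here retains per-slice distributional information about the token set, which is the property the abstract attributes to the proposed aggregation scheme.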