2024
DOI: 10.1016/j.cels.2024.01.008
Convolutions are competitive with transformers for protein sequence pretraining

Kevin K. Yang, Nicolo Fusi, Alex X. Lu

Cited by 17 publications (2 citation statements)
References 48 publications
“…Motivated by decades of research into biophysics, molecular dynamics, and protein simulation [10, 23, 24, 27, 35], we present METL, which leverages synthetic data from molecular simulations to pretrain biophysics-aware PLMs. These biophysical pretraining signals are in contrast to existing PLMs or multiple sequence alignment-based methods that train on natural sequences and capture signals related to evolutionary selective pressures [2, 7, 8, 14, 36, 37]. By pretraining on large-scale molecular simulations, METL builds a comprehensive map of protein biophysical space.…”
Section: Discussion
confidence: 99%
“…Some work has added graph neural network components to sequence models for downstream tasks, though these do not strictly qualify as fine-tuning methods. For example, ProtSSN 35 initializes EGNN 36 with sequence models to enhance variant prediction, MIF-ST 37 uses the CARP 38 language model to boost the inverse-folding capability of graph neural networks, and ESM-GearNet 39 enhances downstream task performance by combining ESM2 and GearNet.…”
Section: Introduction
confidence: 99%
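The citation statement above describes pairing pretrained protein sequence models with graph neural networks over protein structure. The following is only a minimal illustrative sketch of that generic pattern, not the actual code of ProtSSN, MIF-ST, or ESM-GearNet: the `seq_embeddings` tensor is a random placeholder standing in for per-residue embeddings from a real pretrained language model (e.g. ESM2 or CARP), and `SimpleGraphLayer` / `SequenceGNN` are hypothetical names for a toy message-passing network over a residue contact graph.

```python
# Hedged sketch only: refine placeholder language-model embeddings with a
# small GNN over a residue contact graph, then pool to one prediction.
import torch
import torch.nn as nn


class SimpleGraphLayer(nn.Module):
    """One round of mean-aggregation message passing over a residue graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_residues, dim); adj: (num_residues, num_residues) 0/1 contacts
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neighbor_mean = adj @ h / deg  # average embedding of contacting residues
        return torch.relu(self.update(torch.cat([h, neighbor_mean], dim=-1)))


class SequenceGNN(nn.Module):
    """Two GNN layers on top of sequence-model embeddings, mean-pooled to a
    single per-protein score (e.g. a variant-effect prediction)."""

    def __init__(self, dim: int):
        super().__init__()
        self.layers = nn.ModuleList([SimpleGraphLayer(dim) for _ in range(2)])
        self.head = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            h = layer(h, adj)
        return self.head(h.mean(dim=0))  # pool over residues, output a scalar


if __name__ == "__main__":
    num_residues, dim = 120, 64
    seq_embeddings = torch.randn(num_residues, dim)   # placeholder PLM output
    contacts = (torch.rand(num_residues, num_residues) < 0.05).float()
    contacts = ((contacts + contacts.T) > 0).float()  # symmetric contact map
    model = SequenceGNN(dim)
    print(model(seq_embeddings, contacts).item())
```

In practice the cited methods differ in how the two components are coupled (initialization, joint training, or feeding one model's outputs into the other); this sketch shows only the simplest "embeddings in, graph refinement, prediction out" arrangement.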