2022
DOI: 10.1101/2022.12.15.519894
Preprint

Codon language embeddings provide strong signals for protein engineering

Abstract: Protein representations from deep language models have yielded state-of-the-art performance across many tasks in computational protein engineering. In recent years, progress has primarily focused on parameter count, with recent models' capacities surpassing the size of the very datasets they were trained on. Here, we propose an alternative direction. We show that large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-th…
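
As a rough illustration of the approach the abstract describes, the sketch below builds a fixed-length protein representation by mean-pooling per-codon embedding vectors. It is a minimal sketch, not the authors' implementation: the randomly initialised 16-dimensional embedding table is a stand-in for a pretrained codon language model, which would supply these vectors in practice.

import itertools
import numpy as np

rng = np.random.default_rng(0)
# All 64 codons over the DNA alphabet.
CODONS = ["".join(c) for c in itertools.product("ACGT", repeat=3)]
# Toy 16-dimensional embedding table; a trained codon LM would replace this.
EMBED = {codon: rng.standard_normal(16) for codon in CODONS}

def embed_cds(cds: str) -> np.ndarray:
    """Mean-pool per-codon vectors into one sequence-level representation."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return np.mean([EMBED[c] for c in codons], axis=0)

print(embed_cds("ATGGCTGAT").shape)  # (16,) fixed-length representation
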

Cited by 13 publications (22 citation statements)
References 59 publications

“…This deluge of data poses both an opportunity and a challenge; on the one hand, abundant genomic data able to uncover the intricate patterns of natural variability across species and populations is vastly available; on the other hand, powerful deep learning methods, that can operate at large scale, are required to perform accurate signal extraction from this large amount of unlabelled genomic data. Large foundation models trained on sequences of nucleotides appear to be the natural choice to seize the opportunity in genomics research, with attempts to explore different angles recently proposed [18, 19, 20].…”
Section: Introduction
confidence: 99%
“…On the one hand, intricate patterns of natural variability across species and populations are vastly available; on the other hand, powerful deep learning methods, that can operate at large scale, are required to perform accurate signal extraction from unlabelled data. Large foundation models trained on sequences of nucleotides appear to be a natural choice to tackle this problem [15, 16, 17].…”
Section: Introduction
confidence: 99%
“…At the same time, new classes of ML models should be developed for protein fitness prediction to take advantage of uncertainty and introduce helpful inductive biases for the domain. There exist methods that take advantage of inductive biases and prior information about proteins, such as the assumption that most mutation effects are additive, or the incorporation of biophysical knowledge into models as priors. Another method biases the search toward variants with fewer mutations, which are more likely to be stable and functional. Domain-specific self-supervision has been explored by training models on codons rather than amino acid sequences. There are also efforts to utilize calibrated uncertainty about the predicted fitness of proteins that lie outside the domain of previously screened proteins in the training set, but there is a need to expand and further test these methods in real settings. It is still an open question whether supervised models can extrapolate beyond their training data to predict novel proteins. More expressive deep learning methods, such as deep kernels, could be explored as an alternative to Gaussian processes for uncertainty quantification in Bayesian optimization (BO). Overall, there is significant potential to improve ML-based protein fitness prediction to help guide the search toward proteins with ideal fitness.…”
Section: Navigating Protein Fitness Landscapes Using Machine Learning
confidence: 99%
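
The uncertainty-aware, model-guided search this statement alludes to can be made concrete with a small example. The sketch below is not from any cited paper: the one-hot featurisation, the screened dataset, and the candidate pool are toy assumptions, and it uses a Gaussian process with an upper-confidence-bound (UCB) acquisition rule to choose the next variant to screen.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> np.ndarray:
    """Flatten a fixed-length sequence into a one-hot feature vector."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

# Toy screened data: (variant, measured fitness); all sequences same length.
screened = [("ACDE", 0.2), ("ACDF", 0.5), ("GCDE", 0.1), ("ACWE", 0.7)]
candidates = ["ACWF", "GCWE", "ACDY"]

X = np.array([one_hot(s) for s, _ in screened])
y = np.array([f for _, f in screened])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
gp.fit(X, y)

# UCB acquisition: favour variants with high predicted fitness AND high
# predictive uncertainty, trading exploitation against exploration.
mu, sigma = gp.predict(np.array([one_hot(s) for s in candidates]),
                       return_std=True)
ucb = mu + 1.0 * sigma
print("next variant to screen:", candidates[int(np.argmax(ucb))])

In richer settings the one-hot features would typically be replaced by language-model embeddings, and the UCB coefficient (here 1.0) tuned to the screening budget.
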
“…However, synonymous mutations that do not change the amino acid sequence can still significantly impact protein expression and function, or can even relate to particular structural features. Predictors working at the level of nucleotides or codons may thus be better tools for protein design and modification, particularly in areas such as the prediction of expression, solubility, and aggregation. The fact that the 64-letter codon alphabet encodes richer information than the 20-letter amino acid alphabet can be directly exploited by ML models to improve performance on a wide range of tasks that are now being tackled at the protein sequence level. While there have been several studies, e.g.…”
Section: Major Gaps in the State of the Art
confidence: 99%
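
The informational gap between the two alphabets is easy to see in code. A minimal sketch, with an abridged codon table and an illustrative sequence: synonymous codons remain distinct tokens under codon-level tokenization but collapse to a single token under amino-acid-level tokenization.

# Abridged standard genetic code; remaining codons omitted for brevity.
CODON_TABLE = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TGT": "C", "TGC": "C",
    "GAT": "D", "GAC": "D",
}

def codon_tokens(cds: str) -> list[str]:
    """Split an in-frame coding sequence into 64-alphabet codon tokens."""
    assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

def amino_acid_tokens(cds: str) -> list[str]:
    """Translate the same sequence into 20-alphabet amino acid tokens."""
    return [CODON_TABLE[c] for c in codon_tokens(cds)]

cds = "GCTGCCGAT"              # Ala-Ala-Asp, using two synonymous Ala codons
print(codon_tokens(cds))       # ['GCT', 'GCC', 'GAT'] -> distinct tokens
print(amino_acid_tokens(cds))  # ['A', 'A', 'D']       -> distinction lost
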