2021
DOI: 10.1038/s42256-021-00310-5

Expanding functional protein sequence spaces using generative adversarial networks

Abstract: De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, however, mapping protein sequence to protein function is neither computationally nor experimentally tractable 1,2 . Here we developed ProteinGAN, a specialised variant of the generative adversarial network 3 that is able to 'learn' natural protein sequence diversity and enables the generation of funct…
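
The abstract describes an adversarial setup: a generator proposes sequences while a discriminator learns to separate them from natural ones, and each improves against the other. Below is a minimal sketch of that training loop for fixed-length one-hot protein sequences; the sequence length, vocabulary size, layer widths and optimizer settings are illustrative assumptions, not ProteinGAN's actual architecture.

```python
# Minimal GAN sketch for fixed-length, one-hot-encoded protein sequences.
# Illustrative assumptions throughout: sizes and settings are NOT ProteinGAN's.
import torch
import torch.nn as nn

SEQ_LEN, N_AA, NOISE_DIM = 128, 21, 64  # 20 amino acids + 1 pad (assumed)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, SEQ_LEN * N_AA),
        )

    def forward(self, z):
        logits = self.net(z).view(-1, SEQ_LEN, N_AA)
        # Softmax relaxation keeps gradients flowing through the
        # otherwise discrete residue choices.
        return torch.softmax(logits, dim=-1)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(SEQ_LEN * N_AA, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # real-vs-generated logit
        )

    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (B, SEQ_LEN, N_AA) one-hot natural sequences
    b = real.size(0)
    fake = G(torch.randn(b, NOISE_DIM))
    # Discriminator: score natural sequences high, generated ones low.
    d_loss = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: update towards sequences the discriminator accepts as natural.
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```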

Cited by 247 publications (271 citation statements) | References 67 publications
“…Another potential trend is building DNNs using biophysical ( Tareen and Kinney, 2019 ) or physicochemical properties ( Yang et al, 2017 ; Liu et al, 2020 ), as deep models trained on these features might uncover novel patterns in data and lead to improved understanding of the physicochemical principles of protein-nucleic acid regulatory interactions, as well as aid model interpretability. Other novel approaches include: 1) modifying DNN properties to improve recovery of biologically meaningful motif representations ( Koo and Ploenzke, 2021 ), 2) transformer networks ( Devlin et al, 2018 ) and attention mechanisms ( Vaswani et al, 2017 ), widely used in protein sequence modeling ( Jurtz et al, 2017 ; Rao et al, 2019 ; Vig et al, 2020 ; Repecka et al, 2021 ), 3) graph convolutional neural networks, a class of DNNs that can work directly on graphs and take advantage of their structural information, with the potential to give us great insights if we can reframe genomics problems as graphs ( Cranmer et al, 2020 ; Strokach et al, 2020 ), and 4) generative modeling ( Foster, 2019 ), which may help exploit current knowledge in designing synthetic sequences with desired properties ( Killoran et al, 2017 ; Wang Y. et al, 2020 ). With the latter, unsupervised training is used with approaches including: 1) autoencoders, which learn efficient representations of the training data, typically for dimensionality reduction ( Way and Greene, 2018 ) or feature selection ( Xie et al, 2017 ), 2) generative adversarial networks, which learn to generate new data with the same statistics as the training set ( Wang Y. et al, 2020 ; Repecka et al, 2021 ), and 3) deep belief networks, which learn to probabilistically reconstruct their inputs, acting as feature detectors, and can be further trained with supervision to build efficient classifiers ( Bu et al, 2017 ).…”
Section: Discussion
confidence: 99%
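
The excerpt above distinguishes three unsupervised generative approaches. As a concrete anchor for the first of them, here is a minimal autoencoder of the kind it describes for dimensionality reduction; the input width, bottleneck size and layer shapes are illustrative assumptions.

```python
# Minimal autoencoder sketch: compress inputs to a low-dimensional embedding
# and train by reconstruction error. Sizes below are illustrative only.
import torch
import torch.nn as nn

IN_DIM, LATENT_DIM = 1000, 32  # e.g. 1000 input features -> 32-d embedding

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IN_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, IN_DIM))

    def forward(self, x):
        z = self.encoder(x)        # compressed representation
        return self.decoder(z), z  # reconstruction + embedding

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, IN_DIM)        # stand-in batch
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction error drives training
opt.zero_grad(); loss.backward(); opt.step()
```

The bottleneck `z` is the "efficient representation": once trained, the encoder alone serves as the dimensionality-reduction step.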
“…Despite the capability of RNNs to learn sequential information (e.g. multiplicity, relative order), they are computationally expensive to train and certain improvements to CNNs, such as dilation ( Yu and Koltun, 2015 ) and self-attention ( Wang et al, 2017 ; Bello et al, 2019 ; Repecka et al, 2021 ), enable them to outperform RNNs ( Gupta and Rush, 2017 ; Strubell et al, 2017 ; Trabelsi et al, 2019 ). Dilated convolution uses kernels with gaps to allow each kernel to capture information across a larger stretch of the input sequence, without incurring the increased cost of using RNNs ( Gupta and Rush, 2017 ; Strubell et al, 2017 ).…”
Section: Learning the Protein-DNA Interactions Initiating Gene Expression
confidence: 99%
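
The dilation argument in the excerpt is easy to make concrete: with gaps of size d, a kernel of width k spans d*(k-1)+1 input positions while still holding only k weights per channel. A short sketch (channel counts and lengths are illustrative):

```python
# Receptive-field effect of dilation on a 1-D convolution over a one-hot
# DNA sequence: same kernel size, much wider span.
import torch
import torch.nn as nn

x = torch.randn(1, 4, 1000)  # (batch, channels = A/C/G/T, sequence length)

dense   = nn.Conv1d(4, 32, kernel_size=3, dilation=1, padding=1)
dilated = nn.Conv1d(4, 32, kernel_size=3, dilation=4, padding=4)

# Both kernels hold 3 weights per channel, but the dilated one spans
# 1 + (3 - 1) * 4 = 9 input positions instead of 3. Stacking a few dilated
# layers covers long-range context without the training cost of RNNs.
print(dense(x).shape, dilated(x).shape)  # same output length with this padding
```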
“…After aligning the generations to a sequence with known structure in Fig 4be, we observed that the conserved positions in generated sequences correlate with ligand-binding and buried residues. Using previously published sequences and their experimentally measured assay data for CM 6 and MDH 52 proteins, we also evaluated the concordance of ProGen's model likelihood for these sequences with their relative activity, and compared it with the generative methods used in the original studies, bmDCA 6 and ProteinGAN 52 . Specifically, we measured per-token log-likelihoods for artificial sequences using ProGen (see Methods) and used them to predict whether artificial sequences should function, which showed an area under the curve (AUC) of 0.85, significantly better (p < 0.0001, two-tailed test, n = 1617) than bmDCA, which had an AUC of 0.78 (Fig.
Section: Letter
confidence: 99%
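
The scoring procedure this excerpt describes (mean per-token log-likelihood under an autoregressive model, then AUC against binary function labels) can be sketched as follows. The `lm` below is a random stand-in that conditions only on the previous token, not ProGen, and the sequences and labels are toy data.

```python
# Sketch: score sequences by mean per-token log-likelihood, then measure
# how well those scores rank active vs. inactive designs via ROC AUC.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

VOCAB = 21  # assumed amino-acid vocabulary size

# Stand-in autoregressive scorer; a real one would be a trained language model.
lm = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

def mean_loglik(tokens):
    """tokens: (L,) LongTensor; mean log p(x_t | x_<t) under `lm`."""
    logits = lm(tokens[:-1])                  # next-token logits, (L-1, VOCAB)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(1, tokens[1:, None]).mean().item()

seqs = [torch.randint(0, VOCAB, (50,)) for _ in range(100)]  # toy designs
labels = torch.randint(0, 2, (100,)).tolist()                # 1 = active (toy)
scores = [mean_loglik(s) for s in seqs]
print(roc_auc_score(labels, scores))  # ~0.5 here; 0.85 reported for ProGen
```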
“…After fine-tuning, we generated a set of 64k sequences using top-p sampling (p = 0.75) from the CM and MDH fine-tuned models, respectively. We measured concordance of our model's log-likelihoods with protein function data on CM and MDH sequences, and compared with bmDCA 6 and ProteinGAN 52 baselines, respectively. We computed the area under the curve (AUC) in receiver operating characteristic (ROC) curves for predicting binary function labels from model scores.…”
Section: Evaluating ProGen on Other Protein Systems
confidence: 99%
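
Top-p (nucleus) sampling, as used above with p = 0.75, keeps only the smallest set of tokens whose cumulative probability reaches p, renormalizes, and samples from that set. A single-step sketch (the 21-letter vocabulary is an assumption):

```python
# Nucleus (top-p) sampling for one decoding step.
import torch

def top_p_sample(logits, p=0.75):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # First position where cumulative mass reaches p closes the nucleus.
    cutoff = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return idx[torch.multinomial(nucleus, 1)].item()

token = top_p_sample(torch.randn(21))  # next amino-acid index
```

Lower p trims more of the distribution's low-probability tail, trading diversity for sequences the model considers more plausible.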
“…This probabilistic model can be sampled and the resulting sequences have been experimentally shown to often produce functional proteins, including enzymes [7,52]. Deep Learning (DL) models with various architectures have also been trained on sets of homologous sequences and used to generate active peptides [22], or enzymes [51] (with success rates similar to those of DCA [52]). The main limitation of these approaches, beyond the need for training data and risks of over-fitting, is that they reproduce existing functions/folds, which is rarely the sole aim of protein design.…”
Section: Discussion
confidence: 99%
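
The "probabilistic model" referred to in this excerpt is a DCA-style pairwise (Potts) model over aligned sequences, which can be sampled by Gibbs sweeps over positions. In the sketch below the fields `h` and couplings `J` are random stand-ins; a fitted model would supply values inferred from a family alignment.

```python
# Gibbs sampling from a pairwise (Potts/DCA-style) sequence model.
import numpy as np

L, Q = 60, 21                          # sequence length, alphabet size (assumed)
rng = np.random.default_rng(0)
h = rng.normal(0, 0.1, (L, Q))         # per-position fields (stand-in values)
J = rng.normal(0, 0.05, (L, L, Q, Q))  # pairwise couplings (stand-in values)
J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce J[i,j,a,b] == J[j,i,b,a]

def gibbs_sample(n_sweeps=200):
    seq = rng.integers(0, Q, L)
    for _ in range(n_sweeps):
        for i in range(L):
            # Conditional score of each amino acid at position i given the rest.
            e = h[i].copy()
            for j in range(L):
                if j != i:
                    e += J[i, j, :, seq[j]]
            p = np.exp(e - e.max())    # shift for numerical stability
            seq[i] = rng.choice(Q, p=p / p.sum())
    return seq

print(gibbs_sample())  # one sampled sequence as integer-encoded residues
```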