2021
DOI: 10.1038/s42256-021-00310-5

Expanding functional protein sequence spaces using generative adversarial networks

Abstract: De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, however, mapping protein sequence to protein function is neither computationally nor experimentally tractable 1,2 . Here we developed ProteinGAN, a specialised variant of the generative adversarial network 3 that is able to 'learn' natural protein sequence diversity and enables the generation of funct…
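
The abstract describes an adversarial setup: a generator proposes sequences while a discriminator learns to separate them from natural ones, and each improves against the other. Below is a minimal sketch of that training loop for fixed-length one-hot protein sequences; the sequence length, vocabulary size, layer widths and optimizer settings are illustrative assumptions, not ProteinGAN's actual architecture.

```python
# Minimal GAN sketch for fixed-length, one-hot-encoded protein sequences.
# Illustrative assumptions throughout: sizes and settings are NOT ProteinGAN's.
import torch
import torch.nn as nn

SEQ_LEN, N_AA, NOISE_DIM = 128, 21, 64  # 20 amino acids + 1 pad (assumed)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM, 512), nn.ReLU(),
            nn.Linear(512, SEQ_LEN * N_AA),
        )

    def forward(self, z):
        logits = self.net(z).view(-1, SEQ_LEN, N_AA)
        # Softmax relaxation keeps gradients flowing through the
        # otherwise discrete residue choices.
        return torch.softmax(logits, dim=-1)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(SEQ_LEN * N_AA, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # real-vs-generated logit
        )

    def forward(self, x):
        return self.net(x)

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (B, SEQ_LEN, N_AA) one-hot natural sequences
    b = real.size(0)
    fake = G(torch.randn(b, NOISE_DIM))
    # Discriminator: score natural sequences high, generated ones low.
    d_loss = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: update towards sequences the discriminator accepts as natural.
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```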

Cited by 247 publications (271 citation statements) | References 67 publications
“…Another potential trend is building DNNs using biophysical ( Tareen and Kinney, 2019 ) or physicochemical properties ( Yang et al, 2017 ; Liu et al, 2020 ), as deep models trained on these features might uncover novel patterns in data and lead to improved understanding of the physicochemical principles of protein-nucleic acid regulatory interactions, as well as aid model interpretability. Other novel approaches include: 1) modifying DNN properties to improve recovery of biologically meaningful motif representations ( Koo and Ploenzke, 2021 ), 2) transformer networks ( Devlin et al, 2018 ) and attention mechanisms ( Vaswani et al, 2017 ), widely used in protein sequence modeling ( Jurtz et al, 2017 ; Rao et al, 2019 ; Vig et al, 2020 ; Repecka et al, 2021 ), 3) graph convolutional neural networks, a class of DNNs that can work directly on graphs and take advantage of their structural information, with the potential to give us great insights if we can reframe genomics problems as graphs ( Cranmer et al, 2020 ; Strokach et al, 2020 ), and 4) generative modeling ( Foster, 2019 ), which may help exploit current knowledge in designing synthetic sequences with desired properties ( Killoran et al, 2017 ; Wang Y. et al, 2020 ). With the latter, unsupervised training is used with approaches including: 1) autoencoders, which learn efficient representations of the training data, typically for dimensionality reduction ( Way and Greene, 2018 ) or feature selection ( Xie et al, 2017 ), 2) generative adversarial networks, which learn to generate new data with the same statistics as the training set ( Wang Y. et al, 2020 ; Repecka et al, 2021 ), and 3) deep belief networks, which learn to probabilistically reconstruct their inputs, acting as feature detectors, and can be further trained with supervision to build efficient classifiers ( Bu et al, 2017 ).…”
Section: Discussion
confidence: 99%
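
The excerpt above distinguishes three unsupervised generative approaches. As a concrete anchor for the first of them, here is a minimal autoencoder of the kind it describes for dimensionality reduction; the input width, bottleneck size and layer shapes are illustrative assumptions.

```python
# Minimal autoencoder sketch: compress inputs to a low-dimensional embedding
# and train by reconstruction error. Sizes below are illustrative only.
import torch
import torch.nn as nn

IN_DIM, LATENT_DIM = 1000, 32  # e.g. 1000 input features -> 32-d embedding

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IN_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, IN_DIM))

    def forward(self, x):
        z = self.encoder(x)        # compressed representation
        return self.decoder(z), z  # reconstruction + embedding

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, IN_DIM)        # stand-in batch
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction error drives training
opt.zero_grad(); loss.backward(); opt.step()
```

The bottleneck `z` is the "efficient representation": once trained, the encoder alone serves as the dimensionality-reduction step.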
“…Despite the capability of RNNs to learn sequential information (e.g. multiplicity, relative order), they are computationally expensive to train and certain improvements to CNNs, such as dilation ( Yu and Koltun, 2015 ) and self-attention ( Wang et al, 2017 ; Bello et al, 2019 ; Repecka et al, 2021 ), enable them to outperform RNNs ( Gupta and Rush, 2017 ; Strubell et al, 2017 ; Trabelsi et al, 2019 ). Dilated convolution uses kernels with gaps to allow each kernel to capture information across a larger stretch of the input sequence, without incurring the increased cost of using RNNs ( Gupta and Rush, 2017 ; Strubell et al, 2017 ).…”
Section: Learning the Protein-DNA Interactions Initiating Gene Expression
confidence: 99%
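
The dilation argument in the excerpt is easy to make concrete: with gaps of size d, a kernel of width k spans d*(k-1)+1 input positions while still holding only k weights per channel. A short sketch (channel counts and lengths are illustrative):

```python
# Receptive-field effect of dilation on a 1-D convolution over a one-hot
# DNA sequence: same kernel size, much wider span.
import torch
import torch.nn as nn

x = torch.randn(1, 4, 1000)  # (batch, channels = A/C/G/T, sequence length)

dense   = nn.Conv1d(4, 32, kernel_size=3, dilation=1, padding=1)
dilated = nn.Conv1d(4, 32, kernel_size=3, dilation=4, padding=4)

# Both kernels hold 3 weights per channel, but the dilated one spans
# 1 + (3 - 1) * 4 = 9 input positions instead of 3. Stacking a few dilated
# layers covers long-range context without the training cost of RNNs.
print(dense(x).shape, dilated(x).shape)  # same output length with this padding
```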
“…After aligning the generations to a sequence with known structure in Fig 4be, we observed that the conserved positions in generated sequences correlate with ligand-binding and buried residues. Using previously published sequences and their experimentally measured assay data for CM 6 and MDH 52 proteins, we also evaluated the concordance of ProGen's model likelihood for these sequences with their relative activity, and compared it with the generative methods used in the original studies, bmDCA 6 and ProteinGAN 52 . Specifically, we measured per-token log-likelihoods for artificial sequences using ProGen (see Methods) and used them to predict whether artificial sequences should function, which showed an area under the curve (AUC) of 0.85, significantly better (p < 0.0001, two-tailed test, n = 1617) than bmDCA, which had an AUC of 0.78 (Fig.
Section: Letter
confidence: 99%
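
The scoring procedure this excerpt describes (mean per-token log-likelihood under an autoregressive model, then AUC against binary function labels) can be sketched as follows. The `lm` below is a random stand-in that conditions only on the previous token, not ProGen, and the sequences and labels are toy data.

```python
# Sketch: score sequences by mean per-token log-likelihood, then measure
# how well those scores rank active vs. inactive designs via ROC AUC.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

VOCAB = 21  # assumed amino-acid vocabulary size

# Stand-in autoregressive scorer; a real one would be a trained language model.
lm = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

def mean_loglik(tokens):
    """tokens: (L,) LongTensor; mean log p(x_t | x_<t) under `lm`."""
    logits = lm(tokens[:-1])                  # next-token logits, (L-1, VOCAB)
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(1, tokens[1:, None]).mean().item()

seqs = [torch.randint(0, VOCAB, (50,)) for _ in range(100)]  # toy designs
labels = torch.randint(0, 2, (100,)).tolist()                # 1 = active (toy)
scores = [mean_loglik(s) for s in seqs]
print(roc_auc_score(labels, scores))  # ~0.5 here; 0.85 reported for ProGen
```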
“…After fine-tuning, we generated a set of 64k sequences using top-p sampling (p = 0.75) from the CM and MDH fine-tuned models, respectively. We measured concordance of our model's log-likelihoods with protein function data on CM and MDH sequences, and compared with bmDCA 6 and ProteinGAN 52 baselines, respectively. We computed the area under the curve (AUC) in receiver operating characteristic (ROC) curves for predicting binary function labels from model scores.…”
Section: Evaluating ProGen on Other Protein Systems
confidence: 99%
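
Top-p (nucleus) sampling, as used above with p = 0.75, keeps only the smallest set of tokens whose cumulative probability reaches p, renormalizes, and samples from that set. A single-step sketch (the 21-letter vocabulary is an assumption):

```python
# Nucleus (top-p) sampling for one decoding step.
import torch

def top_p_sample(logits, p=0.75):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # First position where cumulative mass reaches p closes the nucleus.
    cutoff = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    return idx[torch.multinomial(nucleus, 1)].item()

token = top_p_sample(torch.randn(21))  # next amino-acid index
```

Lower p trims more of the distribution's low-probability tail, trading diversity for sequences the model considers more plausible.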
“…This probabilistic model can be sampled and the resulting sequences have been experimentally shown to often produce functional proteins, including enzymes [7,52]. Deep Learning (DL) models with various architectures have also been trained on sets of homologous sequences and used to generate active peptides [22], or enzymes [51] (with success rates similar to those of DCA [52]). The main limitation of these approaches, beyond the need for training data and risks of over-fitting, is that they reproduce existing functions/folds, which is rarely the sole aim of protein design.…”
Section: Discussion
confidence: 99%
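
The "probabilistic model" referred to in this excerpt is a DCA-style pairwise (Potts) model over aligned sequences, which can be sampled by Gibbs sweeps over positions. In the sketch below the fields `h` and couplings `J` are random stand-ins; a fitted model would supply values inferred from a family alignment.

```python
# Gibbs sampling from a pairwise (Potts/DCA-style) sequence model.
import numpy as np

L, Q = 60, 21                          # sequence length, alphabet size (assumed)
rng = np.random.default_rng(0)
h = rng.normal(0, 0.1, (L, Q))         # per-position fields (stand-in values)
J = rng.normal(0, 0.05, (L, L, Q, Q))  # pairwise couplings (stand-in values)
J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce J[i,j,a,b] == J[j,i,b,a]

def gibbs_sample(n_sweeps=200):
    seq = rng.integers(0, Q, L)
    for _ in range(n_sweeps):
        for i in range(L):
            # Conditional score of each amino acid at position i given the rest.
            e = h[i].copy()
            for j in range(L):
                if j != i:
                    e += J[i, j, :, seq[j]]
            p = np.exp(e - e.max())    # shift for numerical stability
            seq[i] = rng.choice(Q, p=p / p.sum())
    return seq

print(gibbs_sample())  # one sampled sequence as integer-encoded residues
```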