Generative Capacity of Probabilistic Protein Sequence Models

McGee, Francisco; Novinger, Quentin; Levy, Ronald M.; Carnevale, Vincenzo; Haldane, Allan

doi:10.21203/rs.3.rs-145189/v1

Cited by 12 publications

(20 citation statements)

References 49 publications


“…We especially thank Francisco McGee and Vincenzo Carnevale for providing generated samples from DeepSequence as in ref. 34. Our work was partially funded by the EU H2020 Research and Innovation Programme MSCA-RISE-2016 under Grant Agreement No.…”

Section: Acknowledgements

mentioning

confidence: 99%

“…It currently provides one of the best mutational-effect predictors, and we will show below that arDCA provides comparable quality of prediction for this specific task. The DeepSequence code has been modified in 34 to explore its capacities in generating artificial sequences being statistically indistinguishable from the natural MSA; it was shown that its performance was substantially less accurate than bmDCA. Another implementation of a VAE was reported in 35; also in this case the generative performances are inferior to bmDCA, but the organization of latent variables was shown to carry significant information on functionality.…”

mentioning

confidence: 99%

“…bmDCA was previously shown to be generative not only in a statistical sense, but also in a biological one: sequences generated by bmDCA were shown to be statistically indistinguishable from natural ones, and most importantly, functional in vivo for the case of chorismate mutase enzymes 20. We also compare the generative property of arDCA with DeepSequence 33,34 as a prominent representative of deep generative models.…”

mentioning

confidence: 99%

“…Section "Predicting mutational effects via in-silico deep mutational scanning", we apply the modification of ref. 34 allowing for sequence sampling. We observe that for most families, the two-point and three-point correlations of the natural data are significantly less well reproduced by DeepSequence than by both DCA implementations, confirming the original findings of ref.…”

mentioning

confidence: 99%

“…We observe that for most families, the two-point and three-point correlations of the natural data are significantly less well reproduced by DeepSequence than by both DCA implementations, confirming the original findings of ref. 34. Only in the largest family, PF00072 with more than 800,000 sequences, DeepSequence reaches comparable or, in the case of the three-point correlations, even superior performance.…”

mentioning

confidence: 99%
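The citation statements above compare models by how well generated samples reproduce the two-point correlations of the natural alignment. A minimal sketch of that kind of check follows, assuming the common definition of the connected correlation, C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b). The function names and the toy alignments are hypothetical illustrations, not code or data from any of the cited papers, and real analyses typically also reweight sequences and add pseudocounts, which this sketch omits.

```python
import math
from collections import Counter
from itertools import combinations


def connected_two_point(msa):
    """Map (i, j, a, b) -> f_ij(a, b) - f_i(a) * f_j(b) for an aligned list of strings."""
    n = len(msa)
    length = len(msa[0])
    # Single-site residue counts for every column.
    single = [Counter(seq[i] for seq in msa) for i in range(length)]
    corr = {}
    for i, j in combinations(range(length), 2):
        # Joint residue counts for the column pair (i, j).
        pair = Counter((seq[i], seq[j]) for seq in msa)
        for a in single[i]:
            for b in single[j]:
                f_ij = pair[(a, b)] / n
                corr[(i, j, a, b)] = f_ij - (single[i][a] / n) * (single[j][b] / n)
    return corr


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either has no variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def correlation_agreement(natural_msa, generated_msa):
    """Pearson correlation between natural and generated connected correlations."""
    c_nat = connected_two_point(natural_msa)
    c_gen = connected_two_point(generated_msa)
    # Entries missing from one alignment are treated as zero correlation.
    keys = sorted(set(c_nat) | set(c_gen))
    return pearson([c_nat.get(k, 0.0) for k in keys],
                   [c_gen.get(k, 0.0) for k in keys])


# Toy alignments (hypothetical): two columns, four sequences each.
natural = ["AC", "AC", "GT", "GT"]
generated = ["AC", "AC", "GT", "GC"]
print(correlation_agreement(natural, generated))
```

A score near 1 indicates that the generated sample reproduces the pairwise statistics of the natural alignment; the same pattern extends to three-point correlations by iterating over column triples.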


Efficient generative modeling of protein sequences using simple autoregressive models

et al. 2021


Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model’s entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
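The arithmetic behind the sequence-space estimate in this abstract can be checked with a short sketch. It assumes that a model entropy S (in nats) corresponds to roughly exp(S) typical sequences, and that the full space contains q^L sequences. The alignment length L = 112 and alphabet size q = 21 (20 amino acids plus a gap symbol) are assumptions chosen for illustration; the abstract states neither value, and the function names are hypothetical.

```python
import math


def log10_sequence_space(entropy_nats):
    """log10 of the effective number of sequences, exp(S), for an entropy S in nats."""
    return entropy_nats / math.log(10)


def log10_fraction(log10_n_functional, length, alphabet_size):
    """log10 of the share of all alphabet_size**length sequences that are functional."""
    return log10_n_functional - length * math.log10(alphabet_size)


# Figures quoted in the abstract: about 10^68 sequences, a fraction of about 10^-80.
log10_n = 68
# Together they imply a full sequence space of about 10^(68 + 80) = 10^148.
implied_log10_total = log10_n + 80

# ASSUMPTION: L = 112 aligned positions and q = 21 symbols, not stated in the abstract.
assumed_length, assumed_alphabet = 112, 21

print(implied_log10_total)
print(assumed_length * math.log10(assumed_alphabet))
print(log10_fraction(log10_n, assumed_length, assumed_alphabet))
# An entropy of 68 * ln(10) nats corresponds to 10^68 sequences.
print(log10_sequence_space(68 * math.log(10)))
```

Under these assumed values the full space comes out close to 10^148, which is consistent with the two figures quoted in the abstract.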
