2017
DOI: 10.48550/arXiv.1712.03346
Preprint

Variational auto-encoding of protein sequences

Abstract: Proteins are responsible for the most diverse set of functions in biology. The ability to extract information from protein sequences and to predict the effects of mutations is extremely valuable in many domains of biology and medicine. However, the mapping between protein sequence and function is complex and poorly understood. Here we present an embedding of natural protein sequences using a Variational Auto-Encoder and use it to predict how mutations affect protein function. We use this unsupervised approach t…
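The abstract describes the model only at a high level. As a rough illustration, the following is a minimal sketch of a variational auto-encoder over one-hot-encoded, aligned protein sequences, in the spirit of the paper; the alphabet size, sequence length, layer widths, and the ProteinVAE/elbo names are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal VAE sketch for aligned protein sequences (illustrative, not the
# authors' code). Inputs are one-hot tensors of shape (batch, SEQ_LEN, ALPHABET).
import torch
import torch.nn as nn
import torch.nn.functional as F

ALPHABET = 21    # assumed: 20 amino acids + alignment gap
SEQ_LEN = 100    # assumed aligned sequence length

class ProteinVAE(nn.Module):
    def __init__(self, latent_dim=16, hidden=256):
        super().__init__()
        d = SEQ_LEN * ALPHABET
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))

    def forward(self, x):
        h = self.enc(x.flatten(1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        logits = self.dec(z).view(-1, SEQ_LEN, ALPHABET)
        return logits, mu, logvar

def elbo(model, x):
    """Per-sequence evidence lower bound; higher = judged more 'natural'."""
    logits, mu, logvar = model(x)
    # Reconstruction log-likelihood under a per-position categorical decoder.
    rec = -F.cross_entropy(logits.transpose(1, 2), x.argmax(-1),
                           reduction="none").sum(-1)
    # KL divergence of the diagonal-Gaussian posterior from a standard normal.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
    return rec - kl
```

Training maximises the ELBO over natural sequences; a mutation can then be scored by the difference elbo(mutant) − elbo(wild type), with more negative values predicting more deleterious effects. This scoring rule matches the abstract's description in spirit, though the authors' architecture and training details may differ.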

Cited by 19 publications (22 citation statements, all classified as mentioning)
References 14 publications

“…While predictive accuracy could be improved by more computationally expensive simulations or by collecting more data for machine learning, improved variants can already be identified by sampling from a space predicted to be dense in higher-fitness variants. Nevertheless, full datasets collected with higher-throughput methods such as deep mutational scanning (36) serve as valuable test beds for validating the latest machine-learning algorithms for both regression (37,38) and design (39) that require more data.…”
Section: Discussion (mentioning)
Confidence: 99%
“…Schemes to embed biological sequences fall into several general categories, notably VAEs [75,76,47,77,78] and invertible generative models (e.g. Flows or real-NVPs) [79], as well as a plethora of increasingly promising models that have been adapted from natural language processing [80,8,48,81,82].…”
Section: Exploration on Compressed Representations of Sequences (mentioning)
Confidence: 99%
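For readers unfamiliar with the invertible generative models mentioned in this statement, the following is a minimal sketch of one real-NVP-style affine coupling layer; the class name, dimensions, and network widths are illustrative assumptions, not taken from any cited paper.

```python
# One affine coupling layer, the building block of real-NVP-style invertible
# models (illustrative sketch).
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # Conditioner network predicts a scale s and shift t for the second half.
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * s.exp() + t          # transform half the dims, keep the rest
        log_det = s.sum(-1)            # tractable log-Jacobian determinant
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        # Exact inverse: the untouched half reproduces s and t deterministically.
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * (-s).exp()], dim=-1)
```

Stacking such layers (swapping which half passes through unchanged) yields an invertible map with an exact, tractable likelihood, in contrast to the VAE's lower bound.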
“…We thank Chris Sander, Frank Poelwijk, David Duvenaud, Sam Sinai, Eric Kelsic and members of the Marks lab for helpful comments and discussions. While this work was in progress, Sinai et al. also reported on the use of variational autoencoders for protein sequences [84]. A.J.R.…”
Section: Acknowledgements (mentioning)
Confidence: 99%