The development of powerful natural language models has improved the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution and next-generation sequencing have allowed for the accumulation of large amounts of labelled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder, which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labelled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and green fluorescent protein. We observe a greater sequence optimization efficiency (increase in fitness per optimization step) using ReLSO compared with other approaches, where ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information.
Articles, Nature Machine Intelligence
An alternative to working in the sequence space is to learn a low-dimensional, semantically rich representation of peptides and proteins. These latent representations collectively form the latent space, which is easier to navigate. With this approach, a therapeutic candidate can be optimized using its latent representation, in a procedure called latent space optimization.
Here we propose ReLSO, a deep transformer-based approach to protein design, which combines the powerful encoding ability of a transformer model with a bottleneck that produces information-rich, low-dimensional latent representations. The latent space in ReLSO, besides being low dimensional, is regularized to be (1) smooth with respect to structure and fitness by way of fitness prediction from the latent space, (2) continuous and interpolatable between training data points and (3) pseudoconvex on the basis of negative sampling outside the data. This highly designed latent space enables optimization directly in latent space using gradient ascent on the fitness, which converges to an optimum that can then be decoded back into the sequence space.
Key contributions of ReLSO include the following.
Human spaceflight endeavors present an opportunity to expand our presence beyond Earth. To this end, it is crucial to understand and diagnose effects of long‐term space travel on the human body. Developing tools for targeted, on‐site detection of specific DNA sequences will allow us to establish research and diagnostics platforms that will benefit space programs. We describe a simple DNA diagnostic method that utilizes colorimetric loop‐mediated isothermal amplification (LAMP) to enable detection of a repetitive telomeric DNA sequence in as little as 30 minutes. A proof of concept assay for this method was carried out using existing hardware on the International Space Station and the results were read instantly by an astronaut through a simple color change of the reaction mixture. LAMP offers a novel platform for on‐orbit DNA‐based diagnostics that can be deployed on the International Space Station and to the broader benefit of space programs.
The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder that is trained to jointly generate sequences as well as predict fitness. Using ReLSO, we explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal.
Preprint. Under review.
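The gradient-based optimization within latent space mentioned in the abstract can be illustrated with a minimal sketch. This is not the ReLSO implementation: the fitness function here is a hypothetical concave toy surrogate with an analytic gradient, standing in for a learned fitness prediction head, and the names `toy_fitness`, `toy_fitness_grad` and `latent_gradient_ascent`, along with the step size and step count, are assumptions made for illustration only. Decoding the optimized latent point back to a sequence is omitted.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def toy_fitness(z: Sequence[float], optimum: Sequence[float]) -> float:
    """Hypothetical stand-in for a learned fitness head.

    A concave bowl whose single maximum (value 0) sits at `optimum`.
    """
    return -sum((zi - oi) ** 2 for zi, oi in zip(z, optimum))


def toy_fitness_grad(z: Sequence[float], optimum: Sequence[float]) -> Vector:
    """Analytic gradient of `toy_fitness` with respect to z."""
    return [-2.0 * (zi - oi) for zi, oi in zip(z, optimum)]


def latent_gradient_ascent(
    z0: Sequence[float],
    grad_fn: Callable[[Sequence[float]], Vector],
    step_size: float = 0.1,
    n_steps: int = 50,
) -> List[Vector]:
    """Run plain gradient ascent from z0 and return every visited point."""
    z = list(z0)
    trajectory = [list(z)]
    for _ in range(n_steps):
        grad = grad_fn(z)
        # Ascent: move along the gradient to increase predicted fitness.
        z = [zi + step_size * gi for zi, gi in zip(z, grad)]
        trajectory.append(list(z))
    return trajectory


if __name__ == "__main__":
    optimum = [1.0, -2.0]  # assumed location of the fitness peak
    path = latent_gradient_ascent(
        [0.0, 0.0], lambda z: toy_fitness_grad(z, optimum)
    )
    print("start fitness:", toy_fitness(path[0], optimum))
    print("final fitness:", toy_fitness(path[-1], optimum))
```

In a trained model the gradient would come from automatic differentiation through the fitness head rather than from a closed-form expression, and the final latent point would be passed to the decoder to obtain a candidate sequence.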