Motivation: Inferring properties of biological sequences, such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence, is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach which trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results: We demonstrate this DNN performs at or above state-of-the-art levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability: TensorFlow training code is available through GitHub ( https://github.com/tensorflow/models/tree/master/research ). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/).
AAVs hold tremendous promise as delivery vectors for clinical gene therapy. Yet the ability to design libraries comprising novel and diverse AAV capsids, while retaining the ability of the library to package DNA payloads, has remained challenging. Deep sequencing technologies allow millions of sequences to be assayed in parallel, enabling large-scale probing of fitness landscapes. Such data can be used to train supervised machine learning (ML) models that predict viral properties from sequence, without mechanistic knowledge. Herein, we leverage such models to rationally trade off library diversity against packaging capability. In particular, we show a proof-of-principle application of a general approach for ML-guided library design that allows the experimenter to rationally navigate the trade-off between sequence diversity and fitness of the library. Consequently, this approach, instantiated with an AAV capsid library designed for packaging, enables the selection of starting libraries that are more likely to yield success in downstream selections for therapeutics and beyond. We demonstrated this increased success by showing that the designed libraries are able to more easily infect primary human brain tissue. We expect that such ML-guided design of AAV libraries will have broad utility for the development of novel variants for therapeutic applications in the near future. One Sentence Summary: Computational, data-driven re-design of a state-of-the-art therapeutically relevant AAV initial library improves downstream selection for therapeutic uses.
Work completed as a member of the Google Brain Residency program (g.co/brainresidency). Motivation: Recently developed deep learning techniques have significantly improved the accuracy of various speech and image recognition systems. In this paper we show how to adapt some of these techniques to create a novel chained convolutional architecture with next-step conditioning for improving performance on protein sequence prediction problems. We explore its value by demonstrating its ability to improve performance on eight-class secondary structure prediction. Results: We first establish a state-of-the-art baseline by adapting recent advances in convolutional neural networks which were developed for vision tasks. This model achieves 70.0% per amino acid accuracy on the CB513 benchmark dataset without use of standard performance-boosting techniques such as ensembling or multitask learning. We then improve upon this state-of-the-art result using a novel chained prediction approach which frames the secondary structure prediction as a next-step prediction problem. This sequential model achieves 70.3% Q8 accuracy on CB513 with a single model; an ensemble of these models produces 71.4% Q8 accuracy on the same test set, improving upon the previous overall state of the art for the eight-class secondary structure problem. Availability: Our models are implemented using TensorFlow, an open-source machine learning software library available at TensorFlow.org. We aim to release the code for these experiments as part of the TensorFlow repository. An early version of this work is available on arXiv.org.
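The per amino acid (Q8) accuracy quoted above can be read as the fraction of residues whose predicted eight-class label matches the reference label. The following is a minimal sketch of that metric for a single sequence, not the authors' evaluation code: the function name `q8_accuracy` and the example label strings are hypothetical, and it assumes that labels are given as one character per residue. How the paper aggregates over the proteins in CB513 is not stated in the abstract.

```python
def q8_accuracy(predicted, actual):
    """Fraction of residues whose predicted class equals the reference class.

    Assumption: both arguments are equal-length strings (or sequences) with
    one secondary-structure label per residue.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the two label sequences agree.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical 8-residue example: one of the eight labels is wrong.
example_score = q8_accuracy("HHHHEEEC", "HHHGEEEC")
```

In this toy example seven of eight positions agree, so `example_score` is 0.875.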
Characterizing differences in biological sequences between two conditions using high-throughput sequencing data is a prevalent problem wherein we seek to (i) quantify differences in sequence abundances between conditions, and (ii) build predictive models to estimate such differences for unobserved sequences. A key shortcoming of current approaches is their extremely limited ability to share information across related but non-identical reads. Consequently, they cannot make effective use of sequencing data, nor can they be directly applied in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. MBE is based on sound theoretical principles, is easy to implement, and can trivially make use of advances in modern-day machine learning classification architectures or related innovations. We extensively evaluate MBE empirically, both in simulation and on real data. Overall, we find that our new approach improves accuracy compared to current ways of performing such differential analyses.
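To make the limitation described above concrete, here is a minimal sketch of a purely count-based log-enrichment estimate, the kind of per-sequence calculation that does not share information across related reads. It is an illustration only, not MBE and not the authors' code: the function name `log_enrichment`, the pseudocount scheme, and the toy reads are all assumptions.

```python
import math
from collections import Counter


def log_enrichment(pre_reads, post_reads, pseudocount=1.0):
    """Per-sequence log ratio of relative frequencies (post vs. pre).

    Assumption: each argument is a list of read strings, and a fixed
    pseudocount is added to every observed sequence in both conditions.
    """
    pre = Counter(pre_reads)
    post = Counter(post_reads)
    seqs = set(pre) | set(post)
    # Normalizers include the pseudocount mass so frequencies sum to one.
    n_pre = len(pre_reads) + pseudocount * len(seqs)
    n_post = len(post_reads) + pseudocount * len(seqs)
    return {
        s: math.log((post[s] + pseudocount) / n_post)
        - math.log((pre[s] + pseudocount) / n_pre)
        for s in seqs
    }


# Hypothetical toy data: ACGT becomes more abundant, ACGA less abundant.
pre_reads = ["ACGT"] * 4 + ["ACGA"] * 4
post_reads = ["ACGT"] * 6 + ["ACGA"] * 2
scores = log_enrichment(pre_reads, post_reads)
```

Each sequence is scored only from its own counts, so a sequence that never appears in either condition (for example a one-mismatch neighbor of `ACGT`) receives no estimate at all. That gap is the motivation the abstract gives for a model that predicts enrichment for unobserved sequences.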