2021
DOI: 10.1101/2021.11.08.467706
Preprint

ICOR: Improving codon optimization with recurrent neural networks

Abstract: In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the expression of the resulting protein. Codon optimization of synthetic DNA sequences is important for heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare cod…
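The abstract contrasts learned codon optimization with the frequency-only baseline it criticizes. For orientation, a minimal Python sketch of that baseline is given below; the codon-table subset and usage values are illustrative placeholders, not measured E. coli frequencies, and none of this code comes from the ICOR work itself.

# Frequency-only baseline: map every amino acid to its single most common
# codon in the host. Usage values below are illustrative, not real data.
ILLUSTRATIVE_USAGE = {
    "M": {"ATG": 1.00},
    "L": {"CTG": 0.50, "TTG": 0.13, "TTA": 0.13, "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
    "K": {"AAA": 0.76, "AAG": 0.24},
}

def naive_codon_optimize(protein: str) -> str:
    """Pick the most frequent synonymous codon for each residue."""
    return "".join(
        max(ILLUSTRATIVE_USAGE[aa], key=ILLUSTRATIVE_USAGE[aa].get)
        for aa in protein
    )

print(naive_codon_optimize("MLK"))  # ATGCTGAAA

This is exactly the strategy the abstract argues is insufficient, since it ignores context effects such as rare-codon usage.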

Cited by 6 publications (7 citation statements). References 34 publications.
“…One common approach for codon optimization is a text translation task, where a sentence written in the language of protein amino acids is translated to DNA nucleotides, conditioned on the identity of the host organism to maximize the natural profile of the CDS [39][40][41]. To this end, we repurposed the T5 language model architecture [43] by training on a dataset, named Protein2DNA, consisting of all protein coding and protein-CDS pairs from high-quality genomes of the Enterobacterales order (taxonomic identifier 91347) (Materials and methods).…”
Section: Results (mentioning; confidence: 99%)
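The quoted approach casts codon optimization as machine translation from amino acids to codons with a repurposed T5 model. The cited implementation and the Protein2DNA dataset are not reproduced here; the sketch below only illustrates the shape of one supervised training step for such a setup using the Hugging Face transformers library, assuming a generic t5-small checkpoint, a hypothetical prompt format, and space-separated codon targets.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical source/target formatting: amino acids in, codons out,
# with the host organism encoded in the prompt.
src = "translate protein to dna for taxon 91347: M L K"
tgt = "ATG CTG AAA"

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()  # an optimizer step would follow during fine-tuning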
“…Natural language models based on Deep Learning (DL) have emerged as powerful tools for interrogating complex context-dependent patterns in biological sequences across application domains [35][36][37][38]. Although there are recent examples of DL-enabled codon optimization [39][40][41], these studies do not incorporate expression level information. Here we show that language models are able to learn long-range patterns of codon usage and generate sequences mimicking natural codon usage profiles when trained on genome-scale CDS data.…”
Section: Introduction (mentioning; confidence: 99%)
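The claim above is about reproducing natural codon-usage profiles at the whole-sequence level. A small, self-contained way to compute such a profile and compare a generated sequence against a natural one is sketched below; it is not taken from the cited work.

from collections import Counter

def codon_usage_profile(cds: str) -> dict:
    """Relative frequency of each codon in an in-frame coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

natural = codon_usage_profile("ATGCTGAAACTGAAATAA")
generated = codon_usage_profile("ATGTTAAAGCTGAAATAA")

# Crude similarity measure: total variation distance between the profiles.
all_codons = set(natural) | set(generated)
tvd = 0.5 * sum(abs(natural.get(c, 0.0) - generated.get(c, 0.0)) for c in all_codons)
print(round(tvd, 3))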
“…As was noted hereinbefore, the RNN is widely recognized for its effectiveness in processing sequential data like text, time series, and speech. In [12], the use of LSTM-RNN architecture was explored for optimizing codon usage in protein sequences. The novel tool was developed to learn codon usage bias from a genomic dataset comprising over 7,000 Escherichia coli genes.…”
Section: B. Recurrent Neural Network (mentioning; confidence: 99%)
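The LSTM-RNN formulation described in [12] amounts to per-residue classification: each amino-acid position is assigned one of the possible codons. The PyTorch sketch below shows that general idea only; the layer sizes, the 64-way output, and the bidirectional choice are illustrative and do not reproduce the cited architecture or its training data.

import torch
import torch.nn as nn

class CodonTagger(nn.Module):
    """Bidirectional LSTM that emits one codon class per residue (sketch)."""
    def __init__(self, n_amino=21, n_codons=64, emb=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_amino, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_codons)

    def forward(self, residues):          # residues: (batch, length) integer ids
        out, _ = self.lstm(self.embed(residues))
        return self.head(out)             # (batch, length, n_codons) logits

model = CodonTagger()
dummy = torch.randint(0, 21, (2, 30))      # two random 30-residue "proteins"
logits = model(dummy)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 64), torch.randint(0, 64, (60,)))
loss.backward()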
“…Using recurrent neural networks can drastically reduce the simulation time of numerical solutions. RNNs are ideal for learning sequential data, due to the feedback loops, or connections within their layers [21,26], that provide an internal memory of past observations. In particular, RNNs with long short-term memory (LSTM) [27] layers can deal with longer sequences, where the sequential dependence of the data spans over a large number of observations, since these types of network avoid vanishing gradients problems [28].…”
Section: RNNs with LSTM Layers (mentioning; confidence: 99%)
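To make the surrogate-modelling idea in the quoted passage concrete, the following minimal LSTM predicts the next state of a trajectory from its history; feature and hidden sizes are placeholders and are not taken from the cited work.

import torch
import torch.nn as nn

class StepPredictor(nn.Module):
    """LSTM surrogate: read past steps, predict the next one (sketch)."""
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, history):            # history: (batch, steps, n_features)
        out, _ = self.lstm(history)
        return self.out(out[:, -1])        # predicted next step

model = StepPredictor()
history = torch.randn(4, 50, 8)            # 4 trajectories, 50 past steps each
next_step = model(history)                 # shape (4, 8)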
“…In this work, we develop a fully data-driven approach using conditional recurrent neural networks (RNN) [21] to predict the outcomes of nonlinear pulse propagation for a range of input-pulse parameters in different waveguide structures that simultaneously exhibit second- and third-order nonlinearities. The developed approach also allows for the training of a unified network to compute both the spectral and temporal evolution of a pulse via assigning the real and imaginary parts of the pulse complex-envelope as the network condition.…”
Section: Introduction - Machine Learning (mentioning; confidence: 99%)
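One common way to realize the "conditional" RNN described above is to concatenate a condition vector (for example, input-pulse parameters) to every time step of the input sequence. The PyTorch sketch below shows only that conditioning pattern, with purely illustrative shapes; it is not the cited network.

import torch
import torch.nn as nn

class ConditionalRNN(nn.Module):
    """LSTM whose input at every step is augmented with a condition vector."""
    def __init__(self, n_in=2, n_cond=3, hidden=64, n_out=2):
        super().__init__()
        self.lstm = nn.LSTM(n_in + n_cond, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_out)

    def forward(self, seq, cond):            # seq: (B, T, n_in), cond: (B, n_cond)
        cond_rep = cond.unsqueeze(1).expand(-1, seq.size(1), -1)
        out, _ = self.lstm(torch.cat([seq, cond_rep], dim=-1))
        return self.out(out)                 # per-step prediction: (B, T, n_out)

model = ConditionalRNN()
pred = model(torch.randn(4, 100, 2), torch.randn(4, 3))  # pred: (4, 100, 2)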