2021
DOI: 10.1101/2021.11.08.467706
Preprint

ICOR: Improving codon optimization with recurrent neural networks

Abstract: In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the expression of the resulting protein. Codon optimization of synthetic DNA sequences is important for heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare cod…
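The abstract contrasts learned codon optimization with the frequency-only baseline it criticizes. For orientation, a minimal Python sketch of that baseline is given below; the codon-table subset and usage values are illustrative placeholders, not measured E. coli frequencies, and none of this code comes from the ICOR work itself.

# Frequency-only baseline: map every amino acid to its single most common
# codon in the host. Usage values below are illustrative, not real data.
ILLUSTRATIVE_USAGE = {
    "M": {"ATG": 1.00},
    "L": {"CTG": 0.50, "TTG": 0.13, "TTA": 0.13, "CTT": 0.10, "CTC": 0.10, "CTA": 0.04},
    "K": {"AAA": 0.76, "AAG": 0.24},
}

def naive_codon_optimize(protein: str) -> str:
    """Pick the most frequent synonymous codon for each residue."""
    return "".join(
        max(ILLUSTRATIVE_USAGE[aa], key=ILLUSTRATIVE_USAGE[aa].get)
        for aa in protein
    )

print(naive_codon_optimize("MLK"))  # ATGCTGAAA

This is exactly the strategy the abstract argues is insufficient, since it ignores context effects such as rare-codon usage.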

Cited by 6 publications (7 citation statements). References 34 publications.
“…One common approach for codon optimization is a text translation task, where a sentence written in the language of protein amino acids is translated to DNA nucleotides, conditioned on the identity of the host organism to maximize the natural profile of the CDS [39][40][41]. To this end, we repurposed the T5 language model architecture [43] by training on a dataset, named Protein2DNA, consisting of all protein coding and protein-CDS pairs from high-quality genomes of the Enterobacterales order (taxonomic identifier 91347) (Materials and methods).…”
Section: Results (mentioning; confidence: 99%)
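The quoted approach casts codon optimization as machine translation from amino acids to codons with a repurposed T5 model. The cited implementation and the Protein2DNA dataset are not reproduced here; the sketch below only illustrates the shape of one supervised training step for such a setup using the Hugging Face transformers library, assuming a generic t5-small checkpoint, a hypothetical prompt format, and space-separated codon targets.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Hypothetical source/target formatting: amino acids in, codons out,
# with the host organism encoded in the prompt.
src = "translate protein to dna for taxon 91347: M L K"
tgt = "ATG CTG AAA"

inputs = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()  # an optimizer step would follow during fine-tuning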
“…Natural language models based on Deep Learning (DL) have emerged as powerful tools for interrogating complex context-dependent patterns in biological sequences across application domains [35][36][37][38]. Although there are recent examples of DL-enabled codon optimization [39][40][41], these studies do not incorporate expression level information. Here we show that language models are able to learn long-range patterns of codon usage and generate sequences mimicking natural codon usage profiles when trained on genome-scale CDS data.…”
Section: Introduction (mentioning; confidence: 99%)
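The claim above is about reproducing natural codon-usage profiles at the whole-sequence level. A small, self-contained way to compute such a profile and compare a generated sequence against a natural one is sketched below; it is not taken from the cited work.

from collections import Counter

def codon_usage_profile(cds: str) -> dict:
    """Relative frequency of each codon in an in-frame coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

natural = codon_usage_profile("ATGCTGAAACTGAAATAA")
generated = codon_usage_profile("ATGTTAAAGCTGAAATAA")

# Crude similarity measure: total variation distance between the profiles.
all_codons = set(natural) | set(generated)
tvd = 0.5 * sum(abs(natural.get(c, 0.0) - generated.get(c, 0.0)) for c in all_codons)
print(round(tvd, 3))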
“…As was noted hereinbefore, the RNN is widely recognized for its effectiveness in processing sequential data like text, time series, and speech. In [12], the use of LSTM-RNN architecture was explored for optimizing codon usage in protein sequences. The novel tool was developed to learn codon usage bias from a genomic dataset comprising over 7,000 Escherichia coli genes.…”
Section: B. Recurrent Neural Network (mentioning; confidence: 99%)
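The LSTM-RNN formulation described in [12] amounts to per-residue classification: each amino-acid position is assigned one of the possible codons. The PyTorch sketch below shows that general idea only; the layer sizes, the 64-way output, and the bidirectional choice are illustrative and do not reproduce the cited architecture or its training data.

import torch
import torch.nn as nn

class CodonTagger(nn.Module):
    """Bidirectional LSTM that emits one codon class per residue (sketch)."""
    def __init__(self, n_amino=21, n_codons=64, emb=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_amino, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_codons)

    def forward(self, residues):          # residues: (batch, length) integer ids
        out, _ = self.lstm(self.embed(residues))
        return self.head(out)             # (batch, length, n_codons) logits

model = CodonTagger()
dummy = torch.randint(0, 21, (2, 30))      # two random 30-residue "proteins"
logits = model(dummy)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 64), torch.randint(0, 64, (60,)))
loss.backward()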
“…Using recurrent neural networks can drastically reduce the simulation time of numerical solutions. RNNs are ideal for learning sequential data, due to the feedback loops, or connections within their layers [21,26], that provide an internal memory of past observations. In particular, RNNs with long short-term memory (LSTM) [27] layers can deal with longer sequences, where the sequential dependence of the data spans over a large number of observations, since these types of network avoid vanishing gradients problems [28].…”
Section: RNNs with LSTM Layers (mentioning; confidence: 99%)
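To make the surrogate-modelling idea in the quoted passage concrete, the following minimal LSTM predicts the next state of a trajectory from its history; feature and hidden sizes are placeholders and are not taken from the cited work.

import torch
import torch.nn as nn

class StepPredictor(nn.Module):
    """LSTM surrogate: read past steps, predict the next one (sketch)."""
    def __init__(self, n_features=8, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, history):            # history: (batch, steps, n_features)
        out, _ = self.lstm(history)
        return self.out(out[:, -1])        # predicted next step

model = StepPredictor()
history = torch.randn(4, 50, 8)            # 4 trajectories, 50 past steps each
next_step = model(history)                 # shape (4, 8)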
“…In this work, we develop a fully data-driven approach using conditional recurrent neural networks (RNN) [21] to predict the outcomes of nonlinear pulse propagation for a range of input-pulse parameters in different waveguide structures that simultaneously exhibit second- and third-order nonlinearities. The developed approach also allows for the training of a unified network to compute both the spectral and temporal evolution of a pulse via assigning the real and imaginary parts of the pulse complex-envelope as the network condition.…”
Section: Introduction - Machine Learning (mentioning; confidence: 99%)
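One common way to realize the "conditional" RNN described above is to concatenate a condition vector (for example, input-pulse parameters) to every time step of the input sequence. The PyTorch sketch below shows only that conditioning pattern, with purely illustrative shapes; it is not the cited network.

import torch
import torch.nn as nn

class ConditionalRNN(nn.Module):
    """LSTM whose input at every step is augmented with a condition vector."""
    def __init__(self, n_in=2, n_cond=3, hidden=64, n_out=2):
        super().__init__()
        self.lstm = nn.LSTM(n_in + n_cond, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_out)

    def forward(self, seq, cond):            # seq: (B, T, n_in), cond: (B, n_cond)
        cond_rep = cond.unsqueeze(1).expand(-1, seq.size(1), -1)
        out, _ = self.lstm(torch.cat([seq, cond_rep], dim=-1))
        return self.out(out)                 # per-step prediction: (B, T, n_out)

model = ConditionalRNN()
pred = model(torch.randn(4, 100, 2), torch.randn(4, 3))  # pred: (4, 100, 2)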