2020
DOI: 10.1093/nargab/lqz024

RNAsamba: neural network-based assessment of the protein-coding potential of RNA sequences

Abstract: The advent of high-throughput sequencing technologies has made it possible to obtain large volumes of genetic information quickly and inexpensively. Consequently, many efforts are devoted to unveiling the biological roles of genomic elements, and distinguishing protein-coding from long non-coding RNAs is one of the most important of these tasks. We describe RNAsamba, a tool to predict the coding potential of RNA molecules from sequence information using a neural network that models both the whole sequence and the ORF…

Citations: cited by 89 publications (74 citation statements)
References: 45 publications
“…LncRNA is unique from other RNA based on size (>199 nucleotides) and limited evidence of protein-coding potential (36,37,42–44). However, lncRNA's novel nature means there is no consensus on the best way to classify the protein-coding and non-coding potential of the lncRNA, so we use both logistic regression (45) and multilayered neural networks (46). The former implements human-designed features, such as open reading frame (ORF) length and integrity, GC content, and hexamer usage bias, whereas the latter identifies multilayered deep patterns solely on sequence information.…”
Section: Introduction (mentioning)
confidence: 99%
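The human-designed features named in the statement above (ORF length and integrity, GC content, hexamer usage) can be illustrated with a minimal Python sketch. This is not the implementation of any tool cited here; the function names `gc_content`, `longest_orf_length`, and `hexamer_counts` are hypothetical, only the forward strand is scanned, and the hexamer function returns raw counts rather than a bias score against a reference set.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_orf_length(seq):
    """Length in nucleotides of the longest complete ORF (ATG through stop codon).

    Assumption: only the three forward-strand frames are scanned, and an ORF
    counts only if it has both a start and an in-frame stop codon.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def hexamer_counts(seq):
    """Counts of every overlapping 6-mer; a usage-bias score would compare
    these against frequencies from known coding and non-coding sets."""
    seq = seq.upper().replace("U", "T")
    return Counter(seq[i:i + 6] for i in range(len(seq) - 5))
```

A feature-based classifier such as logistic regression would take values like these as its inputs, whereas a neural network approach would instead learn its own representation from the raw sequence.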
“…RNAsamba [31] is a novel neural network-based framework for predicting the coding potential of genomic sequences by assessing the ORF and other sequencing information. The performance of RNAsamba was evaluated on transcripts from a diverse set of five organisms, including human, M. musculus, D. rerio, D. melanogaster and Saccharomyces cerevisiae.…”
Section: Results (mentioning)
confidence: 99%
“…In this study, we employed an integrative approach that combined both computational and data-driven modelling approaches, which was a novel framework for investigating noncoding genes. In particular, we applied six state-of-the-art bioinformatic suites: CPAT [27], CPC2 [28, 29], LGC web server [30], CNIT, RNAsamba [31], and MiPepid [32] (Table 1) to classify more than 21,000 lncRNAs collected from the ZFLNC [33], Ensembl [34], NONCODE [35], and zflncRNApedia [36] databases.…”
Section: Methods (mentioning)
confidence: 99%
“…To our knowledge, this work also represents the first instance in which nucleotide sequence embeddings and transformer models are applied to the problem of building coding potential predictive models. Using nucleotide embeddings might be preferable to other representations like one-hot encoding [14] or integer encoding [6] used previously. This is because embeddings are learnt from the complete human genome and incorporate the context in which a given codon is found in the DNA.…”
Section: Discussion (mentioning)
confidence: 99%
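The two sequence representations contrasted in the statement above, one-hot encoding and integer encoding, can be sketched in plain Python as follows. This is an illustration only, not the encoding scheme of any cited tool; the names `integer_encode` and `one_hot_encode` are hypothetical, and mapping unknown bases such as N to 0 (or to an all-zero vector) is an assumption made for this example.

```python
NUCLEOTIDES = "ACGT"
# Known bases map to 1..4; 0 is reserved for unknown bases (assumption).
INDEX = {base: i + 1 for i, base in enumerate(NUCLEOTIDES)}


def _normalize(seq):
    """Upper-case the sequence and treat RNA 'U' as DNA 'T'."""
    return seq.upper().replace("U", "T")


def integer_encode(seq):
    """Map each base to a single integer: A=1, C=2, G=3, T/U=4, other=0."""
    return [INDEX.get(base, 0) for base in _normalize(seq)]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector in A, C, G, T order.

    Unknown bases become an all-zero vector.
    """
    return [
        [1 if INDEX.get(base) == column + 1 else 0 for column in range(4)]
        for base in _normalize(seq)
    ]
```

Both encodings treat each nucleotide independently of its neighbours, which is the limitation that learned, context-aware embeddings are meant to address.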
“…Several classical machine learning [23,21,38,41] and deep learning [14,2,6] based models, which focus on longer length nucleotide sequences as input, have also been developed to predict the coding potential of a given RNA. Most of these methods demonstrate very high prediction performance.…”
Section: Introduction (mentioning)
confidence: 99%