2019
DOI: 10.1016/j.dib.2019.104209

Genome-wide hairpins datasets of animals and plants for novel miRNA prediction

Abstract: This article makes available several genome-wide datasets, which can be used for training microRNA (miRNA) classifiers. The hairpin sequences available are from the genomes of: Homo sapiens, Arabidopsis thaliana, Anopheles gambiae, Caenorhabditis elegans and Drosophila melanogaster. Each dataset provides the genome data divided into sequences and a set of computed features for predictions. Each sequence has one label: i) “positive”: meaning that it is a well-known…

Cited by 6 publications (5 citation statements) · References 22 publications
“…If the window length is too long, many hairpins can be captured inside the same sequence, so the structural features become more complex and much more difficult for the classifier to recognize. These issues were discussed in detail in previous works (Bugnon et al., 2019; Yones et al., 2015). Thus, to prevent these adverse effects and to ensure that no important sequences are lost or inappropriately trimmed, the genome is cut into overlapping segments longer than the mean length of the pre-miRNAs of interest for the species under processing (in this case, viruses).…”
Section: Identifying Novel Pre-miRNAs in SARS-CoV-2
Confidence: 79%
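The overlapped-segmentation strategy described in this statement can be sketched in plain Python; the function and parameter names are illustrative, and in practice the window and overlap sizes would be derived from the mean pre-miRNA length of the species under study:

```python
def segment_genome(genome: str, window_len: int, overlap: int):
    """Cut a genome into overlapping windows so that any hairpin shorter
    than `overlap` nucleotides is fully contained in at least one window."""
    if overlap >= window_len:
        raise ValueError("overlap must be smaller than the window length")
    step = window_len - overlap
    segments = []
    for start in range(0, len(genome), step):
        segments.append((start, genome[start:start + window_len]))
        if start + window_len >= len(genome):
            break  # the last window already reaches the end of the genome
    return segments
```

With this scheme consecutive windows share `overlap` nucleotides, so a hairpin falling on a window boundary is still seen whole in the neighbouring window.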
“…The positive labeled samples (known pre-miRNAs) were downloaded from miRBase v22, retrieving 569 pre-miRNAs of viruses. A total of 73 structural features were extracted from the folded sequences of the virus genome and the well-known viral pre-miRNAs with miRNAfe (Yones et al., 2015), as in Bugnon et al. (2019), and normalized with z-score. The features include: sequence length, minimum free energy (MFE), cumulative size of the internal loops found in the secondary structure, number of loops, and absolute and relative GC content, among many others (detailed information in the Supplementary Material).…”
Section: Data Preparation and Performance Measures
Confidence: 99%
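The z-score normalization applied to the 73 structural features can be sketched as a column-wise transform in plain Python (the data below are illustrative, not the real miRNAfe output):

```python
import math

def zscore_normalize(rows):
    """Column-wise z-score: each feature gets zero mean and unit variance.
    `rows` is a list of feature vectors, one per hairpin sequence."""
    n = len(rows)
    n_feats = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_feats)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(n_feats)]
    # constant features (std = 0) are mapped to 0 instead of dividing by zero
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(n_feats)] for r in rows]
```

Normalizing per feature matters here because the features mix very different scales (sequence length in nucleotides, MFE in kcal/mol, GC content as a fraction).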
“…The major issue with using machine learning to detect pre-miRNAs is that the number of well-known pre-miRNAs is typically very small compared with the hundreds of thousands of candidate sequences in a genome, making this a highly class-imbalanced classification problem [17]. The H. sapiens genome is an example: it has 1710 well-known pre-miRNAs but over 400 million hairpin-like sequences, resulting in a 1:28128 imbalance [18]. ML algorithms are generally developed with balanced datasets, but in a supervised classifier, imbalanced data tend to produce a model biased towards the majority class, with low performance on the minority class, yielding false positives [19].…”
Section: Introduction
Confidence: 99%
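The imbalance ratio discussed above, and the inverse-frequency class weights often used to counter it, can be sketched as follows. This is plain Python with illustrative numbers; the weighting scheme is a common generic remedy for class imbalance, not necessarily the one used in the cited work:

```python
def imbalance_stats(n_pos: int, n_neg: int):
    """Return the negative:positive imbalance ratio and inverse-frequency
    class weights, normalized so an average sample has weight 1."""
    total = n_pos + n_neg
    ratio = n_neg / n_pos
    w_pos = total / (2 * n_pos)  # rare (positive) class gets a large weight
    w_neg = total / (2 * n_neg)  # majority class is down-weighted
    return ratio, w_pos, w_neg
```

For example, with 100 positives against 900 negatives the ratio is 1:9, and the positive class is weighted 5.0 versus about 0.56 for the negative class, so each class contributes the same total weight to the loss.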
“…It is actually used in most recent prediction models (Yones et al., 2017; Acar et al., 2018). Furthermore, it has already been used successfully to build 6 public genome-wide datasets (Bugnon et al., 2019). HextractoR helps to standardize and simplify the stem-loop extraction stage, making future prediction methods easier to use and their experiments fully reproducible.…”
Section: Introduction
Confidence: 99%