2021
DOI: 10.7717/peerj.11456

K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes

Abstract: Every day more plant genomes are available in public databases and additional massive sequencing projects (i.e., that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availabilit…

Citations: cited by 17 publications (6 citation statements)
References: 66 publications (95 reference statements)
“…Basically, a bunch of LTR-RT taken from InpactorDB [15] was randomly placed inside an entire DNA sequence with a fixed length of 50,000 bp. The nucleotides filling the space between one LTR-RT and another corresponded to sequences that are known to not contain LTR-RT (negative data set taken from [45], DOI: 10.5281/zenodo.4543904, see Methodology section). After the synthetic creation of DNA sequences, they were transformed into a one-hot 2D representation and they were used as features for training the CNN.…”
Section: Results
Citation type: mentioning
confidence: 99%
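
The one-hot 2D representation mentioned in this statement can be illustrated with a minimal sketch. The helper name `one_hot_2d`, the row order A, C, G, T, and the all-zero column for ambiguous bases are assumptions made for illustration; the quoted text does not specify them.

```python
# Hypothetical helper: the row order (A, C, G, T) and the handling of
# ambiguous bases such as N are assumptions, not details from the cited paper.
ALPHABET = "ACGT"


def one_hot_2d(sequence):
    """Return a 4 x len(sequence) matrix (a list of rows) of 0/1 values."""
    seq = sequence.upper()
    matrix = [[0] * len(seq) for _ in ALPHABET]
    for col, base in enumerate(seq):
        row = ALPHABET.find(base)
        if row >= 0:  # unknown symbols are left as an all-zero column
            matrix[row][col] = 1
    return matrix


# Print one row per nucleotide for a short toy sequence.
for base, row in zip(ALPHABET, one_hot_2d("ACGTN")):
    print(base, row)
```

Under these assumptions a 50,000 bp sequence becomes a 4 x 50,000 matrix, which is the kind of 2D input a CNN can consume.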
“…Create a synthetic DNA sequence of 50,000 bp by concatenating sequences known to not include any LTR-RT (i.e., coding sequences, different types of RNA like mRNA, tRNA, non-coding RNA, and other types of TEs such as TEs Class II) from [45], DOI: 10.5281/zenodo.4543904. These sequences are called “negative background”.…”
Section: Methods
Citation type: mentioning
confidence: 99%
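
A rough sketch of how such a synthetic sequence might be assembled is shown below. It is not the cited authors' code: the function names, the toy input sequences, and the strategy of drawing sorted random cut points to keep the inserted elements non-overlapping are all assumptions for illustration. Only the 50,000 bp target length comes from the quoted text.

```python
import random

TARGET_LENGTH = 50_000  # fixed length stated in the quoted text


def build_background(negatives, length, rng):
    """Concatenate randomly chosen negative sequences, then trim to `length`."""
    pool = [n for n in negatives if n]
    if not pool:
        raise ValueError("need at least one non-empty negative sequence")
    parts, total = [], 0
    while total < length:
        piece = rng.choice(pool)
        parts.append(piece)
        total += len(piece)
    return "".join(parts)[:length]


def place_elements(elements, negatives, length, rng):
    """Place elements at random, non-overlapping positions in a background.

    Returns the final sequence and the (start, end) interval of each element.
    """
    free = length - sum(len(e) for e in elements)
    if free < 0:
        raise ValueError("elements do not fit in the target length")
    background = build_background(negatives, free, rng)
    # Sorted random cut points split the background into filler segments.
    cuts = sorted(rng.randint(0, free) for _ in elements)
    parts, intervals, prev, pos = [], [], 0, 0
    for cut, element in zip(cuts, elements):
        filler = background[prev:cut]
        parts.append(filler)
        pos += len(filler)
        intervals.append((pos, pos + len(element)))
        parts.append(element)
        pos += len(element)
        prev = cut
    parts.append(background[prev:])
    return "".join(parts), intervals


rng = random.Random(0)
negatives = ["ACGTACGTAC", "GGGCCCAAATTT"]  # toy stand-ins for negative data
elements = ["T" * 300, "G" * 500]           # toy stand-ins for LTR-RT
sequence, intervals = place_elements(elements, negatives, TARGET_LENGTH, rng)
print(len(sequence), intervals)
```

Returning the intervals alongside the sequence keeps the element positions available as labels, which a detection model would need during training.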
“…Due to the categorical nature of genomic data, this activity is crucial to be able to use ML models [36]. K-mers frequencies were used as features using 1 ≤ k ≤ 6 due to this approach seems to be useful for machine learning algorithms [37]. To this converted data set, scaling and dimension reduction techniques were applied using principal component analysis (PCA) with an explained variance of 96% (reduction of the initial number of features from 5460 to 2254).…”
Section: Methods
Citation type: mentioning
confidence: 99%
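
The initial feature count of 5460 matches the number of possible DNA k-mers for 1 ≤ k ≤ 6 (4 + 16 + 64 + 256 + 1024 + 4096). The sketch below builds that vocabulary and a frequency vector for one sequence. The function names and the normalization by the number of sliding windows are assumptions; the quoted text does not say whether raw counts or relative frequencies were used, and the scaling and PCA steps are not shown.

```python
from itertools import product


def kmer_vocabulary(k_min=1, k_max=6, alphabet="ACGT"):
    """List every possible k-mer for k_min <= k <= k_max, in a fixed order."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(alphabet, repeat=k)
    ]


def kmer_frequencies(sequence, vocabulary, k_min=1, k_max=6):
    """Return one relative frequency per vocabulary entry.

    Assumption: each count is divided by the number of sliding windows of
    that k-mer length; k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(vocabulary, 0)
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    features = []
    for kmer in vocabulary:
        windows = max(len(seq) - len(kmer) + 1, 1)
        features.append(counts[kmer] / windows)
    return features


vocabulary = kmer_vocabulary()
features = kmer_frequencies("ACGTACGTTTGA", vocabulary)
print(len(vocabulary), len(features))
```

Each sequence then maps to a fixed-length numeric vector regardless of its own length, which is what makes the representation usable by standard ML models.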
“…As a simple and effective feature extraction method, K-mer has been widely used in a variety of prediction models [43,44]. The feature extraction principle of K-mer is to count the number of occurrences of k consecutive nucleotides in the RNA sequence.…”
Section: K Monomeric Units (K-mer)
Citation type: mentioning
confidence: 99%