TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome

Su, Weijia; Gu, Xun; Peterson, Thomas

doi:10.1016/j.molp.2019.02.008

Cited by 121 publications

(100 citation statements)

References 76 publications


“…These elements are highly abundant in eukaryotic genomes, and, as such, there are a large number of annotation programs designed to identify them. We tested P-MITE [31], a specialized database of curated plant MITEs, and IRF [50], TIR-Learner [17], and GRF (grf-main -c 0) (https://github.com/bioinfolabmu/GenericRepeatFinder), which structurally identify TIR elements, and finally MITE-Hunter [51], detectMITE [52], MUSTv2 [53], miteFinderII [54], MITE-Tracker [55], and GRF (grf-mite), which structurally identify MITEs specifically.…”

Section: Results (mentioning)

confidence: 99%

“…We found less than half of the novel TIR elements with novel TIRs had more than three copies in the rice genome (Figure 5D). This is because TIR candidates were not filtered based on copy number in TIR-Learner [17], given that some TEs may share similar TIRs but different internal regions (Figure 5D). Still, some of these could be contaminants such as LTR sequences.…”

Section: Results (mentioning)

confidence: 99%

“…For example, hAT elements typically have an 8-bp TSD, 12-28 bp terminal inverted repeat sequence (TIRs), and contain 5’-C/TA…TA/G-3’ terminal sequences. Each Class II superfamily has different structural features that need to be considered when TE annotation programs are being developed and deployed [16, 17]. Helitrons are a unique subclass of Class II elements that replicate through a rolling-circle mechanism and, as such, do not generate a TSD sequence and do not have TIRs, but do have a signature 5’-TC…CTRR-3’ terminal repeat sequence and frequently a short GC-rich stem-loop structure near the 3’ end of the element [16,18,19].…”

Section: Introduction (mentioning)

confidence: 99%
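
The hAT features quoted above (an 8-bp TSD and 12-28 bp TIRs) can be turned into a simple structural check. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, only perfect inverted repeats are counted, and the 5’-C/TA…TA/G-3’ terminal motif is not tested. It is not how TIR-Learner or any cited program works; real tools tolerate mismatches and use additional evidence.

```python
# Hypothetical helpers; thresholds follow the hAT figures quoted above.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_length(element, max_len=28):
    """Length of the perfect terminal inverted repeat, capped at max_len.

    The 5' end is compared base by base with the reverse complement of
    the 3' end; counting stops at the first mismatch (an assumption,
    since real TIRs are often imperfect).
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


def looks_like_hat(left_flank, element, right_flank,
                   tsd_len=8, min_tir=12, max_tir=28):
    """True if the flanks share a TSD and the element has a long enough TIR."""
    has_tsd = (
        len(left_flank) >= tsd_len
        and left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()
    )
    return has_tsd and tir_length(element, max_tir) >= min_tir


if __name__ == "__main__":
    tir = "CAGTGTTCCAAC"  # made-up 12-bp terminus, for illustration only
    candidate = tir + "A" * 10 + reverse_complement(tir)
    print(looks_like_hat("TTTGTCAAGCT", candidate, "GTCAAGCTCCC"))
```

The TSD is read from the last `tsd_len` bases of the left flank and the first `tsd_len` bases of the right flank, so the caller is assumed to supply flanking sequence of at least that length.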


Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline

Liao et al. 2019 (Preprint)

Sequencing technology and assembly algorithms have matured to the point that high-quality de novo assembly is possible for large, repetitive genomes. Current assemblies traverse transposable elements (TEs) and allow for annotation of TEs. There are numerous methods for each class of elements with unknown relative performance metrics. We benchmarked existing programs based on a curated library of rice TEs. Using the most robust programs, we created a comprehensive pipeline called Extensive de-novo TE Annotator (EDTA) that produces a condensed TE library for annotations of structurally intact and fragmented elements. EDTA is open-source and freely available: https://github.com/oushujun/EDTA.

Keywords: Transposable element; Annotation; Genome; Benchmarking; Pipeline

Long-read sequencing (e.g., PacBio and Oxford Nanopore) and assembly scaffolding (e.g., Hi-C and BioNano) techniques have progressed rapidly within the last few years. These innovations have been critical for high-quality assembly of the repetitive fraction of genomes. In fact, Ou et al. [8] demonstrated that the assembly contiguity of repetitive sequences in recent long-read assemblies is even better than traditional BAC-based reference genomes. With these developments, inexpensive and high-quality assembly of an entire genome is now possible.

Knowing where features (i.e., genes, TEs, etc.) exist in a genome assembly is important information for using these assemblies for biological findings. However, unlike the relatively straightforward and comprehensive pipelines established for gene annotation [9][10][11], current methods for TE annotation can be piecemeal, inaccurate, and are highly specific to classes of transposable elements.

Transposable elements fall into two major classes. Class I elements, also known as retrotransposons, use an RNA intermediate in their "copy and paste" mechanism of transposition [12]. Class I elements can be further divided into long terminal repeat (LTR) retrotransposons, as well as those that lack LTRs (non-LTRs), which include long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs). Structural features of these elements can facilitate automated de novo annotation in a genome assembly. For example, LTR elements have a 5-bp target site duplication (TSD), while non-LTRs have either variable length TSDs or lack TSDs entirely, being instead associated with deletion of flanking sequences upon insertion [13]. There are also standard terminal sequences associated with LTR elements (i.e., 5'-TG…C/G/TA-3' for LTR-Copia and 5'-TG…CA-3' for LTR-Gypsy elements), and non-LTRs often have a terminal poly-A tail at the 3' end of the element (see [14] for a complete description of structural features of each superfamily).

The second major class of TEs, Class II elements, also known as DNA transposons, use a DNA intermediate in their "cut and paste" mechanism of transposition [15]. As with Class I...
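
The LTR features named in this abstract (a 5-bp TSD and the quoted terminal motifs) can be expressed as a short check. This is a minimal Python sketch, assuming a candidate element and its flanking sequence have already been extracted as plain strings; the function names are hypothetical and are not part of EDTA or any benchmarked program.

```python
import re

# Terminal motifs as quoted in the abstract above:
# 5'-TG...CA-3' for LTR-Gypsy and 5'-TG...C/G/TA-3' for LTR-Copia.
LTR_TERMINI = {
    "LTR-Gypsy": re.compile(r"^TG.*CA$"),
    "LTR-Copia": re.compile(r"^TG.*[CGT]A$"),
}


def matching_ltr_termini(element):
    """Return the superfamily names whose terminal motif the element matches."""
    seq = element.upper()
    return [name for name, pattern in LTR_TERMINI.items() if pattern.match(seq)]


def has_tsd(left_flank, right_flank, tsd_len=5):
    """True if the bases flanking the element form a direct repeat of tsd_len."""
    if len(left_flank) < tsd_len or len(right_flank) < tsd_len:
        return False
    return left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(matching_ltr_termini("TGTTAGGGATA"))
    print(has_tsd("ACGTTGACC", "TGACCAAA"))
```

Because the quoted Gypsy motif is a special case of the quoted Copia motif, a terminal-sequence test alone cannot separate the two superfamilies; it only narrows the candidates.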


“…However, this approach can be difficult to implement for TEs and other repetitive sequences. Prior studies of several loci had suggested high levels of variation in TE content among maize haplotypes (Fu and Dooner; Yao et al.; Brunner et al.), and genomic level comparisons using whole‐genome assemblies have been limited to assessing annotated copy number per family without resolution at the level of individual TEs (Springer et al.; Su et al.). In this study we used an approach to assess the shared and non‐shared nature of individual TEs within collinear homologous blocks of four assembled maize genomes.…”

Section: Discussion (mentioning)

confidence: 97%

Transposable elements contribute to dynamic genome content in maize

et al. 2019


Transposable elements (TEs) are ubiquitous components of eukaryotic genomes and can create variation in genome organization and content. Most maize genomes are composed of TEs. We developed an approach to define shared and variable TE insertions across genome assemblies and applied this method to four maize genomes (B73, W22, Mo17 and PH207) with uniform structural annotations of TEs. Among these genomes we identified approximately 400 000 TEs that are polymorphic, encompassing 1.6 Gb of variable TE sequence. These polymorphic TEs include a combination of recent transposition events as well as deletions of older TEs. There are examples of polymorphic TEs within each of the superfamilies of TEs and they are found distributed across the genome, including in regions of recent shared ancestry among individuals. There are many examples of polymorphic TEs within or near maize genes. In addition, there are 2380 gene annotations in the B73 genome that are located within variable TEs, providing evidence for the role of TEs in contributing to the substantial differences in annotated gene content among these genotypes. TEs are highly variable in our survey of four temperate maize genomes, highlighting the major contribution of TEs in driving variation in genome organization and gene content.


“…There is much literature about applications of machine learning in bioinformatics (for example, reviewed in (Larrañaga et al, 2006)), showing improvements in many aspects such as genome annotation (Arango-López et al, 2017). In recent years, much bioinformatics software has been developed to detect TEs (Girgis, 2015) and, although they follow different strategies (such as homology-based, structure-based, de novo, and using comparative genomics), these lack sensitivity and specificity due to the polymorphic structures of TEs (Su, Gu & Peterson, 2019). Loureiro et al (Loureiro et al, 2013a) proved that ML could be used to improve the accuracy of TEs detection by combining results obtained by several conventional software and training a classifier using these results (Schietgat et al, 2018), (Loureiro et al, 2013b).…”

Section: Benefits Of ML Over Bioinformatics (Q1) (mentioning)

confidence: 99%
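
The idea of combining the results of several conventional programs can be illustrated with a simple vote. The sketch below is a stand-in, not the trained classifier described in the statement above: it keeps a candidate when at least `min_votes` tools report it. The tool names, the coordinate tuples, and the exact-match rule are assumptions for illustration; a real pipeline would compare overlapping intervals and learn the combination from labeled data.

```python
from collections import Counter


def majority_vote(calls_by_tool, min_votes=2):
    """Return the candidates reported by at least min_votes tools.

    calls_by_tool maps a tool name to an iterable of hashable candidates,
    here (sequence_id, start, end) tuples matched exactly.
    """
    votes = Counter()
    for candidates in calls_by_tool.values():
        # set() ensures each tool contributes at most one vote per candidate.
        votes.update(set(candidates))
    return {cand for cand, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    # Hypothetical tool outputs.
    calls = {
        "tool_a": [("chr1", 100, 900), ("chr1", 2000, 2600)],
        "tool_b": [("chr1", 100, 900)],
        "tool_c": [("chr1", 100, 900), ("chr1", 5000, 5400)],
    }
    print(majority_vote(calls))
```

Raising `min_votes` trades sensitivity for specificity, which is the same trade-off the statement above attributes to individual detection strategies.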

Peer Review #1 of "A systematic review of the application of machine learning in the detection and classification of transposable elements (v0.1)"

2019


Background. Transposable elements (TEs) constitute the most common repeated sequences in eukaryotic genomes. Recent studies demonstrated their deep impact on species diversity, adaptation to the environment, and diseases. Although there are many conventional bioinformatics algorithms for detecting and classifying TEs, none have achieved reliable results on different types of TEs. Machine learning (ML) techniques can automatically extract hidden patterns and novel information from labeled or non-labeled data and have been applied to solving several scientific problems. Methodology. We followed the Systematic Literature Review process, applying the six stages of the review protocol from it, but added a previous stage, which aims to detect the need for a review. Then search equations were formulated and executed in several literature databases. Relevant publications were scanned and used to extract evidence to answer research questions. Results. Several ML approaches have already been tested on other bioinformatics problems with promising results, yet there are few algorithms and architectures available in literature focused specifically on TEs, despite representing the majority of the nuclear DNA of many organisms. Only 35 papers were found and categorized as relevant in TE or related fields. Conclusions. ML is a powerful tool that can be used to address many problems. Although ML techniques have been used widely in other biological tasks, their utilization in TE analyses is still limited. Following the Systematic Literature Review process, it was possible to notice that the use of ML for TE analyses (detection and classification) is an open problem, and this new field of research is growing in interest.

