The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
Metabarcoding has the potential to become a rapid, sensitive, and effective approach for identifying species in complex environmental samples. Accurate molecular identification of species depends on the ability to generate operational taxonomic units (OTUs) that correspond to biological species. Due to the sometimes enormous estimates of biodiversity using this method, there is a great need to test the efficacy of data analysis methods used to derive OTUs. Here, we evaluate the performance of various methods for clustering length-variable 18S amplicons from complex samples into OTUs using a mock community and a natural community of zooplankton species. We compare analytic procedures consisting of a combination of (1) stringent and relaxed data filtering, (2) singleton sequences included and removed, (3) three commonly used clustering algorithms (mothur, UCLUST, and UPARSE), and (4) three methods of treating alignment gaps when calculating sequence divergence. Depending on the combination of methods used, the number of OTUs varied by nearly two orders of magnitude for the mock community (60–5068 OTUs) and three orders of magnitude for the natural community (22–22191 OTUs). The use of relaxed filtering and the inclusion of singletons greatly inflated OTU numbers without increasing the ability to recover species. Our results also suggest that the method used to treat gaps when calculating sequence divergence can have a great impact on the number of OTUs. Our findings are particularly relevant to studies that cover taxonomically diverse species and employ markers such as rRNA genes in which length variation is extensive.
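To make the point about gap treatment concrete, the sketch below computes pairwise divergence between two aligned sequences under three gap conventions. The abstract does not name its three treatments, so the modes here (`ignore`, `each`, `one`) are assumptions chosen for illustration, and the function name and toy sequences are hypothetical rather than taken from the study.

```python
def pairwise_divergence(seq_a: str, seq_b: str, gap_mode: str = "ignore") -> float:
    """Proportion of differing positions between two aligned sequences.

    gap_mode (hypothetical names, for illustration only):
      "ignore" - columns with a gap in either sequence are skipped
      "each"   - every gapped column counts as one difference
      "one"    - a contiguous run of gapped columns counts as one difference
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if gap_mode not in {"ignore", "each", "one"}:
        raise ValueError(f"unknown gap_mode: {gap_mode}")

    differences = 0
    compared = 0
    in_gap_run = False  # True while walking through a run of gapped columns

    for a, b in zip(seq_a.upper(), seq_b.upper()):
        a_gap, b_gap = a == "-", b == "-"
        if a_gap and b_gap:
            continue  # gap in both sequences carries no information
        if a_gap or b_gap:
            if gap_mode == "ignore":
                continue
            if gap_mode == "one" and in_gap_run:
                continue  # this run was already counted once
            in_gap_run = True
            differences += 1
            compared += 1
        else:
            in_gap_run = False
            compared += 1
            if a != b:
                differences += 1

    return differences / compared if compared else 0.0


# Toy aligned pair: one two-base indel plus one substitution.
seq_a = "ACGTACGT"
seq_b = "AC--ACGA"
for mode in ("ignore", "one", "each"):
    print(mode, round(pairwise_divergence(seq_a, seq_b, mode), 4))
```

In this toy pair the same two-base indel yields a divergence of 1/6, 2/7, or 3/8 depending on the convention. For simplicity the sketch treats adjacent gaps in opposite sequences as a single run, which a production tool may handle differently.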
Understanding the rates, spectra, and fitness effects of spontaneous mutations is fundamental to answering key questions in evolution, molecular biology, disease genetics, and conservation biology. To estimate mutation rates and evaluate the effect of selection on new mutations, we propagated mutation accumulation (MA) lines of Daphnia pulex for more than 82 generations and maintained a non-MA population under conditions where selection could act. Both experiments were started with the same obligate asexual progenitor clone. By sequencing 30 genomes and implementing a series of validation steps that informed the bioinformatic analyses, we identified a total of 477 single nucleotide mutations (SNMs) in the MA lines, corresponding to a mutation rate of 2.30 × 10⁻⁹ (95% CI 1.90-2.70 × 10⁻⁹) per nucleotide per generation. The high overall rate of loss of heterozygosity (LOH) of 4.82 × 10⁻⁸ per site per generation was due to a large ameiotic recombination event spanning an entire arm of a chromosome (∼6 Mb) and several hemizygous deletion events spanning ∼2 kb each. In the non-MA population, we found significantly fewer mutations than expected based on the rate derived from the MA experiment, indicating purifying selection was likely acting to remove new deleterious mutations. We observed a surprisingly high level of genetic variability in the non-MA population, which we propose to be driven by balancing selection. Our findings suggest that both positive and negative selection on new mutations is powerful and effective in a strictly clonal population.
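As a rough illustration of how a per-nucleotide, per-generation rate is obtained from MA data, the sketch below applies a commonly used pooled estimator: total mutations divided by the sum over lines of callable sites times generations. This is an assumption about the general approach, not the authors' exact procedure. The abstract does not report the callable sites or per-line generation counts, so the inputs below are made-up placeholders, and conventions such as doubling the site count for diploid genomes vary between studies.

```python
import math
from typing import Sequence


def pooled_mutation_rate(
    n_mutations: int,
    callable_sites_per_line: Sequence[int],
    generations_per_line: Sequence[int],
) -> float:
    """Mutations per nucleotide per generation, pooled across MA lines."""
    if len(callable_sites_per_line) != len(generations_per_line):
        raise ValueError("need one site count and one generation count per line")
    # Total opportunity for mutation: sites surveyed times generations elapsed.
    site_generations = sum(
        sites * gens
        for sites, gens in zip(callable_sites_per_line, generations_per_line)
    )
    if site_generations <= 0:
        raise ValueError("total site-generations must be positive")
    return n_mutations / site_generations


# Hypothetical inputs: 3 lines, 1,000,000 callable sites each,
# 100 generations each, 6 mutations observed in total.
rate = pooled_mutation_rate(6, [1_000_000] * 3, [100] * 3)
print(f"{rate:.2e}")
```

Comparing the number of mutations observed in a population under selection with the number this rate would predict for the same site-generations is the kind of contrast the abstract uses to infer purifying selection.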
A long-standing evolutionary puzzle is that all eukaryotic genomes contain large amounts of tandemly-repeated DNA whose sequence motifs and abundance vary greatly among even closely related species. To elucidate the evolutionary forces governing tandem repeat dynamics, quantification of the rates and patterns of mutations in repeat copy number and tests of its selective neutrality are necessary. Here, we used whole-genome sequences of 28 mutation accumulation (MA) lines of Daphnia pulex, in addition to six isolates from a non-MA population originating from the same progenitor, to both estimate mutation rates of abundances of repeat sequences and evaluate the selective regime acting upon them. We found that mutation rates of individual repeats were both high and highly variable, ranging from additions/deletions of 0.29-105 copies per generation (reflecting changes of 0.12-0.80% per generation). Our results also provide evidence that new repeat sequences are often formed from existing ones. The non-MA population isolates showed a signal of either purifying or stabilizing selection, with 33% lower variation in repeat copy number on average than the MA lines, although the level of selective constraint was not evenly distributed across all repeats. The changes between many pairs of repeats were correlated, and the pattern of correlations was significantly different between the MA lines and the non-MA population. Our study demonstrates that tandem repeats can experience extremely rapid evolution in copy number, which can lead to high levels of divergence in genome-wide repeat composition between closely related species.