Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space. Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec. Contact: lordec@lirmm.fr. Supplementary information: Supplementary data are available at Bioinformatics online.
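The graph-based correction idea described above can be illustrated with a small sketch. This is not LoRDEC's implementation: it assumes a plain Python set of k-mers in place of a succinct de Bruijn graph, and it accepts the first bridging path found by a bounded depth-first search, whereas a real tool would compare candidate paths against the read. The function names, the `min_count` threshold and the `max_extra` length allowance are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_dbg(short_reads, k, min_count=1):
    """Return the set of 'solid' k-mers: those seen at least min_count times.

    The set implicitly defines a de Bruijn graph (assumption: no succinct
    encoding, unlike the tool described in the abstract).
    """
    counts = Counter(
        read[i:i + k]
        for read in short_reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


def successors(kmer, solid):
    """Solid k-mers that overlap kmer by k-1 bases (outgoing graph edges)."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in solid]


def find_path(source, target, solid, max_len):
    """Bounded depth-first search; returns the spelled sequence or None."""
    stack = [(source, source)]
    while stack:
        kmer, seq = stack.pop()
        if kmer == target:
            return seq
        if len(seq) >= max_len:
            continue  # give up on paths much longer than the region
        for nxt in successors(kmer, solid):
            stack.append((nxt, seq + nxt[-1]))
    return None


def correct_read(long_read, solid, k, max_extra=5):
    """Replace each weak region flanked by solid k-mers with a graph path."""
    kmers = [long_read[i:i + k] for i in range(len(long_read) - k + 1)]
    is_solid = [km in solid for km in kmers]
    out, pos, i = [], 0, 0
    while i < len(kmers):
        if is_solid[i] and i + 1 < len(kmers) and not is_solid[i + 1]:
            j = i + 1
            while j < len(kmers) and not is_solid[j]:
                j += 1
            if j < len(kmers):  # weak region is flanked on both sides
                limit = (j + k - i) + max_extra
                path = find_path(kmers[i], kmers[j], solid, limit)
                if path is not None:
                    out.append(long_read[pos:i])
                    out.append(path[:-k])  # target k-mer is copied later
                    pos = j
            i = j
        else:
            i += 1
    out.append(long_read[pos:])
    return "".join(out)


# Toy data: two accurate short reads and a long read with one inserted base.
short_reads = ["ATGCGTACCA", "GTACCAGTTAG"]
solid = build_dbg(short_reads, k=4)
noisy = "ATGCGTAGCCAGTTAG"
print(correct_read(noisy, solid, k=4))  # prints ATGCGTACCAGTTAG
```

Weak regions at the very start or end of a read have only one solid flank and are left untouched in this sketch.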
In metazoans, thousands of DNA replication origins (Oris) are activated at each cell cycle. Their genomic organization and their genetic nature remain elusive. Here, we characterized Oris by nascent strand (NS) purification and a genome-wide analysis in Drosophila and mouse cells. We show that in both species most CpG islands (CGI) contain Oris, although methylation is nearly absent in Drosophila, indicating that this epigenetic mark is not crucial for defining the activated origin. Initiation of DNA synthesis starts at the borders of CGI, resulting in a striking bimodal distribution of NS, suggestive of a dual initiation event. Oris contain a unique nucleotide skew around NS peaks, characterized by G/T and C/A overrepresentation at the 5′ and 3′ of Ori sites, respectively. Repeated GC-rich elements were detected, which are good predictors of Oris, suggesting that common sequence features are part of metazoan Oris. In the heterochromatic chromosome 4 of Drosophila, Oris correlated with HP1 binding sites. At the chromosome level, regions rich in Oris are early replicating, whereas Ori-poor regions are late replicating. The genome-wide analysis was coupled with a DNA combing analysis to unravel the organization of Oris. The results indicate that Oris are in a large excess, but their activation does not occur at random. They are organized in groups of site-specific but flexible origins that define replicons, where a single origin is activated in each replicon. This organization provides both site specificity and Ori firing flexibility in each replicon, allowing possible adaptation to environmental cues and cell fates.
A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or an ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on an ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE on two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
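To make the notion of an overlap graph concrete, the following is a minimal sketch under simplifying assumptions. It is not SAVAGE's method: edges here require an exact suffix-prefix match of at least `min_len` bases, standing in for the statistical criteria mentioned in the abstract, and overlaps are found by brute-force pairwise comparison instead of an FM-index. The function names and the `min_len` parameter are hypothetical.

```python
def overlap_len(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists
    (assumes min_len >= 1).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len):
    """Map each read index to a list of (successor index, overlap length)."""
    graph = {i: [] for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_len(a, b, min_len)
                if n:
                    graph[i].append((j, n))
    return graph


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_len=3)
print(graph)  # prints {0: [(1, 4)], 1: [(2, 4)], 2: []}
```

The all-pairs loop is quadratic in the number of reads, which is acceptable for a toy example but is exactly the cost that index-based overlap detection is meant to avoid on deep-coverage data.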