EDAR: An Efficient Error Detection and Removal Algorithm for Next Generation Sequencing Data

Zhao, Xiaohong; Palmer, Lance E.; Bolanos, Randall; Mircean, Cristian; Fasulo, Daniel; Wittenberg, Gayle M.

doi:10.1089/cmb.2010.0127

Cited by 34 publications

(36 citation statements)

References 22 publications


“…In stage 1, the set of k-mers (substring of fixed length k) of reads from the processed data set is calculated and the distribution of frequencies of k-mers is analyzed (31). It was previously observed that the frequencies of erroneous and correct k-mers follow different distributions (32)(33)(34). Based on this fact, the error threshold is calculated as the minimal frequency of k-mers separating two different distributions.…”

Section: Methods (mentioning)

confidence: 99%
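
The thresholding idea in the statement above can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation. The function names `kmer_counts` and `error_threshold` are hypothetical. The "first valley of the frequency histogram" rule is an assumed stand-in for however the two distributions are actually separated.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def error_threshold(counts):
    """Return the first frequency at which the k-mer histogram stops falling.

    Assumption: erroneous k-mers pile up at very low frequencies and correct
    k-mers form a second peak at higher frequencies, so the first valley
    between them is used as the cut-off. Returns None if no valley exists.
    """
    # hist maps a frequency to the number of distinct k-mers seen that often.
    hist = Counter(counts.values())
    for f in range(1, max(hist)):
        if hist.get(f, 0) <= hist.get(f + 1, 0):
            return f
    return None


def suspect_kmers(counts, threshold):
    """K-mers whose frequency lies below the threshold (likely erroneous)."""
    return {kmer for kmer, c in counts.items() if c < threshold}
```

With six identical reads plus one read carrying a single substitution, the k-mers that overlap the substitution occur once, the remaining k-mers occur six or seven times, and the valley rule places the cut-off between the two groups.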

Analysis of the Evolution and Structure of a Complex Intrahost Viral Population in Chronic Hepatitis C Virus Mapped by Ultradeep Pyrosequencing

Palmer, Dimitrova, Skums, et al. 2014. J Virol


Hepatitis C virus (HCV) causes chronic infection in up to 50% to 80% of infected individuals. Hypervariable region 1 (HVR1) variability is frequently studied to gain an insight into the mechanisms of HCV adaptation during chronic infection, but the changes to and persistence of HCV subpopulations during intrahost evolution are poorly understood. In this study, we used ultradeep pyrosequencing (UDPS) to map the viral heterogeneity of a single patient over 9.6 years of chronic HCV genotype 4a infection. Informed error correction of the raw UDPS data was performed using a temporally matched clonal data set. The resultant data set reported the detection of low-frequency recombinants throughout the study period, implying that recombination is an active mechanism through which HCV can explore novel sequence space. The data indicate that polyvirus infection of hepatocytes has occurred but that the fitness quotients of recombinant daughter virions are too low for the daughter virions to compete against the parental genomes. The subpopulations of parental genomes contributing to the recombination events highlighted a dynamic virome where subpopulations of variants are in competition. In addition, we provide direct evidence that demonstrates the growth of subdominant populations to dominance in the absence of a detectable humoral response.

IMPORTANCE: Analysis of ultradeep pyrosequencing data sets derived from virus amplicons frequently relies on software tools that are not optimized for amplicon analysis, assume random incorporation of sequencing errors, and are focused on achieving higher specificity at the expense of sensitivity. Such analysis is further complicated by the presence of hypervariable regions. In this study, we made use of a temporally matched reference sequence data set to inform error correction algorithms. Using this methodology, we were able to (i) detect multiple instances of hepatitis C virus intrasubtype recombination at the E1/E2 junction (a phenomenon rarely reported in the literature) and (ii) interrogate the longitudinal quasispecies complexity of the virome. Parallel to the UDPS, isolation of IgG-bound virions was found to coincide with the collapse of specific viral subpopulations.


“…Available quality control software allow the user to completely remove these duplicates (FASTX-toolkit) or mark them for downstream analysis consideration (PICARD). Recently various algorithms utilizing suffix tree data structures were developed for sequencing error correction (Kelley et al, 2010; Zhao et al, 2010). A common procedure in the pre-analysis process, following initial quality control, and prior to sequence duplication removal, is the compulsory tag/adapter removal (Lassmann et al, 2009; Schmieder et al, 2010) and optional quality trimming.…”

Section: Pre-analysis Processing (mentioning)

confidence: 99%

Deep Sequencing Data Analysis: Challenges and Solutions

Isakov, Shomron. 2011. Bioinformatics - Trends and Methodologies


“…[99,11] On the other hand, the exact error rate for real data can only be estimated. [53] Gain/Specificity/Sensitivity…”

Section: Methods (mentioning)

confidence: 99%

“…EDAR [99] removes low quality reads and, from the remaining data, calculates the coverage for all possible k-mers. Using the variable bandwidth mean-shift method [100] for each read, EDAR clusters the k-mers and set each cluster as erroneous or correct using a threshold derived from the normalized distribution of the coverage.…”

Section: K-spectrum Based (Ksb) (mentioning)

confidence: 99%
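
The per-read clustering step described in the statement above can be illustrated with a simplified one-dimensional mean shift. This sketch makes several assumptions that are not in the source: it uses a fixed bandwidth and a flat kernel rather than the variable-bandwidth method the statement mentions, the threshold is passed in rather than derived from a normalized coverage distribution, and the names `mean_shift_1d` and `label_kmers` are hypothetical.

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Shift each value to the mean of its neighbours until it stops moving.

    Simplification: fixed bandwidth, flat kernel. Values that converge to
    the same mode belong to the same cluster.
    """
    modes = []
    for v in values:
        x = float(v)
        for _ in range(max_iter):
            window = [u for u in values if abs(u - x) <= bandwidth]
            if not window:  # defensive guard; not expected with a flat kernel
                break
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes


def label_kmers(coverages, bandwidth, threshold):
    """Label each k-mer of a read by the coverage mode its cluster settles on."""
    modes = mean_shift_1d(coverages, bandwidth)
    return ["error" if m < threshold else "ok" for m in modes]
```

Here `coverages` holds the coverage of each consecutive k-mer along one read. A run of k-mers whose coverage collapses toward a low mode is labelled as erroneous as a group, rather than each k-mer being judged in isolation.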

“…When dealing with very long reads like Pacific Biosciences with a high percentage of errors [71], along with considering trimming their ends, an additional approach is to split the long reads into smaller, high quality segments. The authors of [99] do not correct the spurious bases, opting instead to split the read at the locus of the error and to remove the faulty base. They argue that for a further assembly step based on k-mers such as Velvet [196], their method should not pose any problems.…”

Section: Read Trimming and Splitting (mentioning)

confidence: 99%
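
The split-and-remove strategy described in the statement above is easy to sketch. The following is an illustration only: the function name `split_read` and the `min_len` filter are assumptions, and the error positions are taken as given rather than detected.

```python
def split_read(read, error_positions, min_len=1):
    """Split a read at each flagged position and drop the flagged base.

    error_positions: 0-based indices of bases judged erroneous.
    min_len: hypothetical filter; fragments shorter than this are discarded.
    """
    fragments = []
    start = 0
    for pos in sorted(set(error_positions)):
        if not 0 <= pos < len(read):
            raise ValueError(f"position {pos} is outside the read")
        if pos - start >= min_len:
            fragments.append(read[start:pos])
        start = pos + 1  # skip the faulty base itself
    if len(read) - start >= min_len:
        fragments.append(read[start:])
    return fragments
```

For example, `split_read("ACGTTGCATGGCTTAC", [8])` yields the two fragments on either side of position 8. For a k-mer based assembler, setting `min_len` to k would be one way to discard fragments too short to contribute any k-mer.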


Improved Error Correction of NGS Data

Alic


Summary: The work done for this doctorate thesis focuses on error correction of Next Generation Sequencing (NGS) data in the context of High Performance Computing (HPC).

Due to the reduction in sequencing cost, the increasing output of the sequencers and the advancements in the biological and medical sciences, the amount of NGS data has increased tremendously. Humans alone are not able to keep pace with this explosion of information, therefore computers must assist them to ease the handle of the deluge of information generated by the sequencing machines. Since NGS is no longer just a research topic (used in clinical routine to detect cancer mutations, for instance), requirements in performance and accuracy are more stringent. For sequencing to be useful outside research, the analysis software must work accurately and fast. This is where HPC comes into play. NGS processing tools should leverage the full potential of multi-core and even distributed computing, as those platforms are extensively available. Moreover, as the performance of the individual core has hit a barrier, current computing tendencies focus on adding more cores and explicitly split the computation to take advantage of them.

This thesis starts with a deep analysis of all these problems in a general and comprehensive way (to reach out to a very wide audience), in the form of an exhaustive and objective review of the NGS error correction field. We dedicate a chapter to this topic to introduce the reader gradually and gently into the world of sequencing. It presents real problems and applications of NGS that demonstrate the impact this technology has on science. The review results in the following conclusions: the need of understanding of the specificities of NGS data samples (given the high variety of technologies and features) and the need of flexible, efficient and accurate tools for error correction as a preliminary step of any NGS postprocessing.

As a result of the explosion of NGS data, we introduce MuffinInfo. It is a piece of software capable of extracting information from the raw data produced by the sequencer to help the user understand the data. MuffinInfo uses HTML5, therefore it runs in almost any software and hardware environment. It supports custom statistics to mould itself to specific requirements. MuffinInfo can reload the results of a run which are stored in JSON format for easier integration with third party applications. Finally, our application uses threads to perform the calculations, to load the data from the disk and to handle the UI.

In continuation to our research and as a result of the single core performance limitation, we leverage the power of multi-core computers to develop a new error correction tool. The error correction of the NGS data is normally the first step of any analysis targeting NGS. As we conclude from the review performed within the frame of this thesis, many projects in different real-life applications have opted for this step before further analysis. In this sense, we propose MuffinEC, a multi-technology (Illu...
