2010
DOI: 10.1089/cmb.2010.0127

EDAR: An Efficient Error Detection and Removal Algorithm for Next Generation Sequencing Data

Abstract: Genomic sequencing techniques introduce experimental errors into reads which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length k substrings (k-mers) it contains is calculated. The read is evaluated based on the frequency with which each k-mer occurs in the complete data set (k-count). …
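The k-mer and k-count definitions in the abstract can be illustrated with a minimal Python sketch. The function names (`kmers`, `k_counts`, `read_profile`) and the toy reads are assumptions made for illustration; they are not taken from the EDAR paper or its software.

```python
from collections import Counter


def kmers(read, k):
    """Return all length-k substrings of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def k_counts(reads, k):
    """Count how often each k-mer occurs across the complete data set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_profile(read, counts, k):
    """Return the k-count of every k-mer in a single read."""
    return [counts[km] for km in kmers(read, k)]


# Toy data set (hypothetical): the third read differs from the others
# by one base near its 3' end.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = k_counts(reads, 3)
profile = read_profile("ACGTTC", counts, 3)
print(profile)
```

In this toy input the k-mers that overlap the differing base have lower k-counts than the k-mers shared with the other reads, which is the kind of signal the abstract describes using to evaluate a read.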

Citations: cited by 34 publications (36 citation statements)
References: 22 publications
“…In stage 1, the set of k-mers (substring of fixed length k) of reads from the processed data set is calculated and the distribution of frequencies of k-mers is analyzed (31). It was previously observed that the frequencies of erroneous and correct k-mers follow different distributions (32)(33)(34). Based on this fact, the error threshold is calculated as the minimal frequency of k-mers separating two different distributions.…”
Section: Methods (mentioning)
confidence: 99%
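The citing statement does not say how the separating frequency is found. One common heuristic, shown here purely as an assumption, is to take the first valley of the k-mer frequency histogram: rare (likely erroneous) k-mers pile up at low frequencies, correct k-mers cluster around the coverage depth, and the first local minimum lies between the two. The function names and the example histogram are hypothetical.

```python
from collections import Counter


def frequency_histogram(kmer_counts):
    """Map each frequency f to the number of distinct k-mers seen f times."""
    return Counter(kmer_counts.values())


def valley_threshold(hist):
    """Return the first frequency f at which the histogram starts rising.

    Walks upward from f = 1 and returns the first f with
    hist[f + 1] > hist[f], i.e. the first local minimum.
    Returns None if the histogram never rises.
    """
    max_f = max(hist)
    for f in range(1, max_f):
        if hist.get(f + 1, 0) > hist.get(f, 0):
            return f
    return None


# Hypothetical histogram: many k-mers seen once or twice (error peak),
# then a second peak around frequency 6 (correct k-mers).
hist = {1: 50, 2: 10, 3: 2, 4: 5, 5: 20, 6: 30, 7: 15}
threshold = valley_threshold(hist)
print(threshold)
```

Under this heuristic, k-mers with a frequency at or below the returned value would be treated as likely erroneous. Real histograms are noisy, so a practical version would probably smooth the histogram first.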
“…Available quality control software allows the user to completely remove these duplicates (FASTX-Toolkit) or mark them for downstream analysis consideration (PICARD). Recently, various algorithms utilizing suffix tree data structures were developed for sequencing error correction (Kelley et al., 2010; Zhao et al., 2010). A common procedure in the pre-analysis process, following initial quality control and prior to sequence duplication removal, is the compulsory tag/adapter removal (Lassmann et al., 2009; Schmieder et al., 2010) and optional quality trimming.…”
Section: Pre-analysis Processing (mentioning)
confidence: 99%
“…[99,11] On the other hand, the exact error rate for real data can only be estimated. [53] Gain/Specificity/Sensitivity…”
Section: Methods (mentioning)
confidence: 99%
“…EDAR [99] removes low-quality reads and, from the remaining data, calculates the coverage for all possible k-mers. Using the variable bandwidth mean-shift method [100] for each read, EDAR clusters the k-mers and sets each cluster as erroneous or correct using a threshold derived from the normalized distribution of the coverage.…”
Section: K-spectrum Based (KSB) (mentioning)
confidence: 99%
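The clustering step described in this statement can be sketched roughly as follows. This is not EDAR's implementation: the statement names a variable bandwidth mean-shift method, whereas the sketch below swaps in a plain fixed-bandwidth, flat-kernel mean-shift over the one-dimensional k-counts of a single read. The bandwidth, the threshold value, and all function names are assumptions chosen for illustration.

```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Move each point to the mean of its neighbours until it stops moving.

    Fixed-bandwidth, flat-kernel mean-shift (a simplification of the
    variable bandwidth method named in the text).
    """
    modes = []
    for v in values:
        x = float(v)
        for _ in range(max_iter):
            window = [u for u in values if abs(u - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes


def label_clusters(modes, bandwidth):
    """Assign the same label to points whose modes lie within one bandwidth."""
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if abs(m - c) <= bandwidth:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers


# Hypothetical per-position k-counts for one read: a dip in the middle.
profile = [30, 31, 29, 2, 1, 2, 30]
bandwidth = 5   # assumed value
threshold = 3   # assumed value, not derived from real data

modes = mean_shift_1d(profile, bandwidth)
labels, centers = label_clusters(modes, bandwidth)
# A cluster is flagged as erroneous when its centre is below the threshold.
erroneous = [centers[label] < threshold for label in labels]
print(labels, erroneous)
```

With these toy values the three low-coverage k-mers fall into their own cluster, which is the only one flagged. A variable bandwidth would adapt the window to local density instead of using a single global value.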