2009
DOI: 10.1101/gr.092072.109

A probabilistic approach for SNP discovery in high-throughput human resequencing data

Abstract: New high-throughput sequencing technologies are generating large amounts of sequence data, allowing the development of targeted large-scale resequencing studies. For these studies, accurate identification of polymorphic sites is crucial. Heterozygous sites are particularly difficult to identify, especially in regions of low coverage. We present a new strategy for identifying heterozygous sites in a single individual by using a machine learning approach that generates a heterozygosity score for each chromosomal…

Cited by 21 publications (18 citation statements)
References 20 publications
“…moschata ‘Burpee’s Butterbush’ F2 population in order to anchor markers for downstream analyses. Using a custom python script, genotypes represented by less than seven reads were converted to missing data in order to reduce errors associated with under-calling or the false identification of heterozygous loci, common problems for low-coverage loci [44, 45]. Seven reads is the minimum number required to call a heterozygote using at least two reads of the “less tagged allele” based on the binomial likelihood ratio employed in TASSEL and assuming a sequencing error rate of 1%, a conservative estimate for Illumina sequencing [43, 46].…”
Section: Methods (mentioning)
confidence: 99%
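
The depth filter described in this statement can be sketched roughly as follows. This is an illustration only, not the cited authors' script: the representation of a genotype as a pair of per-allele read counts and the names `mask_low_depth` and `het_vs_hom_log_ratio` are assumptions. The log ratio is a generic binomial comparison of a heterozygote (each read equally likely to show either allele) against a homozygote (minor-allele reads arise only from sequencing error); it is not TASSEL's exact rule and is not claimed to reproduce the seven-read threshold.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical representation: (reads of major allele, reads of minor allele).
Counts = Optional[Tuple[int, int]]

MIN_DEPTH = 7        # read-depth threshold quoted in the statement
ERROR_RATE = 0.01    # sequencing error rate assumed in the statement


def mask_low_depth(genotypes: List[Counts], min_depth: int = MIN_DEPTH) -> List[Counts]:
    """Replace genotypes supported by fewer than min_depth reads with None (missing)."""
    return [
        g if g is not None and sum(g) >= min_depth else None
        for g in genotypes
    ]


def het_vs_hom_log_ratio(depth: int, minor_reads: int, error: float = ERROR_RATE) -> float:
    """Generic binomial log-likelihood ratio: heterozygote vs. homozygote.

    Positive values favour the heterozygote. The binomial coefficient is the
    same under both hypotheses, so it cancels and is omitted.
    """
    log_het = depth * math.log(0.5)
    log_hom = minor_reads * math.log(error) + (depth - minor_reads) * math.log(1.0 - error)
    return log_het - log_hom


if __name__ == "__main__":
    calls = [(5, 2), (3, 3), (0, 10), None]
    print(mask_low_depth(calls))
    print(het_vs_hom_log_ratio(7, 2))
```

Representing missing data as `None` keeps the masked calls distinguishable from genuine zero-count genotypes in later steps.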
“…ProbHD proposed by Hoberman and colleagues [64] used a machine learning approach that considers multiple features to generate a heterozygosity score for each base. Their method, designed specifically for Roche 454 data, considers a large number of features including total read depth, strand-specific depths, read cycle (within-read relative position), per-base quality scores, read alignment quality, and homopolymer length.…”
Section: Methods for SNP Detection and Genotype Calling (mentioning)
confidence: 99%
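
To make the feature list in this statement concrete, here is a minimal sketch of how such per-site features might be assembled from the reads covering one position. It is not ProbHD's implementation: the `Obs` record, the function names, the use of simple means as summaries, and the definition of homopolymer length as the run of identical reference bases containing the site are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Obs:
    """One read's observation at a site (hypothetical record layout)."""
    base: str
    forward: bool          # True if the read maps to the forward strand
    base_quality: int      # per-base quality score
    mapping_quality: int   # read alignment quality
    cycle: float           # relative position within the read, 0.0 to 1.0


def homopolymer_length(ref: str, pos: int) -> int:
    """Length of the run of identical reference bases that contains pos."""
    base = ref[pos]
    left = pos
    while left > 0 and ref[left - 1] == base:
        left -= 1
    right = pos
    while right < len(ref) - 1 and ref[right + 1] == base:
        right += 1
    return right - left + 1


def site_features(obs: List[Obs], ref: str, pos: int) -> Dict[str, float]:
    """Summarise the reads at one site into a flat feature dictionary."""
    if not obs:
        raise ValueError("no reads cover this site")
    forward = sum(1 for o in obs if o.forward)
    return {
        "depth": len(obs),
        "forward_depth": forward,
        "reverse_depth": len(obs) - forward,
        "mean_cycle": mean(o.cycle for o in obs),
        "mean_base_quality": mean(o.base_quality for o in obs),
        "mean_mapping_quality": mean(o.mapping_quality for o in obs),
        "homopolymer_length": homopolymer_length(ref, pos),
    }
```

A dictionary of this kind could then be passed to any classifier; the choice of summaries here is purely illustrative.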
“…Statistical classification methods also use these aforementioned factors as predictive features to discriminate SNPs from sequencing errors. For example, Atlas-SNP2 (14) built a logistic regression model to predict SNPs, whereas ProbHD (15) built a random forest classifier. However, all these methods would fail to detect SNPs if the alignments were incorrect.…”
Section: Introduction (mentioning)
confidence: 99%