2018
DOI: 10.1093/gbe/evy199

Turning Vice into Virtue: Using Batch-Effects to Detect Errors in Large Genomic Data Sets

Abstract: It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling data sets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined data sets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pa…
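The idea stated in the abstract (pairs of variants on different chromosomes that co-occur in individuals) can be illustrated with a toy sketch. This is not the authors' implementation: the function name `cooccurring_pair_load`, the input layout, and the `min_shared` threshold are all assumptions made for illustration, and the published method may define and test co-occurrence quite differently.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pair_load(carriers, chrom_of, min_shared=2):
    """Count, per individual, the cross-chromosome variant pairs they carry.

    carriers   : dict mapping variant id -> set of individuals carrying it
    chrom_of   : dict mapping variant id -> chromosome label
    min_shared : hypothetical threshold; a pair counts as "co-occurring"
                 only if at least this many individuals carry both variants
    """
    load = Counter()
    for a, b in combinations(sorted(carriers), 2):
        if chrom_of[a] == chrom_of[b]:
            continue  # keep only pairs on different chromosomes
        shared = carriers[a] & carriers[b]
        if len(shared) >= min_shared:
            for individual in shared:
                load[individual] += 1
    return load


# Toy data (made up): individuals A and B share variants across chromosomes.
carriers = {
    "v1": {"A", "B"},
    "v2": {"A", "B"},
    "v3": {"A", "B", "C"},
    "v4": {"C"},
}
chrom_of = {"v1": "chr1", "v2": "chr2", "v3": "chr1", "v4": "chr3"}
load = cooccurring_pair_load(carriers, chrom_of)
```

In this toy input, individuals with an unusually high count would be the ones to inspect for batch-specific errors. A real analysis would need a null expectation for the counts, which this sketch does not attempt.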

Cited by 7 publications (9 citation statements)
References 33 publications
“…Differences in sequencing strategies (e.g. read length) and bioinformatic processing have been shown to generate batch effects and dramatically affect downstream analyses [56–59]. Another well known bias in population genetics is ascertainment bias which arises when the studied variants were ascertained in selected populations only, and can substantially impact measurements of heterozygosity and related methods [60].…”
Section: Discussion (mentioning)
confidence: 99%
“…However, we see qualitatively similar patterns in 4 of the 5 populations from the 1KGP. Additionally, previous work has suggested that low-frequency errors that are the consequence of batch effects tend to be in positive LD with each other [41]. Also, variants that are identified as error candidates are more likely to be NS variants [41].…”
Section: Discussion (mentioning)
confidence: 96%
“…Additionally, previous work has suggested that low-frequency errors that are the consequence of batch effects tend to be in positive LD with each other [41]. Also, variants that are identified as error candidates are more likely to be NS variants [41]. With this in mind, we suspect that sequencing errors would bias our analysis of pairs of low frequency variants annotated as being NS towards being more often in positive LD than in negative LD.…”
Section: Discussion (mentioning)
confidence: 96%
“…Mafessoni et al recently identified batch effects in the 1kGP by looking for individuals with excess LD among distant variants. (Mafessoni et al, 2018). Here we point out how these and additional unreported batch effects in the early phases of the 1kGP lead to incorrect genetic conclusions through population genetic analyses and spurious GWAS associations as a result of imputation using the 1kGP as a reference.…”
Section: Introduction Batch Effects In Aging Reference Cohort Data (mentioning)
confidence: 85%
“…A recent publication by Mafessoni et al also identified a batch effect in the 1kGP using a method that uses linkage disequilibrium rather than quality metrics to identify 19,196 suspicious variants with 67% of them passing the 1kGP strict mask (Mafessoni et al, 2018) (Figure S15A). They identify 17,917 variants significantly associated to abnormal LD patterns that are not associated to Q.…”
Section: Identifying Suspicious Variants In the 1000 Genomes Project (mentioning)
confidence: 99%