2020
DOI: 10.1099/mgen.0.000368

prewas: data pre-processing for more informative bacterial GWAS

Abstract: While variant identification pipelines are becoming increasingly standardized, less attention has been paid to the pre-processing of variants prior to their use in bacterial genome-wide association studies (bGWAS). Three nuances of variant pre-processing that impact downstream identification of genetic associations include the separation of variants at multiallelic sites, separation of variants in overlapping genes, and referencing of variants relative to ancestral alleles. Here we demonstrate the importance o…
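As a rough illustration of the first nuance named in the abstract, the sketch below splits one multiallelic site into separate binary rows, one per non-reference allele. It is not the prewas implementation: the function name `split_multiallelic`, the list-of-strings input, and the choice of "A" as the reference (or ancestral) allele are assumptions made for this example, and missing calls are not handled.

```python
from typing import Dict, List


def split_multiallelic(alleles: List[str], reference: str) -> Dict[str, List[int]]:
    """Return one 0/1 row per non-reference allele observed at a site.

    Hypothetical helper for illustration only; 1 marks isolates that
    carry that specific alternate allele, 0 marks all other isolates.
    """
    alt_alleles = sorted(set(alleles) - {reference})
    return {
        alt: [1 if allele == alt else 0 for allele in alleles]
        for alt in alt_alleles
    }


# One site genotyped in five isolates; "A" is assumed to be the reference allele.
site = ["A", "T", "G", "A", "T"]
rows = split_multiallelic(site, reference="A")
for alt, row in rows.items():
    print(f"A>{alt}: {row}")
```

Keeping A>G and A>T as separate rows lets each alternate allele be tested for association on its own instead of being pooled into a single "non-reference" call.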

Citations: cited by 12 publications (15 citation statements)
References: 41 publications
“…To this end, we independently evaluated patient characteristics as well as three different genomic feature sets for their ability to classify colonization and infection. The three genomic feature sets were uncurated genomic (including SNPs, indels, IS elements, and accessory genes), uncurated grouped genomic (variants grouped into genes, akin to a burden test, e.g., (28)), and curated genomic (features identified using Kleborate (16)). Across the 100 different train/test splits, we observed that the average predictive performance was weak, with each of the genomic and patient feature sets predictive of infection to a similar degree (all 1st quartile AUROCs > 0.5; median range = 0.55-0.68; Figure 2A; AUPRC: Figure S3A).…”
Section: Results (mentioning)
confidence: 99%
“…We preprocessed variants to include multiallelic sites and used the major allele method for variant binarization, as described in Saund et al. (35). We used SnpEff to predict the functional impact of single nucleotide variants and indels (high, moderate, low, modifier) (36). Additionally, we considered all insertions in or upstream of genes as high impact, and those downstream of genes as moderate impact.…”
Section: Variant Preprocessing (mentioning)
confidence: 99%
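The "major allele method" mentioned in the statement above can be read as follows: at each site the most frequent allele is coded 0 and every other allele is coded 1 in its own row. The sketch below follows that reading only. The function name `binarize_by_major_allele` and the alphabetical tie-break are assumptions for this example, and the procedure actually used by the citing authors or by prewas may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple


def binarize_by_major_allele(alleles: List[str]) -> Tuple[str, Dict[str, List[int]]]:
    """Code the most common allele as 0 and each other allele as its own 0/1 row.

    Hypothetical helper for illustration only. Ties between equally
    common alleles are broken alphabetically, an arbitrary choice.
    """
    counts = Counter(alleles)
    major = min(counts, key=lambda allele: (-counts[allele], allele))
    rows = {
        alt: [1 if allele == alt else 0 for allele in alleles]
        for alt in sorted(counts)
        if alt != major
    }
    return major, rows


# Six isolates at one multiallelic site: C is seen three times, T twice, G once.
major, rows = binarize_by_major_allele(["C", "C", "T", "C", "G", "T"])
print("major allele:", major)
for alt, row in rows.items():
    print(f"{major}>{alt}: {row}")
```

Because the multiallelic site is kept rather than dropped, both minor alleles (G and T) yield their own binary variant.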
“…An optional argument may be supplied to facilitate grouping genotypes. The genotype matrix and tree can be prepared from a multiVCF file by the variant preprocessing tool prewas (15). Hogwash assumes that the genotype is encoded such that 0 refers to wild type and 1 refers to a mutation and that binary phenotypes are encoded such that 0 refers to absence and 1 refers to presence.…”
Section: Package Description (mentioning)
confidence: 99%
“…Users may supply hogwash with data that was previously grouped (for example, using the group SNPs by gene functionality in prewas (15)) but this approach may mask some genotype transitions. In this case, the user does not need to provide a key and the hogwash grouping step is skipped.…”
Section: Package Description (mentioning)
confidence: 99%
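To make the grouping idea in the statement above concrete, the sketch below collapses variant-level rows into gene-level rows with a logical OR, using the 0 = wild type, 1 = mutation encoding described for hogwash. It is an illustration under stated assumptions, not the prewas or hogwash code: the function name `group_variants_by_gene`, the variant and gene labels, and the dictionary inputs are all hypothetical.

```python
from typing import Dict, List


def group_variants_by_gene(
    variant_rows: Dict[str, List[int]],
    variant_to_gene: Dict[str, str],
) -> Dict[str, List[int]]:
    """Collapse 0/1 variant rows into one 0/1 row per gene.

    Hypothetical helper for illustration only. An isolate is coded 1
    for a gene if it carries at least one variant mapped to that gene.
    """
    grouped: Dict[str, List[int]] = {}
    for variant, row in variant_rows.items():
        gene = variant_to_gene[variant]
        if gene not in grouped:
            grouped[gene] = list(row)
        else:
            grouped[gene] = [g | v for g, v in zip(grouped[gene], row)]
    return grouped


# Four isolates, three variants; snp1 and snp2 are assumed to fall in the same gene.
variant_rows = {
    "snp1": [1, 0, 0, 0],
    "snp2": [0, 1, 0, 0],
    "snp3": [0, 0, 1, 1],
}
variant_to_gene = {"snp1": "geneA", "snp2": "geneA", "snp3": "geneB"}
gene_rows = group_variants_by_gene(variant_rows, variant_to_gene)
print(gene_rows)
```

After grouping, the first two isolates look identical at geneA even though they carry different variants, which is one way pre-grouped data can hide distinct genotype transitions.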