2021
DOI: 10.1101/2021.04.08.438862
Preprint

Genomic sequence characteristics and the empiric accuracy of short-read sequencing

Abstract: Background: Short-read whole genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences, and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. For the clonal pathogen Mycobacterium tuberculosis (Mtb), researchers frequently exclude 10.7% of the genome believed to be repetitive and prone to erroneous varia…

citations
Cited by 10 publications
(11 citation statements)
references
References 66 publications
“…Despite the abovementioned effects of quality filters in performance, these are rarely the only parameter taken into consideration when carrying out variant calling in MTBC species. Repeat-rich regions, such as PE/PPE family proteins, mobile genetic elements or direct repeats, are generally considered low confidence regions either due to a higher error rate or mapping issues (19, 38), which complicates the variant calling process and could give rise to FN and FP SNPs. Indeed, the majority of erroneous calls in our simulation were identified in repeat-rich sequences, especially in pe/ppe genes and the pks12 gene.…”
Section: Discussion (mentioning)
confidence: 99%
“…There is little information as to how reliable low confidence regions are in phylogenetic inference, as their analysis has led to conflicting conclusions (45, 46). Nevertheless, there has been an increasing interest in the usefulness of filtering repeat-rich regions and recent data indicate that more than a half of the masked repetitive regions could be accurately identified using Illumina platforms (38). Even with the limitations of short-read sequencing platforms, the use of de novo assemblies or more refined masking filters may allow informative SNPs to be identified and retained (21, 38, 47).…”
Section: Discussion (mentioning)
confidence: 99%
“…To exclude genetically similar isolates, we first calculated pairwise SNP distance across all isolates. Among the entire dataset of 24,015 isolates, 703,755 total SNPs were identified of which 50,396 were further excluded because they either had low Empirical Base-level Recall (EBR) score (ref. 28), were located in mobile genetic element regions, or were missing in >10% of isolates (Supplementary Figure 1) with 653,359 SNP sites remaining. We excluded 1,416 isolates that had >=10% missing calls at these SNP sites and further excluded 15,771 SNPs where the minor allele didn't occur in any remaining isolates with 637,588 SNPs remaining among 22,599 total isolates (Supplementary Figure 1).…”
Section: Estimation Of Resistance Burden/antibiograms By Country (mentioning)
confidence: 99%
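
The filtering sequence quoted in this statement can be sketched roughly as follows. This is an illustrative sketch only, not the citing authors' pipeline: the function name `filter_sites_and_isolates`, the call encoding (0 = reference, 1 = alternate, `None` = missing), and the toy matrix are assumptions, and the EBR and mobile-genetic-element filters are not modeled. The final lines check only that the counts reported in the quote are arithmetically consistent.

```python
def filter_sites_and_isolates(calls, max_missing=0.10):
    """Toy version of the three filtering steps described in the quote.

    calls: list of rows, one per isolate; each entry is 0 (reference),
    1 (alternate) or None (missing call). This encoding is an assumption.
    Returns (kept isolate indices, kept site indices).
    """
    n_isolates = len(calls)
    n_sites = len(calls[0])

    # Step 1: drop SNP sites missing in more than max_missing of isolates.
    sites = [j for j in range(n_sites)
             if sum(row[j] is None for row in calls) / n_isolates <= max_missing]

    # Step 2: drop isolates with >= max_missing missing calls at retained sites.
    denom = max(len(sites), 1)  # guard against an empty site list
    isolates = [i for i in range(n_isolates)
                if sum(calls[i][j] is None for j in sites) / denom < max_missing]

    # Step 3: drop sites that are no longer polymorphic among the remaining
    # isolates (the minor allele no longer occurs).
    sites = [j for j in sites
             if len({calls[i][j] for i in isolates} - {None}) > 1]
    return isolates, sites


# Toy matrix (4 isolates x 4 sites) with a 25% threshold so each step acts.
toy_calls = [
    [0, 1, None, 0],
    [0, 0, None, 1],
    [1, 0, 0, None],
    [0, 0, 1, 0],
]
kept_isolates, kept_sites = filter_sites_and_isolates(toy_calls, max_missing=0.25)

# Consistency check of the counts reported in the quoted statement.
snps_after_site_filters = 703_755 - 50_396
snps_after_minor_allele_filter = snps_after_site_filters - 15_771
isolates_remaining = 24_015 - 1_416
```

In the toy matrix, the third site is removed in step 1, the third isolate in step 2, and the first site in step 3, because the only alternate call at that site belonged to the removed isolate.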
“…Repetitive gene regions including the PP/PPE gene families are generally problematic for short read-sequencers due to challenges associated with mapping repetitive reads and are therefore excluded from analysis (Meehan et al, 2019). However, recent results posted on the preprint server bioRxiv demonstrated that over 65% of these excluded regions can be accurately analyzed on Illumina with high precision (though low recall), besides the PE_PGRS and PPE_MPTR subfamilies (Marin et al, 2021). Although the function of PE/PPE genes remain unknown, they have been considered important in pathogen-host interaction and virulence exclusive to Mtb and should therefore be included in sequencing analyses (Qian et al, 2020).…”
Section: Introduction (mentioning)
confidence: 99%