2022
DOI: 10.1093/bioinformatics/btac023

Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome

Abstract: Motivation: Short-read whole genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences, and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short…

Cited by 20 publications (17 citation statements)
References 33 publications
“…On this basis, 150 out of 169 pe/ppe genes with good coverage (>0.7 normalized mean coverage) were included to complement the genomic regions analysed and therefore potentially achieve a deeper separation of the transmission clusters. These regions overlapped with previous studies [16, 22]. An extra 568 high-quality SNPs were added, resulting in one additional SNP within the transmission cluster from L2 (S8, S9) and four extra SNPs for L3 (S2, S3, S4), thereby slightly increasing the differences obtained within highly similar samples (Figure 2C).…”
Section: Results (supporting)
confidence: 86%
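The coverage filter described in this excerpt can be illustrated with a minimal sketch. The excerpt does not define "normalized mean coverage", so the code assumes it means a gene's mean read depth divided by the genome-wide mean depth. The function names, gene labels, and depth values are hypothetical and do not come from the cited study.

```python
from statistics import mean


def normalized_mean_coverage(gene_depths, genome_mean_depth):
    """Mean depth across a gene divided by the genome-wide mean depth.

    Assumption: this is what the excerpt means by "normalized mean coverage".
    """
    return mean(gene_depths) / genome_mean_depth


def select_genes(depth_by_gene, genome_mean_depth, threshold=0.7):
    """Return the sorted names of genes whose normalized coverage exceeds the threshold."""
    return sorted(
        gene
        for gene, depths in depth_by_gene.items()
        if normalized_mean_coverage(depths, genome_mean_depth) > threshold
    )


# Hypothetical per-position depths for three genes (illustrative values only).
depth_by_gene = {
    "geneA": [50, 60, 55],
    "geneB": [10, 20, 15],
    "geneC": [40, 35, 45],
}

selected = select_genes(depth_by_gene, genome_mean_depth=50.0)
print(selected)
```

With these made-up values, geneB is dropped because its mean depth of 15 is only 0.3 of the genome-wide mean, below the 0.7 cutoff quoted in the excerpt.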
“…Blind spots for Illumina sequencing technologies have been previously reported [18], for which long-read sequencing technologies can assist [20, 21]. In accordance with previous work [21], our study demonstrates that long-read data has the potential to elucidate complex regions, such as pe/ppe genes, which due to their GC-rich and repetitive nature have been systematically excluded from WGS analysis, losing potential phylogenetic information [16, 22]. Coverage of the Illumina replicates on these regions, and more specifically in the most diverse genes of these two families, was shown to be significantly lower than their ONT counterparts, supporting the potential inclusion of these genes for the downstream analysis in WGS from ONT.…”
Section: Discussion (supporting)
confidence: 89%
“…Recently, few studies characterized those specific regions in the genome, showing that they are close to each other and present a homologous sequence (percent identity of 81%) due to gene duplication, indicating that they could potentially present critical issues with every technology (Karboul et al, 2008; Phelan et al, 2016; de Maio et al, 2020). Interestingly, the remaining PE and PPE regions showed an overall acceptable coverage for SRS and as already described in other studies, the common practice of excluding those genes from the analysis, due to the high GC-content and the repetitive sequences, could be overcome by removing only the PE_PGRS genes (Modlin et al, 2021; Marin et al, 2022).…”
Section: Discussion (mentioning)
confidence: 90%
“…The same technology has been used to investigate tuberculosis outbreaks and transmission dynamics by adopting whole-genome SNP (wgSNP) or core genome Multi-Locus Sequence Typing (cgMLST) schemes assessing genetic relatedness of MTB genomes (Kohl et al, 2014, 2018). However, short-reads technologies are not able to fully resolve hard-to-sequence regions, because has suboptimal capacity to resolve reliably large structural variations, gene duplications, or variations in repetitive regions (Modlin et al, 2021), thereby reducing coverage depth involving a lack of characterization in terms of drug resistance, virulence, and transmission analysis (Medha et al, 2021; Marin et al, 2022). Accurately resolving such regions becomes critical to close bacterial genomes, obtaining more information about virulence, evolutionary mechanisms of drug resistance, and on strain relatedness.…”
Section: Introduction (mentioning)
confidence: 99%
“…We assessed the congruence in variant calls between short-read Illumina data and long-read PacBio data for a set of isolates that underwent sequencing with both technologies (Marin et al, 2022). Using 31 isolates for which both Illumina and a complete PacBio assembly were available, we evaluated the empirical base-pair recall (EBR) of all base-pair positions of the H37rv reference genome.…”
Section: Methods (mentioning)
confidence: 99%
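The per-position evaluation mentioned in this excerpt can be sketched as follows. The excerpt does not give a formula for empirical base-pair recall (EBR), so the code assumes that the EBR of a reference position is the fraction of isolates in which the short-read call at that position agreed with the ground truth derived from the long-read assembly. The function name, input layout, and example values are hypothetical, and positions that are ambiguous in the ground truth are not handled.

```python
def empirical_base_pair_recall(agreement_by_isolate):
    """Per-position fraction of isolates whose short-read call matched the ground truth.

    agreement_by_isolate: one list of booleans per isolate, indexed by reference
    position; True means the short-read call agreed with the long-read-derived
    truth at that position. (Assumed input layout, for illustration only.)
    """
    if not agreement_by_isolate:
        raise ValueError("at least one isolate is required")

    genome_length = len(agreement_by_isolate[0])
    if any(len(row) != genome_length for row in agreement_by_isolate):
        raise ValueError("all isolates must cover the same reference positions")

    n_isolates = len(agreement_by_isolate)
    return [
        sum(row[pos] for row in agreement_by_isolate) / n_isolates
        for pos in range(genome_length)
    ]


# Hypothetical toy input: 4 isolates, 3 reference positions.
agreement = [
    [True, True, False],
    [True, False, False],
    [True, True, False],
    [True, True, True],
]

ebr = empirical_base_pair_recall(agreement)
print(ebr)
```

Under this assumption, a position with EBR near 1 is recovered consistently by short reads across isolates, while a low EBR flags a position where short-read calls are often wrong or missing.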