PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions

Olson, Nathan D.; Wagner, Justin; McDaniel, Jennifer; Stephens, Sarah H.; Westreich, Samuel T.; Prasanna, Anish G.; Johanson, Elaine; Boja, Emily S.; Maier, Ezekiel J.; Serang, Omar; Jáspez, David; Lorenzo-Salazar, José M.; Muñoz-Barrera, Adrián; Rubio-Rodríguez, Luis A.; Flores, Carlos; Kyriakidis, Konstantinos; Malousi, Andigoni; Shafin, Kishwar; Pesout, Trevor; Jain, Miten; Paten, Benedict; Chang, Pi-Chuan; Kolesnikov, Alexey; Nattestad, Maria; Baid, Gunjan; Goel, Sidharth; Yang, Howard H.; Carroll, Andrew; Eveleigh, Robert; Bourgey, Mathieu; Bourque, Guillaume; Li, Gen; Ma, Chouxian; Tang, LinQi; Du, Yuanping; Zhang, Shaowei; Morata, Jordi; Tonda, Raúl; Parra, Genís; Trotta, Jean-Rémi; Brueffer, Christian; Demirkaya-Budak, Sinem; Kabakci-Zorlu, Duygu; Turgut, Deniz; Kalay, Özem; Budak, Güngör; Narcı, Kübra; Arslan, Elif Acar; Brown, Richard C.; Johnson, Ivan J.; Dolgoborodov, Alexey; Semenyuk, Vladimir; Jain, Amit; Tetikol, H. Serhat; Jain, Varun; Ruehle, Mike; Lajoie, Bryan R.; Roddey, Cooper; Catreux, Severine; Mehio, Rami; Ahsan, Mian Umair; Liu, Qian; Wang, Kai; Sahraeian, Sayed Mohammad Ebrahim; Fang, Li Tai; Mohiyuddin, Marghoob; Hung, Calvin; Jain, Chirag; Feng, Hanying; Li, Zhipan; Chen, Luoqi; Sedlazeck, Fritz J.; Zook, Justin M.

doi:10.1016/j.xgen.2022.100129

Cited by 122 publications

(104 citation statements)

References 34 publications

“…The seven reference samples of the Genome-in-a-Bottle (GIAB) consortium ( 29 ), comprising two trios, are among the most extensively sequenced samples in the world. Reference data from these samples are, for example, used in the PrecisionFDA truth challenge to determine the accuracy of variant calls ( 40 ). For variants called from TGS data, the mean PrecisionFDA recall and precision rates were 96.02% (95.53–97.47%, SD = 0.61%) and 98.79% (98.28–99.23%, SD = 0.28%), respectively.…”

Section: Results (mentioning)

confidence: 99%

“…We benchmarked our method using fully elucidated GIAB reference samples, including the Ashkenazim Jewish and the Han Chinese trios. Variant calling concordance was high for GIAB references in terms of recall (96.02%) and precision (98.79%) rates according to the PrecisionFDA Truth Challenge ( 40 ). The phenotypic blood group results from genotyping array for eight of the unknown patients perfectly matched the trivialised TGS results.…”

Section: Discussion (mentioning)

confidence: 99%

High-throughput method for the hybridisation-based targeted enrichment of long genomic fragments for PacBio third-generation sequencing

Steiert, Fuß, Juzėnas et al. 2022

NAR Genomics and Bioinformatics

Hybridisation-based targeted enrichment is a widely used and well-established technique in high-throughput second-generation short-read sequencing. Despite the high potential to genetically resolve highly repetitive and variable genomic sequences by, for example PacBio third-generation sequencing, targeted enrichment for long fragments has not yet established the same high-throughput due to currently existing complex workflows and technological dependencies. We here describe a scalable targeted enrichment protocol for fragment sizes of >7 kb. For demonstration purposes we developed a custom blood group panel of challenging loci. Test results achieved > 65% on-target rate, good coverage (142.7×) and sufficient coverage evenness for both non-paralogous and paralogous targets, and sufficient non-duplicate read counts (83.5%) per sample for a highly multiplexed enrichment pool of 16 samples. We genotyped the blood groups of nine patients employing highly accurate phased assemblies at an allelic resolution that match reference blood group allele calls determined by SNP array and NGS genotyping. Seven Genome-in-a-Bottle reference samples achieved high recall (96%) and precision (99%) rates. Mendelian error rates were 0.04% and 0.13% for the included Ashkenazim and Han Chinese trios, respectively. In summary, we provide a protocol and first example for accurate targeted long-read sequencing that can be used in a high-throughput fashion.
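The recall and precision rates quoted above are the usual benchmarking ratios computed from true-positive, false-positive and false-negative variant calls. A minimal sketch of that arithmetic follows; the function names and the counts in the usage line are hypothetical and are not taken from the cited study or from the PrecisionFDA comparison pipeline.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from variant-call counts.

    tp: calls that match the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    # Guard against empty denominators (no calls, or an empty truth set).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, for illustration only.
precision, recall = precision_recall(tp=960, fp=10, fn=40)
print(f"precision={precision:.4f} recall={recall:.4f} "
      f"F1={f1_score(precision, recall):.4f}")
```

With counts like these, recall is 960/1000 and precision is 960/970, so a caller can miss more variants than it miscalls and still report a precision above its recall, as in the figures quoted above.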

“…Genome graphs are better suited for expressing the genomic regions that have SNPs, indels and SVs than a linear reference sequence [36] since genome graphs combine the linear reference genome with the known genetic variations in the entire population as a graph-based data structure. Therefore, there is a growing trend towards using genome graphs [36,51,54,56,61,62,65,66,124,125] to more accurately express the genetic diversity in a population. With increasing importance and usage of genome graphs, having accurate and efficient tools for mapping genomic sequences to these graphs has become crucial.…”

Section: Graph-based Genome Sequence Analysis (mentioning)

confidence: 99%

“…Multiple outgoing directed edges from a node capture genetic variations. Genome graphs are growing in popularity for a number of genomic applications, such as (1) variant calling [36,54,56], which identifies the genomic differences between the sequenced genome and the reference genome; (2) genome assembly [51,57-59], which reconstructs the entire sequenced genome using the reads without utilizing a known reference genome sequence; (3) error correction [60-62], which corrects the noisy regions in long reads due to sequencing errors; and (4) multiple sequence alignment [63-65], which aligns three or more biological sequences of similar length. With the increasing importance and usage of genome graphs, having fast and efficient techniques and tools for mapping genomic sequences to genome graphs is now crucial.…”

Section: Introduction (mentioning)

confidence: 99%

SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

Cali, Kanellopoulos, Lindegger et al. 2022

Preprint

A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with a much smaller number of practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need to have a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. Since sequence-to-sequence mapping can be treated as a special case of sequence-to-graph mapping, we aim to design an accelerator that is efficient for both linear and graph-based read mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator, which finds the candidate locations in a given genome graph; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator, which performs alignment between a given read and the subgraph identified by MinSeed. We couple SeGraM with high-bandwidth memory to exploit low latency and highly parallel memory access, which alleviates the memory bottleneck.
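The abstract above mentions minimizer-based seeding. As a rough illustration of the general idea only, and not of MinSeed's hardware design, the sketch below picks the smallest k-mer in each window of w consecutive k-mers from a linear sequence. The function name, the lexicographic ordering (practical tools typically order k-mers by a hash) and the toy sequence are assumptions made for this example.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return sorted (position, k-mer) minimizers of seq.

    A minimizer is the smallest k-mer (here: lexicographically, an
    assumption of this sketch) among each window of w consecutive
    k-mers; ties are broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer within this window.
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((start + offset, window[offset]))
    return sorted(selected)


# Toy example: adjacent windows often share a minimizer, so fewer
# seeds are stored than there are k-mers in the sequence.
print(minimizers("ACGTACGT", k=3, w=2))
```

Because neighbouring windows tend to select the same k-mer, indexing only the minimizers keeps the seed index smaller than indexing every k-mer, which is the property that makes the scheme attractive for seeding.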

“…The HiSAT2 (Kim et al, 2019) aligner has been previously shown to be highly accurate with improvements over Bowtie2 (Musich et al, 2021). Dragen (Illumina, Inc., 2021) has been further optimised for accurate variant calling in difficult to map regions of the genome and improved over previous versions in the PrecisionFDA Truth challenge V2 (Illumina, Inc., 2020a; Illumina, Inc., 2020b; Olson et al, 2021; Wagner et al, 2021). Introducing these aligners to the HiCUP+ pipeline allows these highly accurate mapping tools to be used in existing Hi-C data processing workflows.…”

Section: Accuracy and Reproducibility (mentioning)

confidence: 99%

HiCUP-Plus: a fast open-source pipeline for accurately processing large scale Hi-C sequence data

Kelly, Yuhara 2022

Preprint

Hi-C is an unbiased genome-wide assay to study 3D chromosome conformation and gene-regulation. The HiCUP pipeline is an open-source tool to process Hi-C from massively parallel sequencing while accounting for biases specific to the restriction enzyme digests used. It is an excellent solution tailored to analyse this technique, however the latest aligner supported by the current release is Bowtie2. To improve the computational performance and mapping accuracy when using the HiCUP pipeline, we have modified it to optionally call the HiSAT2 and Dragen aligners. This allows using the HiCUP pipeline with 3rd party aligners, including the commercially-licensed high performance Dragen aligner. The HiCUP+ pipeline is modified extensively to be compatible with Dragen outputs while ensuring that the same results as the original pipeline can be reproduced with the Bowtie or Bowtie2 aligners. Using the highly accurate HiSAT2 or Dragen aligners produces larger outputs with a higher proportion of uniquely mapped read pairs. It is therefore feasible to leverage the reduced compute-time of Dragen to reduce compute costs and turnaround-time without compromising quality of results. The HiCUP pipeline and Dragen both compute rich summary information.
