Scalable Reference Genome Assembly from Compressed Pan-Genome Index with Spark

Maarala, Altti Ilari; Arasalo, Ossi; Valenzuela, Daniel; Heljanko, Keijo; Mäkinen, Veli

doi:10.1007/978-3-030-59612-5_6

Cited by 4 publications

(3 citation statements)

References 29 publications


“…Scalability of the workflow can be further improved by adding support to distributing the workload to a cluster. This has already been accomplished for the original PanVC workflow using Spark (Maarala et al., 2020), and we are working with our collaborators to extend the cluster support to include the new founder reconstruction-related parts.…”

Section: Discussion (mentioning)

confidence: 99%

Founder reconstruction enables scalable and seamless pangenomic analysis

et al. 2021

Self Cite


Motivation: Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration into existing workflows remains a challenge.

Results: We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by removing duplicate parts up to a limit into a founder multiple alignment, which is then indexed using a hybrid scheme that exploits general-purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. the vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling.

Availability: Our open access tools and instructions on how to reproduce our experiments are available at https://github.com/algbio/panvc-founders

Supplementary information: Supplementary data are available at Bioinformatics online.


“…In this paper, we show how to use PFP to find the thresholds at the same time as we build the r-index. We refer to the final data structure as MONI, from the Finnish for "multi", since our ultimate intention is to index and use multiple genomes as a reference, whereas other approaches to pangenomics (Garrison et al, 2018; Li et al, 2020; Maarala et al, 2020) index models of genomic databases but not the databases themselves. We compare MONI to PuffAligner (Almodaresi et al, 2021), Bowtie2 (Langmead and Salzberg, 2012), BWA-MEM (Li, 2013), and CHIC (Valenzuela and Mäkinen, 2017) using GRCh37 and haplotypes taken from The 1000 Genomes Project Consortium (2015), and the Salmonella genomes taken from GenomeTrakr (Stevens et al, 2017).…”

Section: Introduction (mentioning)

confidence: 99%

MONI: A Pangenomics Index for Finding MEMs

Rossi, Oliva, Langmead, et al. 2021

Preprint


Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references. Availability: MONI is publicly available at https://github.com/maxrossi91/moni.


“…In our previous studies, we have been focusing on the scalable assembling of reference pan-genomes for enabling pan-genomic variant calling utilizing hybrid-index [15] and scalable searching of viral sequences amongst numerous metagenomes assembled from human samples with ViraPipe [16]. Here, we focus on the distributed compressed hybrid-indexing and propose a scalable distributed compression and indexing tool for a massive number of assembled genomes with read alignment and sequence matching support.…”

Section: Introduction (mentioning)

confidence: 99%

Distributed hybrid-indexing of compressed pan-genomes for scalable and fast sequence alignment

et al. 2021

Self Cite


Computational pan-genomics utilizes information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-controls, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, making genomic data compression and indexing methods very important. Despite current space-efficient repetitive sequence compression and indexing methods, the deployed compression methods are often sequential, computationally time-consuming, and do not provide efficient sequence alignment performance on vast collections of genomes such as pan-genomes. For performing rapid analytics with the ever-growing genomics data, data compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we will focus on the efficient construction of a compressed index for pan-genomes. Compressed hybrid-index enables fast sequence alignments to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets enabling pan-genome-based sequence search and read alignment capabilities. We show the scalability of our tool, DHPGIndex, by executing experiments in a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments have been performed both with human and bacterial genomes. DHPGIndex built a BLAST index for n = 250 human pan-genome with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For n = 1,000 human pan-genome, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node. 
Compressing the GenBank database (n = 13,375,031; 488 GB) into a BLAST index resulted in a CR of 62:1 in 575 minutes. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) against the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were BLASTed against the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were BLASTed against the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
