2020
DOI: 10.1007/978-3-030-59612-5_6

Scalable Reference Genome Assembly from Compressed Pan-Genome Index with Spark

Abstract: High-throughput sequencing (HTS) technologies have enabled rapid sequencing of genomes and large-scale genome analytics with massive data sets. Traditionally, genetic variation analyses have been based on the human reference genome assembled from a relatively small human population. However, genetic variation could be discovered more comprehensively by using a collection of genomes, i.e., a pan-genome, as a reference. The pan-genomic references can be assembled from larger populations or a specific population unde…

Cited by 4 publications
(3 citation statements)
References 29 publications
“…Scalability of the workflow can be further improved by adding support to distributing the workload to a cluster. This has already been accomplished for the original PanVC workflow using Spark (Maarala et al, 2020), and we are working with our collaborators to extend the cluster support to include the new founder reconstruction-related parts.…”
Section: Discussion (mentioning)
confidence: 99%
“…In this paper, we show how to use PFP to find the thresholds at the same time as we build the r-index. We refer to the final data structure as MONI, from the Finnish for "multi", since our ultimate intention is to index and use multiple genomes as a reference, whereas other approaches to pangenomics (Garrison et al, 2018; Li et al, 2020; Maarala et al, 2020) index models of genomic databases but not the databases themselves. We compare MONI to PuffAligner (Almodaresi et al, 2021), Bowtie2 (Langmead and Salzberg, 2012), BWA-MEM (Li, 2013), and CHIC (Valenzuela and Mäkinen, 2017) using GRCh37 and haplotypes taken from The 1000 Genomes Project Consortium (2015), and the Salmonella genomes taken from GenomeTrakr (Stevens et al, 2017).…”
Section: Introduction (mentioning)
confidence: 99%
“…In our previous studies, we have been focusing on the scalable assembling of reference pan-genomes for enabling pan-genomic variant calling utilizing hybrid-index [15] and scalable searching of viral sequences amongst numerous metagenomes assembled from human samples with ViraPipe [16]. Here, we focus on the distributed compressed hybrid-indexing and propose a scalable distributed compression and indexing tool for a massive number of assembled genomes with read alignment and sequence matching support.…”
Section: Introduction (mentioning)
confidence: 99%