Halvade: scalable sequence analysis with MapReduce

Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan

doi:10.1093/bioinformatics/btv179

Cited by 74 publications

(58 citation statements)

References 18 publications

“…In addition to optimizations of the traditional variant calling algorithms [10–13], the community also has been calling for a variant calling toolkit that can take advantage of dedicated MapReduce platforms, as Hadoop [23] and especially Spark [24–26] are more appropriate for this type of genomic data analysis compared to traditional high performance computing (HPC). Thus GATK4, first officially released in January of 2018, is meant to be eventually deployed on data analytics platforms.…”

Section: Introduction

mentioning

confidence: 99%

Recommendations for performance optimizations when using GATK3.8 and GATK4

et al. 2019

Background: Use of the Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. Recently the toolkit has been rapidly evolving. Significant computational performance improvements have been introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as the stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements, to help the community stay abreast with changes in performance.

Results: We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best practices variant calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in run time of only a few hours on whole human genome sequenced to the depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU.

Conclusions: In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at the cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud. For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ∼34.1 hours on 40 samples, with 1.18 samples processed per hour at the cost of $2.60 per sample on c5.18xlarge instance of Amazon Cloud.

“…elPrep 4 achieves its speedups while offering the flexibility to freely plug pipeline steps in or out, and producing the same results as reference implementations of these steps in GATK 4, Picard, and SAMtools. elPrep 4 works with community-defined standards such as SAM/BAM/VCF/BED rather than defining its own formats for achieving its speedups, making elPrep 4 (backwards) compatible with other standard tools and workflows [7, 23, 24]. …”

Section: Discussion

mentioning

confidence: 99%

elPrep 4: A multithreaded framework for sequence analysis

et al. 2019

Self Cite

We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment map files in the Go programming language. elPrep 4 includes multiple new features allowing us to process all of the preparation steps defined by the GATK Best Practice pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep’s parallel execution framework to vastly improve the runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data, and up to 7.4x faster for WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

“…Many sequence aligners which use big data technologies like Apache Hadoop and Spark were implemented in last few years. CloudBurst [22], CloudAligner [23], Halvade [27], SEAL [33], BigBWA [25] and SparkBWA [26] are mostly used sequence aligners which use big data technologies.…”

Section: Related Work

mentioning

confidence: 99%

“…Sequence alignment tools like BigBWA [25], Halvade [27] and SparkBWA [26] are very accurate but they suffer from high time/space complexity for index generation.…”

mentioning

confidence: 99%