Memory-driven computing accelerates genomic data processing

Becker, Matthias; Chabbi, Milind; Warnat-Herresthal, Stefanie; Klee, Kathrin; Schulte-Schrepping, Jonas; Biernat, Paweł; Günther, Patrick; Babler, Kristina M.; Craig, Rory J.; Schultze, Hartmut; Singhal, Sharad; Ulas, Thomas; Schultze, Joachim L.

doi:10.1101/519579

Cited by 5 publications

(4 citation statements)

References 23 publications

“…Other research focuses on innovative hardware platforms to execute genomics algorithms more efficiently [16]. In [17] a large pool of different types of memories are created and connected to processing resources through the Gen-Z communication protocol to investigate the concept of memory-driven computing. The memory is shared across running processes to avoid intermediate I/O operations.…”

Section: Discussion (mentioning)

confidence: 99%

ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

Ahmad, Ahmed, Peltenburg, et al. 2020

2020 3rd International Conference on Computer Applications & Information Security (ICCAIS)

The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, newly emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit from in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.

“…This systems also allows byte-addressability and load/store instructions to access memory. [49] used a Gen-Z enabled platform for genomics and reported 5.9x speedup over the SAMtools baseline implementation for a number of DNA assembly algorithms. The source code is not available.…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

et al. 2020

Background: Immense improvements in sequencing technologies enable the production of large amounts of high-throughput, cost-effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need these large amounts of data closer to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, so processing this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data sets in memory and process them directly from memory, avoiding disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, data must be placed in memory in a suitable format and accessed with high throughput, avoiding (de)serialization and copy overheads between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides a language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in memory without having to access disk storage, avoiding (de)serialization and copy overheads.

Implementation: We integrate an Apache Arrow based in-memory Sequence Alignment/Map (SAM) format and its shared memory object store library into widely used genomics high-throughput data processing applications such as BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects.

Results: Our implementation shows that adopting an in-memory SAM representation in genomics high-throughput data processing applications results in better system resource utilization, fewer memory accesses due to high cache locality exploitation, and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placing and sharing techniques, such as ramDisk and Unix pipes, to show how columnar in-memory data representation outperforms both. We achieve speedups of 4.85x and 4.76x for WGS and WES data, respectively, in overall execution time of variant calling workflows. Similarly, speedups of 1.45x and 1.27x for these data sets, respectively, are achieved compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration, the speedup is even more promising.

Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.

“…They also compare the results on cluster to show that this platform is salable for high performance computing infrastructure and cost efficient. Memory-driven computing [25], in this research a huge pool of different types of memories created and connected to the processing resources through Gen-Z communication protocol. The memory is shared across are the running processes to avoid intermediate I/O operations.…”

Section: Related Work (mentioning)

confidence: 99%

ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

Ahmad, Ahmed, Peltenburg, et al. 2019

Preprint

The rapidly growing volume of human genomics data, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data brings some challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Previously, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, newly emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit from in-memory processing and to ensure better scalability through the shared memory Plasma Object Store, by avoiding huge (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we present an in-memory SAM representation, called ArrowSAM, and integrate the Apache Arrow framework into genome pre-processing applications including BWA-MEM, Sorting and Picard as use cases to show the advantages of ArrowSAM. Our implementation comprises three components. First, we integrated Apache Arrow into BWA-MEM to write output SAM data in ArrowSAM. Second, we sorted all the ArrowSAM data by their coordinates in parallel through pandas dataframes. Finally, Apache Arrow is integrated into the HTSJDK library (used in Picard for disk I/O handling), where all ArrowSAM data is processed in parallel for duplicates removal. This implementation gives promising performance improvements for genome data pre-processing in terms of both speedup and system resource utilization. Due to the columnar data format, better cache locality is exploited in both applications, and shared memory objects enable parallel processing.
