An Efficient GPU-Accelerated Implementation of Genomic Short Read Mapping with BWA-MEM

Houtgast, Ernst; Sima, Vlad-Mihai; Bertels, Koen; Al-Ars, Zaid

doi:10.1145/3039902.3039910

Cited by 12 publications

(5 citation statements)

References 11 publications

“…This is because ACC greatly reduces the computational bottleneck, which increases the relative effect of the storage subsystem on the end-to-end execution time. The ACC and Ideal-ISF+ACC results clearly show that data movement between the storage devices and the hardware accelerator, which has not been properly considered in prior read mapping accelerators [39,40,43,61,62,65,70–77], can significantly bottleneck the potential benefits of the accelerator. Comparison to Other Near-Data Processing Systems.…”

Section: Results and Analysis (mentioning)

confidence: 99%

GenStore: a high-performance in-storage processing system for genome sequence analysis

Ghiasi, Park, Mustafa, et al. 2022

Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems


of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome. CCS CONCEPTS • Computer systems organization → Special purpose systems; • Hardware → External storage.

“…As high-quality algorithms such as BWA-MEM [37] became a de facto standard, the usability of the GPU-aware alignment softwares were limited. Some approaches [13], [29]- [31] tackle this problem and design a seed extension kernel general enough to be used for BWA-MEM using intra-query performance. However, later approaches based on inter-query parallelism outperformed these kernels, which is the strategy adopted by the current state-of-the-art methods such as NVBIO [3] or GASAL2 [9].…”

Section: B. GPU-accelerated Sequence Alignment Softwares (mentioning)

confidence: 99%

SALoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs

Park, Kim, Ahmad, et al. 2023

Preprint


Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is an approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has been conducted on GPU acceleration of a sequence alignment, we identify several shortcomings that limit exploiting the full computational capability of modern GPUs. This paper presents SALoBa, a GPU-accelerated sequence alignment library focused on seed extension. Based on the analysis of previous work with real-world sequencing data, we propose techniques to exploit the data locality and improve workload balancing. The experimental results reveal that SALoBa significantly improves the seed extension kernel compared to state-of-the-art GPU-based methods.


“…To our knowledge the only application-level accelerated integrated implementations of BWA-MEM that exist are: an FPGA-accelerated implementation of the Seed Extension phase [15] achieving a 1.5x speedup, further improved in [16] for an overall 2.6x speedup; and a GPU implementation [9], further improved to achieve an up to 2x speedup [17]. The FPGA implementation used here builds on [15], and a comparison of the implementation here is made to the improved GPU implementation.…”

Section: Related Work (mentioning)

confidence: 99%

“…Then, the details of the FPGA-accelerated implementation on the Alpha Data card are given. Finally, details of the GPU implementation are briefly discussed (further details can be found in [17]). …”

Section: Architecture Design and Implementation (mentioning)

confidence: 99%


Power-efficiency analysis of accelerated BWA-MEM implementations on heterogeneous computing platforms

Houtgast, Sima, Marchiori, et al. 2016

2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)

Self Cite


Abstract: Next Generation Sequencing techniques have dramatically reduced the cost of sequencing genetic material, resulting in huge amounts of data being sequenced. The processing of this data poses huge challenges, both from a performance perspective, as well as from a power-efficiency perspective. Heterogeneous computing can help on both fronts, by enabling more performant and more power-efficient solutions. In this paper, power-efficiency of the BWA-MEM algorithm, a popular tool for genomic data mapping, is studied on two heterogeneous architectures. The performance and power-efficiency of an FPGA-based implementation using a single Xilinx Virtex-7 FPGA on the Alpha Data add-in card is compared to a GPU-based implementation using an NVIDIA GeForce GTX 970 and against the software-only baseline system. By offloading the Seed Extension phase on an accelerator, both implementations are able to achieve a two-fold speedup in overall application-level performance over the software-only implementation. Moreover, the highly customizable nature of the FPGA results in much higher power-efficiency, as the FPGA power consumption is less than one fourth of that of the GPU. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 44 thousand base pairs per Joule, a 2.1x gain in power-efficiency as compared to the software-only baseline.
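The base pairs per Joule unit mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the metric is simply the number of base pairs mapped divided by the energy consumed (average power times runtime); the abstract does not spell out the measurement procedure. The function name and all input numbers below are hypothetical and are not taken from the paper.

```python
def base_pairs_per_joule(base_pairs_mapped: int,
                         avg_power_watts: float,
                         runtime_seconds: float) -> float:
    """Power-efficiency as base pairs mapped per Joule of energy.

    Assumption: energy = average power * runtime. This is an
    illustrative definition, not the paper's measurement setup.
    """
    energy_joules = avg_power_watts * runtime_seconds
    if energy_joules <= 0:
        raise ValueError("energy must be positive")
    return base_pairs_mapped / energy_joules


# Hypothetical run: 1,000,000 reads of 150 bp each,
# mapped in 60 s at an average draw of 100 W.
reads, read_length = 1_000_000, 150
print(base_pairs_per_joule(reads * read_length, 100.0, 60.0))
```

Because the unit normalizes by energy rather than time, it allows platforms with different power draws and runtimes to be compared on one scale, which matches the platform-agnostic comparison the abstract describes.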
