Sparse matrix-vector multiplication (SpMV) operations are commonly used in various scientific and engineering applications. The performance of the SpMV operation often depends on exploiting regularity patterns in the matrix. Various representations and optimization techniques have been proposed to minimize the memory bandwidth bottleneck arising from the irregular memory access pattern involved. Among recent representation techniques, tensor decomposition is a popular one used for very large but sparse matrices. Post sparse-tensor decomposition, the new representation involves indirect accesses, making it challenging to optimize for multi-cores and even more demanding for massively parallel architectures such as GPUs. Computational neuroscience algorithms often involve sparse datasets while still performing long-running computations on them. The Linear Fascicle Evaluation (LiFE) application is a popular neuroscience algorithm used for pruning brain connectivity graphs. The datasets employed herein involve the Sparse Tucker Decomposition (STD), a widely used tensor decomposition method. Using this decomposition leads to multiple indirect array references, making it very difficult to optimize on both multi-core and many-core systems. Recent implementations of the LiFE algorithm show that its SpMV operations are the key bottleneck for performance and scaling. In this work, we first propose target-independent optimizations for these SpMV operations, followed by target-dependent optimizations for CPU and GPU systems. The target-independent techniques include: (1) standard compiler optimizations to prevent unnecessary and redundant computations, (2) data restructuring techniques to minimize the effects of indirect accesses, and (3) methods to partition computations among threads to obtain coarse-grained parallelism with low synchronization overhead.
Then we present the target-dependent optimizations for CPUs, namely: (1) efficient synchronization-free thread mapping, and (2) utilizing BLAS calls to exploit hardware-specific speed. Following that, we present various GPU-specific optimizations to optimally map threads at the granularity of warps, thread blocks, and the grid. Furthermore, to automate the CPU-based optimizations developed for this algorithm, we also extend the PolyMage domain-specific language, embedded in Python. Our highly optimized and parallelized CPU implementation obtains a speedup of 6.3× over the naive parallel CPU implementation running on a 16-core Intel Xeon Silver (Skylake-based) system. In addition, our optimized GPU implementation achieves a speedup of 5.2× over a reference optimized GPU code version on NVIDIA's GeForce RTX 2080 Ti GPU, and a speedup of 9.7× over our highly optimized and parallelized CPU implementation. We make the following novel contributions in this work. First, we generalize the data restructuring methods and computation splitting techniques, and extend them to CPUs. Second, we present CPU-specific optimizations to improve the performance of the LiFE application. Third, we describe DSL bas...