2019
DOI: 10.1177/1094342019886628
Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

Abstract: The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned on different available proces…
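
For context, SpMV computes y = A·x for a sparse matrix A. Below is a minimal C sketch of the kernel using the common compressed sparse row (CSR) layout; the function name spmv_csr is illustrative, and this sequential version only shows the irregular, memory-bound access pattern that partitioning schemes like the paper's must balance across devices:

```c
#include <stddef.h>

/* y = A * x, with A stored in compressed sparse row (CSR) form:
 * row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i,
 * col_idx[k] is the column of the k-th nonzero, val[k] its value. */
void spmv_csr(size_t n_rows, const size_t *row_ptr,
              const size_t *col_idx, const double *val,
              const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* irregular gather on x */
        y[i] = sum;
    }
}
```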

Cited by 14 publications (5 citation statements) · References 32 publications (68 reference statements)
“…SpMV in Commodity Systems. Numerous prior works propose optimized SpMV algorithms for CPUs [5, 37, 59, 60, 62, 63, 108, 136, 165, 171, 172, 182, 193, 204, 209, 235-237, 245, 247, 250, 251, 255, 256, 274], GPUs [18, 27, 48, 61, 70, 91, 107, 162, 203, 227, 231, 233, 243, 253, 260, 261, 265], heterogeneous CPU-GPU systems [10, 19, 34, 116, 117, 202, 262, 264], and distributed CPU systems [24, 28, 38, 40, 85, 125, 150, 161, 183, 196, 201, 242]. Optimized SpMV kernels for processor-centric CPU and GPU systems exploit the shared memory model of these systems and data locality in deep cache hierarchies.…”
Section: Related Work
confidence: 99%
“…2D Partitioning Techniques. We analyze scalability with the number of DPUs for the 2D partitioning techniques. Figures 18, 19 and 20 compare the performance of the equally-sized, equally-wide and variable-sized schemes, respectively, using the COO format and the int32 data type, as the number of DPUs increases.…”
confidence: 99%
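
The “equally-sized” scheme referenced above divides the matrix into a grid of tiles of equal dimensions, each assigned to one worker (a DPU in the cited study). A hedged C sketch of mapping COO nonzeros onto such a 2D grid follows; the names tile_of and count_tile_nnz and the p × q grid parameters are illustrative assumptions, not the cited paper's API:

```c
#include <stddef.h>

/* Map a COO nonzero (r, c) of an m x n matrix onto a p x q grid of
 * equally-sized tiles; tile (bi, bj) is handled by one worker. */
static size_t tile_of(size_t r, size_t c, size_t m, size_t n,
                      size_t p, size_t q)
{
    size_t tile_h = (m + p - 1) / p;   /* rows per tile, rounded up */
    size_t tile_w = (n + q - 1) / q;   /* cols per tile, rounded up */
    size_t bi = r / tile_h, bj = c / tile_w;
    return bi * q + bj;                /* linear tile id in [0, p*q) */
}

/* Count nonzeros per tile -- the load-balance metric that separates
 * equally-sized tiling from the equally-wide/variable-sized variants. */
void count_tile_nnz(size_t nnz, const size_t *row_idx,
                    const size_t *col_idx, size_t m, size_t n,
                    size_t p, size_t q, size_t *tile_nnz /* p*q, zeroed */)
{
    for (size_t k = 0; k < nnz; ++k)
        tile_nnz[tile_of(row_idx[k], col_idx[k], m, n, p, q)]++;
}
```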
“…In recent years, many researchers have focused on fully exploiting multiple different types of computing devices in heterogeneous platforms to cooperatively accelerate the execution of specific applications, such as the minimal hitting set enumeration problem [11], protein sequence alignment algorithms [12], sparse matrix-vector multiplication [13], solidification modeling [14], and high-resolution image restoration algorithms [15]. The above research can fully utilize both multi-core CPUs and many-core GPUs/MICs to accelerate the execution of specified computational tasks, and the experimental results show that performance is significantly improved compared with utilizing CPUs, GPUs, or MICs alone.…”
Section: Introduction
confidence: 99%
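
A common way to realize the cooperative CPU-GPU execution described above is to split the matrix rows so that each device receives a share of the nonzeros proportional to its relative throughput. The C sketch below is a minimal version of that heuristic under stated assumptions: split_row and gpu_share are hypothetical names, gpu_share is a measured or assumed throughput fraction, and the paper's actual partitioner is more elaborate:

```c
#include <stddef.h>

/* Pick a split row s so rows [0, s) go to the GPU and rows [s, n_rows)
 * to the CPU, dividing nonzeros in proportion to gpu_share (e.g. 0.8).
 * Balancing on nonzeros rather than rows is the usual heuristic,
 * since SpMV cost tracks nnz, not row count. */
size_t split_row(size_t n_rows, const size_t *row_ptr, double gpu_share)
{
    size_t total  = row_ptr[n_rows];               /* total nnz */
    size_t target = (size_t)(gpu_share * (double)total);
    size_t s = 0;
    while (s < n_rows && row_ptr[s + 1] <= target)
        ++s;
    return s;  /* run SpMV on rows [0,s) on GPU, [s,n_rows) on CPU */
}
```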
“…In order to adapt to the underlying architecture of hardware accelerators, researchers have focused on restructuring the SpMV algorithm to improve the accelerators' computing performance, targeting the Intel Xeon Phi [10,11], general-purpose graphics processing units (GPGPUs) [12,13], AMD (Advanced Micro Devices) hardware [14,15], field-programmable gate arrays (FPGAs) [16,17], and so on. The Sunway TaihuLight supercomputer [18] is equipped with the SW26010P many-core processor, whose unique hardware architecture provides strong parallel computing capabilities.…”
Section: Introduction
confidence: 99%