Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

Alappat, Christie L.; Laukemann, Jan; Gruber, Tobias; Hager, Georg; Wellein, Gerhard; Meyer, Nils; Wettig, Tilo

doi:10.1109/pmbs51919.2020.00006

Cited by 14 publications

(7 citation statements)

References 12 publications


“…Actually, in the papers by Kreutzer et al [2014] and Almasri and Abu-Sufah [2020], we can find that almost no performance improvement by ELLPACK type kernels over the CSR kernel was obtained for sufficiently large matrices on standard multi-core CPUs. Here, it is worth noting that this tendency differs on many-core CPUs such as Intel Xeon Phi; the effectiveness of SpMV kernels using ELLPACK type formats was reported by Kreutzer et al [2014], Alappat et al [2020], and Nakajima et al [2021].…”

Section: Summary of the Experiments (mentioning)

confidence: 90%

Accelerating the SpMV kernel on standard CPUs by exploiting the partially diagonal structures

Fukaya, Ishida, Miura et al. 2021

Preprint


Sparse Matrix Vector multiplication (SpMV) is one of the basic building blocks in scientific computing, and acceleration of SpMV has been continuously required. In this research, we aim at accelerating SpMV on recent CPUs for sparse matrices that have a specific sparsity structure, namely a diagonally structured sparsity pattern. We focus on a hybrid storage format that combines the DIA and CSR formats, the so-called HDC format. First, we recall the importance of introducing cache blocking techniques into HDC-based SpMV kernels. Next, based on the observation of the cache-blocked kernel, we present a modified version of the HDC format, which we call the M-HDC format, in which partial diagonal structures are expected to be picked up more efficiently. For these SpMV kernels, we theoretically analyze the expected performance improvement based on performance models. Then, we conduct comprehensive experiments on state-of-the-art multi-core CPUs. Through experiments using typical matrices, we clarify the detailed performance characteristics of each SpMV kernel. We also evaluate the performance for matrices appearing in practical applications and demonstrate that our approach can accelerate SpMV for some of them. Through the present paper, we demonstrate the effectiveness of exploiting partial diagonal structures by the M-HDC format as a promising approach to accelerating SpMV on CPUs for a certain kind of practical sparse matrices.


“…We have further improved the ECM machine model for the A64FX CPU introduced in [1] and showed its applicability to the Fugaku processor. We validated the model with simple streaming kernels and could observe a high accuracy for in-memory data sets.…”

Section: Discussion (mentioning)

confidence: 99%

“…More importantly, we have substantially increased the scope of both topics, e.g., by improving the ECM model considering the impact of page sizes and by presenting a detailed ECM model and performance-tuning strategies for SpMV. Topics presented here but not covered in [1] include the case study of the Lattice QCD kernel, the investigation of power-saving mechanisms and specific hardware features of the A64FX and the comparison with state-of-the-art CPUs and GPGPUs.…”

Section: Extended Version of Workshop Short Paper (mentioning)

confidence: 99%

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

Alappat, Meyer, Laukemann et al. 2021

Preprint

Self Cite


The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics (QCD). For SpMV we show why the CRS matrix storage format is not a good practical choice on this architecture and how the SELL-C-σ format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW.


“…benchmarks or other code optimization projects). The emerging pattern is that high speedups typically require much more involved optimization work, such as explicit loop unrolling and development of detailed performance models [25], which call for dedicated projects, when not dedicated staff; but can result in general optimization hints all users will benefit from.…”

Section: Test Run on A64FX Architecture (mentioning)

confidence: 99%

Optimizing the hybrid parallelization of BHAC

Cielo, Porth, Iapichino et al. 2021

Preprint


We present our experience with the modernization on the GR-MHD code BHAC, aimed at improving its novel hybrid (MPI+OpenMP) parallelization scheme. In doing so, we showcase the use of performance profiling tools usable on x86 (Intel-based) architectures. Our performance characterization and threading analysis provided guidance in improving the concurrency and thus the efficiency of the OpenMP parallel regions. We assess scaling and communication patterns in order to identify and alleviate MPI bottlenecks, with both runtime switches and precise code interventions. The performance of optimized version of BHAC improved by ∼28%, making it viable for scaling on hundreds of thousands of supercomputer nodes. We finally test whether porting such optimizations to different hardware is likewise beneficial on the new architecture by running on ARM A64FX vector nodes.

