The Effects of Wide Vector Operations on Processor Caches

Poenaru, Andrei; McIntosh-Smith, Simon

doi:10.1109/cluster49012.2020.00076

Cited by 3 publications

(1 citation statement)

References 10 publications


“…Dongarra [30] reported on basic architectural features, HPC benchmarks (HPL, HPCG, HPL-AI) and the software environment of the Fugaku system. Poenaru and McIntosh-Smith [31] presented results on the effect of using wide vector registers and compared the performance and cache behavior of the A64FX for HPC benchmarks to the ThunderX2 platform. Both Odajima et al. [32] and Jackson et al. [33] investigated benchmarks, full applications and proxy apps in comparison to Intel and other Arm-based systems but did not use performance models for analysis.…”

Section: Discussion (mentioning)

confidence: 99%

Execution‐Cache‐Memory modeling and performance tuning of sparse matrix‐vector multiplication and Lattice quantum chromodynamics on A64FX

Alappat, Meyer, Laukemann et al., 2021

Concurrency and Computation

The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics. For SpMV we show why the compressed row storage (CRS) matrix storage format is not a good practical choice on this architecture and how the SELL-C-𝜎 format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW.


Accelerating CNN inference on long vector architectures via co-design

Rani, Παπαδοπούλου, Pericàs, 2023

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

CPU-based inference can be deployed as an alternative to off-chip accelerators. In this context, emerging vector architectures are a promising option, owing to their high efficiency. Yet the large design space of convolutional algorithms and hardware implementations makes the selection of design options challenging. In this paper, we present our ongoing research into co-designing future vector architectures for CPU-based Convolutional Neural Networks (CNN) inference focusing on the im2col+GEMM and Winograd kernels. Using the Gem5 simulator we explore the impact of several hardware microarchitectural features including (i) vector lanes, (ii) vector lengths, (iii) cache sizes, and (iv) options for integrating the vector unit into the CPU pipeline. In the context of im2col+GEMM, we study the impact of several BLIS-like algorithmic optimizations such as (1) utilization of vector registers, (2) loop unrolling, (3) loop reorder, (4) manual vectorization, (5) prefetching, and (6) packing of matrices, on the RISC-V Vector Extension and ARM-SVE ISAs. We use the YOLOv3 and VGG16 network models for our evaluation. Our co-design study shows that BLIS-like optimizations are not beneficial to all types of vector microarchitectures. We additionally demonstrate that longer vector lengths (of at least 8192 bits) and larger caches (of 256MB) can boost performance by 5×, with our optimized CNN kernels, compared to a vector length of 512-bit and 1MB of L2 cache. In the context of Winograd, we present our novel approach of inter-tile parallelization across the input/output channels by using 8×8 tiles per channel to vectorize the algorithm on vector length agnostic (VLA) architectures. Our method exploits longer vector lengths and offers high memory reuse, resulting in performance improvement of up to 2.4× for non-strided convolutional layers with 3×3 kernel size, compared to our optimized im2col+GEMM approach on the Fujitsu A64FX processor. Our co-design study furthermore reveals that Winograd requires smaller cache sizes (up to 64MB) compared to im2col+GEMM.


ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

Alappat, Meyer, Laukemann et al., 2021

Preprint

The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics (QCD). For SpMV we show why the CRS matrix storage format is not a good practical choice on this architecture and how the SELL-C-𝜎 format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW.


