Evaluating the performance of FFT library implementations on modern hybrid computing systems

Malkovsky, S.I.; Sorokin, A. A.; Tsoy, Georgiy; Korolev, S.P.; Smagin, S. I.; Kondrashev, V. A.

doi:10.1007/s11227-020-03591-6

Cited by 5 publications

(2 citation statements)

References 23 publications

“…More recently, the research community is focusing on developing efficient FFT implementations targeting emerging architectures with different degrees of parallelism, e.g., high number of cores and long SIMD or vector units. Chow et al [3] report their effort in taking advantage of the IBM Cell BE for the computation of large FFTs; Anderson et al [1] make use of FPGAs for accelerating 3D FFTs; Wang et al [13] present an FFT optimization for Armv8 architectures; Malkovsky et al [9] evaluate FFTs on heterogeneous HPC compute nodes including GP-GPUs. Most of those studies are limited to up to 8-elements SIMD units in CPUs or high thread-level parallelism in GPUs while the implementations proposed in our paper are targeting wider vector units.…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating FFT Using NEC SX-Aurora Vector Engine

Vizcaino

Mantovani

Labarta

2022

Euro-Par 2021: Parallel Processing Workshops

Novel architectures leveraging long and variable vector lengths like the NEC SX-Aurora or the vector extension of RISCV are appearing as promising solutions on the supercomputing market. These architectures often require re-coding of scientific kernels. For example, traditional implementations of algorithms for computing the fast Fourier transform (FFT) cannot take full advantage of vector architectures. In this paper, we present the implementation of FFT algorithms able to leverage these novel architectures. We evaluate these codes on NEC SX-Aurora, comparing them with the optimized NEC libraries. We present the benefits and limitations of two approaches of RADIX-2 FFT vector implementations. We show that our approach makes better use of the vector unit, reaching higher performance than the optimized NEC library for FFT sizes under 64k elements. More generally, we prove the importance of maximizing the vector length usage of the algorithm and that adapting the algorithm to replace memory instructions with register shuffling operations can boost the performance of FFT-like computational kernels.

“…More recently, the research community is focusing on developing efficient FFT implementations targeting emerging architectures with different degrees of parallelism, for example, high number of cores and long SIMD or vector units. Chow et al 5 report their effort in taking advantage of the IBM Cell BE for the computation of large FFTs; Anderson et al 6 make use of FPGAs for accelerating 3D FFTs; Wang et al 7 present an FFT optimization for Armv8 architectures; Malkovsky et al 8 evaluate FFTs on heterogeneous HPC compute nodes including GP-GPUs. Most of those studies are limited to up to 8-elements SIMD units in CPUs or high thread-level parallelism in GPUs while the implementations proposed in our article are targeting wider vector units.…”

Section: Related Work (mentioning)

confidence: 99%

Acceleration with long vector architectures: Implementation and evaluation of the FFT kernel on NEC SX‐Aurora and RISC‐V vector extension

Vizcaino

Mantovani

Ferrer

et al. 2022

Concurrency and Computation

Summary: Novel architectures leveraging long and variable vector lengths like the NEC SX‐Aurora or the vector extension of RISCV are appearing as promising solutions on the supercomputing market. These architectures often require re‐coding of scientific kernels. For example, traditional implementations of algorithms for computing the fast Fourier transform (FFT) cannot take full advantage of vector architectures. In this article, we present the implementation of FFT algorithms able to leverage these novel architectures. We evaluate these codes on NEC SX‐Aurora, comparing them with the optimized NEC libraries; and in a prototype of a RISC‐V core with a vector processing unit. We present the benefits and limitations of two approaches of RADIX‐2 FFT vector implementations. We show that our approach makes better use of the vector unit of the NEC SX‐Aurora, reaching higher or equal performance than the optimized NEC library. More generally, we prove the importance of maximizing the vector length usage of the algorithm, taking advantage of the FFT properties to reduce long‐latency vector operations, and reordering the instructions according to the specific hardware features to boost the performance of FFT‐like computational kernels.

4-Valued spectral transforms implementation on GPU with Tensor Cores

Marković

Stojkovic

2022

J Supercomput
