An efficient, model-based CPU-GPU heterogeneous FFT library

Ogata, Yasuhiko; Endo, Tetsuro; Maruyama, Naoya; Matsuoka, Satoshi

doi:10.1109/ipdps.2008.4536163

Cited by 33 publications

(10 citation statements)

References 10 publications


“…They are typically used in applications where data locality is important because they do not require data redistribution. The methods [32], [33], [34] solve the single-objective optimization problem for performance on heterogeneous platforms. The methods [11], [12], [15] solve the bi-objective optimization problem for performance and energy for homogeneous and heterogeneous platforms.…”

Section: Static and Dynamic Optimization Methods (mentioning)

confidence: 99%

Acceleration of Bi-Objective Optimization of Data-Parallel Applications for Performance and Energy on Heterogeneous Hybrid Platforms

2023


Accelerating the bi-objective optimization of applications for performance and energy is crucial to achieving energy efficiency objectives and meeting quality-of-service requirements in modern high-performance computing platforms and cloud computing infrastructures. In this work, we highlight the crucial challenges to accelerate model-based methods proposed for the bi-objective optimization of data-parallel applications for performance and energy that employ workload distribution between the executing processors as the decision variable. The methods solve unconstrained bi-objective optimization problems and take input, the processors' performance and energy profiles in the form of discrete functions of workload size, and output Pareto-optimal solutions (workload distributions), minimizing the execution time and the total energy consumption of computations during the parallel execution of the application. One of the challenges is the fast computation of Pareto-optimal solutions. We then formulate the bi-objective optimization problem of data-parallel applications for performance and energy through workload distribution on a cluster of p identical hybrid nodes, each containing h heterogeneous processors. The state-of-the-art algorithm for solving the problem is sequential and takes exorbitant execution times to find Pareto-optimal solutions for even moderate numbers of processors. We propose two algorithms that address this shortcoming. The first algorithm is an exact sequential algorithm that is more efficient and amenable to parallelization and achieves a complexity reduction of O(m × h) over the state-of-the-art sequential algorithm where m is the cardinality of the input discrete execution time and dynamic energy functions. The second algorithm is a parallel algorithm executed by q identical parallel processes that reduces the complexity of our proposed sequential algorithm by O(q) and therefore achieves a complexity reduction of O(m × h × q) over the state-of-the-art sequential algorithm. Finally, we experimentally analyze the practical efficacy of our proposed algorithms for two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous hybrid node containing an Intel Haswell multicore CPU, an Nvidia K40c GPU, and an Nvidia P100 GPU and simulations of clusters of such hybrid nodes. The experiments demonstrate that our proposed algorithms provide tremendous speedups over state-of-the-art solutions.

INDEX TERMS: high-performance heterogeneous computing, energy-efficient computing, bi-objective optimization, performance optimization, energy optimization, data-parallel applications, workload distribution

I. INTRODUCTION

Performance and energy are the two most important objectives for optimization in modern high performance computing (HPC) platforms, computational grids, data centers, and cloud computing infrastructures ([1],[2],[3],[4]). Achieving the energy efficiency objectives…



“…Additionally, to decrease the overhead for data transfer between the host and device memories, they performed matrix transposition before sending data. Chen and Li [10] extended the approach of Gu and others [16] and used both a GPU and CPU for FFT computations, similar to Ogata and others [24]. Unlike Gu and others [16], they used a 2D data-copy application programming interface (API) instead of gathering multiple subarrays before sending them to transfer multidimensional data.…”

Section: Related Work (mentioning)

confidence: 99%

Large‐scale 3D fast Fourier transform computation on a GPU

Lee

Kim

2023

ETRI Journal


We propose a novel graphics processing unit (GPU) algorithm that can handle a large‐scale 3D fast Fourier transform (i.e., 3D‐FFT) problem whose data size is larger than the GPU's memory. A 1D FFT‐based 3D‐FFT computational approach is used to solve the limited device memory issue. Moreover, to reduce the communication overhead between the CPU and GPU, we propose a 3D data‐transposition method that converts the target 1D vector into a contiguous memory layout and improves data transfer efficiency. The transposed data are communicated between the host and device memories efficiently through the pinned buffer and multiple streams. We apply our method to various large‐scale benchmarks and compare its performance with the state‐of‐the‐art multicore CPU FFT library (i.e., fastest Fourier transform in the West [FFTW]) and a prior GPU‐based 3D‐FFT algorithm. Our method achieves a higher performance (up to 2.89 times) than FFTW; it yields more performance gaps as the data size increases. The performance of the prior GPU algorithm decreases considerably in massive‐scale problems, whereas our method's performance is stable.


“…Performance models have been proposed to implement work-distribution schemes (Choi et al 2013; Zhong et al 2012). In Ogata et al (2008), the authors present a library for 2D Fast Fourier Transform (FFT) that automatically uses both CPUs and GPUs to achieve optimal performance. Using a performance model, it evaluates the respective contributions of each computing unit and then makes an estimation of total execution times.…”

Section: Related Work (mentioning)

confidence: 99%

Heterogeneous programming using OpenMP and CUDA/HIP for hybrid CPU-GPU scientific applications

Tallada

Morancho

2023

The International Journal of High Performance Computing Applica


Hybrid computer systems combine compute units (CUs) of different nature like CPUs, GPUs and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of the applications into balanced parallel tasks according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications, when mixed with GPU-oriented programming models (e.g. CUDA/HIP). The paper describes the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different nature (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task schedulings. Then, we improve the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and 2 × GPU AMD Radeon Instinct MI50 with 32 GB, hybrid executions present speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.


