Fast Fourier Transform (FFT) is a fundamental operation for 2D data in various applications. To accelerate large-scale 2D-FFT computation, we propose a Heterogeneous parallel In-place 2D-FFT algorithm, HI-FFT. Our novel work decomposition method makes it possible to run our parallel algorithm on the original data (i.e., in-place), unlike prior parallel algorithms that require additional memory space (i.e., out-of-place) to guarantee independence among sub-tasks. Our work decomposition method also removes the duplicated operations on the out-of-place approaches. Using our decomposition method, we introduced an in-place heterogeneous parallel algorithm that utilizes both multi-core CPU and GPU simultaneously. To maximize the utilization efficiency of the computing resources, we also propose a priority-based dynamic scheduling method. We compared the performance of seven different 2D-FFT algorithms, including ours, for large-scale 2D-FFT problems whose sizes varied from 20K 2 to 120K 2 . As a result, we found that our method achieved up to 2.92 and 4.42 times higher performance than the conventional homogeneous parallel algorithms based on the state-of-the-art CPU and GPU libraries, respectively. Also, our method showed up to 2.27 times higher performance than the prior heterogeneous algorithms while requiring two times less memory space.To check the benefit of our HI-FFT on an actual application, we applied it to a CGH (Computer Generated Holography) process. We found that it successfully reduces the hologram generation time. These results demonstrate the advantage of our approach for large-scale 2D-FFT computation.
We propose a novel out-of-core GPU algorithm for 2D-Shift-FFT (i.e., 2D-FFT with FFT-shift) to generate ultra-high-resolution holograms. Generating an ultra-high-resolution hologram requires a large complex matrix (e.g., 100K2) with a size that typically exceeds GPU memory. To handle such a large-scale hologram plane with limited GPU memory, we employ a 1D-FFT based 2D-FFT computation method. We transpose the column data to have a continuous memory layout to improve the column-wise 1D-FFT stage performance in both the data communication and GPU computation. We also combine the FFT-shift and transposition steps to reduce and hide the workload. To maximize the GPU utilization efficiency, we exploit the concurrent execution ability of recent heterogeneous computing systems. We also further optimize our method’s performance with our cache-friendly chunk generation algorithm and pinned-memory buffer approach. We tested our method on three computing systems having different GPUs and various sizes of complex matrices. Compared to the conventional implementation based on the state-of-the-art GPU FFT library (i.e., cuFFT), our method achieved up to 3.24 and 3.06 times higher performance for a large-scale complex matrix in single- and double-precision cases, respectively. To assess the benefits offered by the proposed approach in an actual application, we applied our method to the layer-based CGH process. As a result, it reduced the time required to generate an ultra-high-resolution hologram (e.g., 100K2) up to 28% compared to the use of the conventional algorithm. These results demonstrate the efficiency and usefulness of our method.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.