Image processing algorithms dominate contemporary digital systems because of their importance and their adoption across a large number of application domains. Despite their significance, their computational requirements often limit their use, especially in deeply embedded designs. Heterogeneous computing systems offer a promising way to close this performance gap, which has led designers to adopt them at an ever-increasing rate. This work targets the acceleration of an image registration pipeline on a System-on-Chip (SoC) that includes both general-purpose and reconfigurable computing elements. The evaluation of our proposed HW/SW co-designed image registration application on a state-of-the-art FPGA-based SoC shows that it outperforms software designs, achieving speedups of up to 67x over embedded CPUs.
CCS CONCEPTS: • Computing methodologies → Image processing; • Hardware → Hardware accelerators; Hardware-software codesign.
Large-scale simulations of wave-type equations have many industrial applications, such as oil and gas exploration. Realistic simulations, which involve a vast amount of data, are often performed on multiple nodes of an HPC cluster. Using GPUs for these simulations is attractive because the algorithms are highly parallelizable. Many industry-relevant simulations have characteristics in their physics or geometry that can be exploited to improve computational efficiency. Furthermore, the choice of simulation algorithm significantly affects computational efficiency. In this work, we exploit these features to significantly improve performance for a class of problems. Specifically, we use the discontinuous Galerkin (DG) finite element method together with the Gauss-Lobatto-Legendre (GLL) integration scheme on hexahedral elements with straight faces. This combination greatly reduces the number of BLAS operations and simplifies the computations to Level-1 BLAS operations, reducing the turnaround time for wave simulation. However, these codes exacerbate bottlenecks caused by data movement, so attaining peak GPU performance is often not possible, even on modern GPUs with the latest high-bandwidth memory. We have developed GAPS, an efficient and scalable GPU-accelerated PDE solver for wave simulation, by using hardware- and data-movement-aware algorithms. While significant speedup over CPUs can be achieved, data movement still limits GPU performance. We present several optimization strategies, including kernel fusion, look-up-table-based neighbor search, improved shared memory utilization, and SM-occupancy-aware register allocation. They improve performance by up to 84.15x over CPU implementations and by 1.84x over base GPU implementations on average. We then extend GAPS to support multiple GPUs on multi-node HPC clusters for large-scale wave simulations, and perform additional optimizations to reduce communication overhead. We also investigate the performance of several MPI libraries in order to fully overlap communication and computation. We are able to reduce the communication overhead by 70% and achieve weak scaling over 128 GPUs.
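The reduction to Level-1-style operations mentioned above can be illustrated with a well-known property of GLL collocation: when the Lagrange basis nodes coincide with the GLL quadrature points, the mass matrix is diagonal, so applying its inverse is an elementwise scaling rather than a matrix solve. The following is a minimal 1D sketch of that property only, not the GAPS implementation; the function names (`lagrange`, `mass_matrix`, `apply_inverse_mass`), the choice of a 4-point rule, and the constant `jacobian` are assumptions made for illustration.

```python
import math

# 4-point Gauss-Lobatto-Legendre (GLL) rule on [-1, 1] (polynomial degree 3).
# Illustrative choice; the abstract does not state the order that is used.
GLL_NODES = [-1.0, -1.0 / math.sqrt(5.0), 1.0 / math.sqrt(5.0), 1.0]
GLL_WEIGHTS = [1.0 / 6.0, 5.0 / 6.0, 5.0 / 6.0, 1.0 / 6.0]


def lagrange(i, x, nodes=GLL_NODES):
    """Evaluate the i-th Lagrange basis polynomial defined on `nodes` at x."""
    value = 1.0
    for k, xk in enumerate(nodes):
        if k != i:
            value *= (x - xk) / (nodes[i] - xk)
    return value


def mass_matrix(nodes=GLL_NODES, weights=GLL_WEIGHTS, jacobian=1.0):
    """Assemble the 1D mass matrix using GLL quadrature at the basis nodes.

    Because each basis function is 1 at its own node and 0 at the others,
    the quadrature sum collapses to M[i][i] = jacobian * weights[i].
    """
    n = len(nodes)
    return [
        [
            jacobian * sum(
                w * lagrange(i, x, nodes) * lagrange(j, x, nodes)
                for x, w in zip(nodes, weights)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def apply_inverse_mass(rhs, weights=GLL_WEIGHTS, jacobian=1.0):
    """Apply the inverse of the diagonal mass matrix as an elementwise scaling."""
    return [r / (jacobian * w) for r, w in zip(rhs, weights)]
```

On a hexahedral element the basis is a tensor product of such 1D bases, so the same diagonal structure carries over to three dimensions, with the diagonal entries becoming products of three 1D weights and the local Jacobian determinant.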