High-level parallel programming approaches have recently become popular in complex fluid dynamics research because they are cross-platform and easy to implement. The paradigm currently gaining popularity shifts careful optimization from programmers to intelligent compilers that, ideally, make efficient use of hardware resources. OpenACC is a high-level, directive-based programming standard for accelerators such as graphics processing units (GPUs). Its portable programming model allows the development procedure on CPU and GPU platforms to be effectively unified. The directive-based model stands in contrast to the detailed implementation model of CUDA and OpenCL, which suffers from poor portability and maintainability when the accelerator hardware changes. In this work, we outline guidelines for efficiently annotating the base serial CPU version of a CFD code to achieve acceptable multi-threaded computational performance on the GPU. An Artificial Compressibility Method (ACM) is used to study steady-state incompressible 2D and 3D heat and fluid flows. We study loop scheduling with careful attention to the limited on-chip memory available. The possibility of asynchronous calculations in the CFD algorithm is investigated as a way to increase computing performance. The PGI Fortran 15.7 compiler is used for this study. Memory management and loop scheduling considerations yield approximately 20% speedup in GPU computing performance over the base OpenACC implementation. A performance analysis of the thermal flow solver shows that it runs approximately 8 times faster on a single NVIDIA Tesla C2075 GPU card than on a single Xeon processor core, and that its speed is comparable to that of 8 cores on a Xeon CPU server using an MPI multi-CPU implementation of the code.
A multi-GPU performance analysis is performed using NVIDIA Tesla M2050 GPUs on the Virginia Tech HokieSpeed supercomputer. A weak scalability analysis demonstrates up to 90% efficiency with 32 GPUs. With 16 GPUs, the thermal flow solver runs up to 190 times faster than a single Xeon CPU, 10 times faster than a single GPU, and 12 times faster than 16 Xeon CPUs distributed on separate sockets on the HokieSpeed supercomputer using an MPI multi-CPU implementation of the code.
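The directive-based annotation described above can be illustrated with a minimal sketch. It is not the solver from this work: the study uses PGI Fortran and an ACM scheme, whereas this example is written in C and uses a simple Jacobi-style averaging sweep as a stand-in kernel. The function names `sweep` and `relax`, the row-major grid layout, and the `vector_length(128)` setting are all assumptions chosen for illustration. The sketch shows two of the ideas mentioned: a data region that keeps arrays resident on the accelerator across iterations, and explicit gang/vector loop scheduling.

```c
#include <assert.h>

/* Hypothetical stand-in kernel: one Jacobi-style averaging sweep over the
 * interior of an nx-by-ny row-major grid. Boundary entries of t_new are
 * never written, so the caller must initialise them. */
void sweep(int nx, int ny, const double *restrict t_old, double *restrict t_new)
{
    assert(nx >= 3 && ny >= 3);
    /* Outer loop mapped to gangs, inner loop to vector lanes.
     * vector_length(128) is an arbitrary illustrative choice, not a tuned value.
     * present(...) states that the arrays already live on the device. */
    #pragma acc parallel loop gang vector_length(128) \
        present(t_old[0:nx*ny], t_new[0:nx*ny])
    for (int j = 1; j < ny - 1; ++j) {
        #pragma acc loop vector
        for (int i = 1; i < nx - 1; ++i) {
            t_new[j * nx + i] = 0.25 * (t_old[j * nx + i - 1] +
                                        t_old[j * nx + i + 1] +
                                        t_old[(j - 1) * nx + i] +
                                        t_old[(j + 1) * nx + i]);
        }
    }
}

/* Runs `pairs` ping-pong pairs of sweeps (t -> work -> t), so the final
 * result ends up in t. Both arrays must hold the same boundary values on
 * entry. The data region copies the arrays to the device once and back
 * once, instead of transferring them around every kernel launch. */
void relax(int nx, int ny, int pairs, double *t, double *work)
{
    #pragma acc data copy(t[0:nx*ny], work[0:nx*ny])
    {
        for (int it = 0; it < pairs; ++it) {
            sweep(nx, ny, t, work);
            sweep(nx, ny, work, t);
        }
    }
}
```

Compiled without OpenACC support, the pragmas are ignored and the same source runs serially on the CPU, which is the single-source portability property the directive-based approach relies on.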