Summary
Stencil computations are an important class of problems that can benefit from graphics processing units (GPUs). However, given the hierarchical and on-chip blocked memory organization of GPUs, memory performance degrades for specific data access patterns in stencils. Appropriate data layouts are therefore needed to use the different levels of the memory hierarchy effectively and harvest the full potential of GPUs. In this context, a specialized stencil computation problem, namely the Lattice Boltzmann Method, which has a complex neighborhood relationship along with loop-carried dependence, is considered as a strong case study. Four different approaches for the Lattice Boltzmann Method have been developed in this work by exploiting the memory hierarchy with new data layouts and kernel organizations. These methods have been developed with the primary aim of increasing the compute-to-global-memory-access ratio and reducing the overall read-write latency, even at the expense of additional computations. NVIDIA TitanX, GTX 960, GTX 740Ti, and GTX 650Ti GPUs have been used to test the proposed techniques. The compute-to-global-memory-access ratio shows an improvement of 2 to 10 times over the naive solutions in this work. The performance, in terms of time taken per iteration, improves by up to 3.7 times. The throughput in million lattice updates per second for both the D2Q9 and D3Q19 models improves by more than 2 times.
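To make the stencil structure concrete, the following is a minimal sketch of the D2Q9 streaming step the summary refers to, written in pure Python. The structure-of-arrays layout (`f[q]` holding one contiguous array per lattice direction) and the pull scheme with double buffering are illustrative choices, not the paper's specific layouts; they show the complex neighborhood and how the loop-carried dependence of streaming is avoided.

```python
# Hypothetical minimal D2Q9 streaming step (pull scheme), pure Python.
# Not the paper's implementation; a sketch of the stencil pattern only.

# D2Q9 discrete velocities: rest direction, 4 axis, 4 diagonal.
C = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]

def stream(f, nx, ny):
    """One pull-streaming step with periodic boundaries.

    f uses a structure-of-arrays layout: f[q][y*nx + x], so each
    direction q is contiguous in memory (the kind of layout choice
    that governs coalesced access on a GPU). Each site pulls the
    q-th population from its upwind neighbor; writing into a fresh
    buffer avoids the loop-carried dependence of in-place streaming.
    """
    out = [[0.0] * (nx * ny) for _ in range(len(C))]
    for q, (cx, cy) in enumerate(C):
        for y in range(ny):
            for x in range(nx):
                src_x = (x - cx) % nx  # upwind neighbor, periodic wrap
                src_y = (y - cy) % ny
                out[q][y * nx + x] = f[q][src_y * nx + src_x]
    return out
```

On a GPU, each (x, y) site would map to one thread; the nine gather offsets per site are what make the access pattern, and hence the data layout, performance-critical.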
Summary
Graphics Processing Units (GPUs) are used as accelerators to improve the performance of highly data-parallel applications. GPUs are characterized by a number of Streaming Multiprocessors (SMs) and a large number of cores within each SM. In addition, a hierarchy of memories with different latencies and sizes is present in GPUs. Program execution on GPUs thus depends on a number of parameter values, both at compile time and at runtime. To obtain optimal performance from these GPU resources, a large parameter space must be explored, which leads to a number of unproductive program executions. To alleviate this difficulty, machine learning–based autotuning systems have been proposed to predict the right configuration from a limited set of compile-time parameters. In this paper, we propose a two-stage machine learning–based autotuning framework using an expanded set of attributes. Important parameters such as block size, occupancy, eligible warps, and execution time are predicted. The mean relative error in predicting the different parameters ranges from 6.5% to 16%. Dimensionality reduction of the feature set reduces the number of features by up to 50% while further increasing prediction accuracy.
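The two-stage structure can be sketched as follows. This is a hypothetical illustration, not the paper's framework: a trivial nearest-neighbor regressor stands in for whatever learned models the framework uses, and the feature names and training tuples are invented. What it shows is only the pipeline shape: stage 1 maps compile-time kernel features to intermediate parameters (here block size and occupancy), and stage 2 appends those predictions to the features to predict execution time.

```python
# Hypothetical two-stage autotuning predictor (illustrative only).
# A 1-nearest-neighbor regressor stands in for the trained ML models.

def nn_predict(train_x, train_y, x):
    """Return the target of the closest training point (stand-in model)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(train_x[i], x)))
    return train_y[best]

def two_stage_predict(stage1_data, stage2_data, features):
    # Stage 1: compile-time features -> (block_size, occupancy).
    xs1, ys1 = stage1_data
    block_size, occupancy = nn_predict(xs1, ys1, features)
    # Stage 2: features plus stage-1 predictions -> execution time.
    xs2, ys2 = stage2_data
    return nn_predict(xs2, ys2, list(features) + [block_size, occupancy])
```

Chaining the stages this way lets the execution-time model condition on predicted launch parameters without ever running the kernel, which is the point of avoiding unproductive program executions.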