Optimizing the computation of a parallel 3D finite difference algorithm for graphics processing units

Porter-Sobieraj, Joanna; Cygert, Sebastian; Kikoła, D. P.; Sikorski, Jan; Słodkowski, M.

doi:10.1002/cpe.3351

Cited by 8 publications

(9 citation statements)

References 20 publications

(43 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…[7][8][9]11,19,21 This happens because, although using row-major order enables some coalesced memory accesses for block warps, all blocks in the grid will compete for space in the cache. However, as each thread within a block will need its own data plus its six neighbors, this may lead to data access redundancy.…”

Section: Figurementioning

confidence: 99%

“…This means that the chance of a cache miss increases as the grid size grows. [7][8][9]11,21 In CUDA, threads of the same block can communicate with each other through shared memory, which is a low-latency memory, but limited in total storage. According to Equation (5), at the time thread k asks for data in position (x + 1, y, z), thread k + 1 had requested this same data already, so it should be in cache memory.…”

Section: Figurementioning

confidence: 99%

“…They evaluated not only the performance of adapted CPU implementations for GPUs, [33][34][35] but also different parallel strategies 11,20,[36][37][38][39] and numerical methods. This means they can process up to 9.96 × 10 10 domain points (cells) per second. The authors compared spiral waves simulated with their WebGL approach with spiral waves from experimental data.…”

Section: Related Workmentioning

confidence: 99%

“…[7][8][9] With this in mind, we propose in this work a novel and smart data structure suitable for problems that access data from nearest neighbors, such as the discretization of the Laplacian operator, and we emphasize the GPU memory architecture as well as the CPU features. Discretization of the Laplacian requires some data memory accesses in locations that are physically close in space.…”

mentioning

confidence: 99%

“…As a result, 3D domains, eg, require several clock cycles to load all the necessary data, thereby increasing the latency of memory access. [7][8][9] With this in mind, we propose in this work a novel and smart data structure suitable for problems that access data from nearest neighbors, such as the discretization of the Laplacian operator, and we emphasize the GPU memory architecture as well as the CPU features. Although our solution may be generalized for several problems that use Laplacian discretizations, we focus this work on cardiac electrical simulations.…”

mentioning

confidence: 99%

See 4 more Smart Citations

Accelerating simulations of cardiac electrical dynamics through a multi‐GPU platform and an optimized data structure

Vasconcellos

Clua

Fenton

et al. 2019

Concurrency and Computation

View full text Add to dashboard Cite

Simulations of cardiac electrophysiological models in tissue, particularly in 3D require the solutions of billions of differential equations even for just a couple of milliseconds, thus highly demanding in computational resources. In fact, even studies in small domains with very complex models may take several hours to reproduce seconds of electrical cardiac behavior.Today's Graphics Processor Units (GPUs) are becoming a way to accelerate such simulations, and give the added possibilities to run them locally without the need for supercomputers.Nevertheless, when using GPUs, bottlenecks related to global memory access caused by the spatial discretization of the large tissue domains being simulated, become a big challenge. For simulations in a single GPU, we propose a strategy to accelerate the computation of the diffusion term through a data-structure and memory access pattern designed to maximize coalescent memory transactions and minimize branch divergence, achieving results approximately 1.4 times faster than a standard GPU method. We also combine this data structure with a designed communication strategy to take advantage in the case of simulations in multi-GPU platforms.We demonstrate that, in the multi-GPU approach performs, simulations in 3D tissue can be just 4× slower than real time. KEYWORDScardiac electrophysiology models, GPU Computing, memory access optimization, parallel cardiac dynamics simulations INTRODUCTIONThe large increase of computational power over the last years shifted the bottleneck of different algorithms to the memory bandwidth and memory management. 1 One typical solution employed by hardware assemblers to minimize this issue is hardware hierarchical memory and memory locality optimization.Computational systems organize hierarchical memory system into levels. In the on-chip level, the registers are the fastest memory, with a high cost per byte and low capacity. Next, there are different cache levels according to the hardware architecture, typically called L1, L2, and so on.The main memory is the next level; here, the cost per byte is less than cache or registers, but latency is high. The last level is the secondary memory that has the highest latency with the lowest cost per byte. Overall, the cost per byte of each level determines the capacity and latency, which directly impact in performance.As each level of the hierarchical memory system has a different storage capacity and data is usually kept at the lowest memory level, computational systems must choose for each level which data will be prioritized to stay in memory and which will be removed when that memory level fills up. To do so, the computer memory system employs two fundamental principles, ie, temporal and spatial locality. 2 In general, these strategies aim to keep the most recently used data in the same memory level, since having to access higher memory levels drastically increases the time of the search.Based on the memory hierarchical principles, some researchers have tried to minimize memory system bottlenecks through s...

show abstract

Section: Figurementioning

confidence: 99%

Section: Figurementioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

mentioning

confidence: 99%

mentioning

confidence: 99%

See 3 more Smart Citations

Accelerating simulations of cardiac electrical dynamics through a multi‐GPU platform and an optimized data structure

Vasconcellos

Clua

Fenton

et al. 2019

Concurrency and Computation

View full text Add to dashboard Cite

show abstract

Parallel computing on graphics processing units and heterogeneous platforms

Bientinesi

Herrero

Quintana–Ort́ı

et al. 2014

Concurrency and Computation

View full text Add to dashboard Cite

During the past decade, high-performance computing evolved toward multi-core and many-core architectures. General-purpose processors feature now dozens of coarse-grain (complex) cores each with four to eight SIMD lanes for parallel computation and multi-channel memory buses for high bandwidth. Hardware accelerators such as graphics processing units (GPUs) have also a two-stage design with multiple coarse units that contain an even higher number of SIMD lanes and wider memory buses for higher bandwidth. Therefore, the adoption of hardware accelerators is rapidly advancing in performance sensitive areas. They are particularly relevant in high-throughput disciplines such as high-quality 3D computer graphics and vision, real-time data stream processing, and high-performance scientific computing. The main reason behind this trend is that these accelerators can potentially yield speedups and energy savings orders of magnitude higher than those obtained with optimized implementations for general-purpose CPU cores. A clear indicator of this trend is the prevalence of these accelerators in the supercomputing systems in the top positions of both the TOP500 and Green500 lists. As a result, during the past few years, these architectures have become powerful, capable, and inexpensive mainstream coprocessors, useful for a wide variety of applications. Furthermore, they are nowadays present in a large variety of machines, ranging from low-end single user-platforms to supercomputers.However, the benefits of heterogeneous systems do not come 'for free': scientists using these platforms have to deal not only with multiple parallelism levels, but also with the programmability differences of available accelerators. To address these challenges, we observe the development of a very rich environment for their programming, particularly in comparison with the restricted landscape of only a few years ago. A key criterion to characterize the new high-level programming tools and libraries for these devices is their positioning within the triangle of performance, coding comfort and specialization. The spectrum ranges from high-performance building blocks for common numeric or discrete transformations, to domain-specific libraries that facilitate the solution of a certain class of problems, and to general high-level abstractions targeted toward increasing programmers' productivity.In summary, the advances both in the hardware and in the programmability of accelerators, coupled with their potentially appealing performance/power ratio for a wide range of applications, have pushed organizations to invest in heterogeneous systems that include accelerators and have motivated researchers to port their algorithms to such systems and develop novel tools to facilitate their usage.This special issue contributes to this important field with extended and carefully reviewed versions of selected papers from two workshops, namely the 3rd Minisymposium on GPU Computing, which was held as

show abstract

GPIC: A set of high-efficiency CUDA Fortran code using gpu for particle-in-cell simulation in space physics

Xiong,

Huang,

Yuan

et al. 2024

Computer Physics Communications

View full text Add to dashboard Cite

Optimizing the computation of a parallel 3D finite difference algorithm for graphics processing units

Cited by 8 publications

References 20 publications

Accelerating simulations of cardiac electrical dynamics through a multi‐GPU platform and an optimized data structure

Accelerating simulations of cardiac electrical dynamics through a multi‐GPU platform and an optimized data structure

Parallel computing on graphics processing units and heterogeneous platforms

GPIC: A set of high-efficiency CUDA Fortran code using gpu for particle-in-cell simulation in space physics

Contact Info

Product

Resources

About