Simulations of cardiac electrophysiological models in tissue, particularly in 3D, require the solution of billions of differential equations to reproduce even a few milliseconds of activity, and are therefore highly demanding in computational resources. In fact, even studies in small domains with very complex models may take several hours to reproduce seconds of electrical cardiac behavior. Today's Graphics Processing Units (GPUs) are becoming a way to accelerate such simulations and offer the added possibility of running them locally, without the need for supercomputers. Nevertheless, when using GPUs, bottlenecks related to global memory access, caused by the spatial discretization of the large tissue domains being simulated, become a major challenge. For simulations on a single GPU, we propose a strategy to accelerate the computation of the diffusion term through a data structure and memory access pattern designed to maximize coalesced memory transactions and minimize branch divergence, achieving results approximately 1.4 times faster than a standard GPU method. We also combine this data structure with a tailored communication strategy to take advantage of multi-GPU platforms. We demonstrate that, with the multi-GPU approach, simulations in 3D tissue can be only 4× slower than real time.
KEYWORDS
cardiac electrophysiology models, GPU computing, memory access optimization, parallel cardiac dynamics simulations
INTRODUCTION
The large increase in computational power over recent years has shifted the bottleneck of many algorithms to memory bandwidth and memory management.1 One typical solution employed by hardware designers to minimize this issue is a hierarchical memory system combined with memory locality optimization.

Computational systems organize the hierarchical memory system into levels. At the on-chip level, the registers are the fastest memory, with a high cost per byte and low capacity. Next, there are different cache levels according to the hardware architecture, typically called L1, L2, and so on. The main memory is the next level; here, the cost per byte is lower than that of cache or registers, but latency is high. The last level is the secondary memory, which has the highest latency and the lowest cost per byte. Overall, the cost per byte of each level determines its capacity and latency, which directly impact performance.

As each level of the hierarchical memory system has a different storage capacity and data is usually kept at the lowest memory level, computational systems must choose, for each level, which data will be prioritized to stay in memory and which will be evicted when that memory level fills up. To do so, the computer memory system employs two fundamental principles, ie, temporal and spatial locality.2 In general, these strategies aim to keep the most recently used data in the same memory level, since having to access higher memory levels drastically increases access time.

Based on these memory hierarchy principles, some researchers have tried to minimize memory system bottlenecks through s...
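To make the role of spatial locality on a GPU concrete, the following is a minimal, self-contained sketch (not the data structure proposed in this paper): two kernels read the same array, one with consecutive threads touching consecutive addresses (coalesced, so a warp is served by a few wide memory transactions) and one with a stride between neighboring threads (scattering the warp's accesses across many cache lines). The kernel names and the STRIDE value are illustrative choices.

```cuda
#include <cuda_runtime.h>

#define STRIDE 32  // illustrative stride; any value > 1 breaks coalescing

// Coalesced: thread i reads element i, so each warp of 32 threads
// touches one contiguous 128-byte segment of global memory.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighboring threads read addresses STRIDE floats apart,
// so one warp's reads are scattered over many memory transactions.
__global__ void copyStrided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * STRIDE) % n];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    copyCoalesced<<<grid, block>>>(in, out, n);
    copyStrided<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    // Timing the two launches (eg, with cudaEvent_t) typically shows the
    // strided kernel running several times slower, even though both perform
    // the same number of reads and writes; only the access pattern differs.
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

This is the same effect exploited later for the diffusion term: arranging data so that threads of a warp access contiguous addresses keeps transactions wide and few, whereas a poor layout wastes most of the available global memory bandwidth.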