A highly optimized Monte Carlo (MC) code package for simulating light transport is developed for the Fermi GPU, NVIDIA's latest graphics processing unit (GPU) built for general-purpose computing. In biomedical optics, the MC method is the gold-standard approach for simulating light transport in biological tissue, owing both to its accuracy and to its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for photodynamic therapy (PDT), is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has proven challenging, particularly because sharing the simulation result matrices among many parallel threads demands frequent use of atomic instructions on the slow GPU global memory. This paper proposes an optimization scheme that uses the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners ().
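A minimal CUDA sketch of the shared-memory scheme described above, assuming a small flattened detection grid; the kernel and variable names (accumulate_photon_weights, NUM_BINS, bin_idx) are illustrative and not taken from GPU-MCML:

    #define NUM_BINS 256  // assumed size of the flattened detection grid

    // Each block accumulates photon weights in fast on-chip shared memory,
    // then flushes once per bin to the global result array, so contended
    // atomicAdd traffic to slow global memory is greatly reduced.
    // bin_idx values are assumed to lie in [0, NUM_BINS).
    __global__ void accumulate_photon_weights(float *global_hist,
                                              const int *bin_idx,
                                              const float *weight, int n)
    {
        __shared__ float local_hist[NUM_BINS];

        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            local_hist[i] = 0.0f;                 // zero the block-local copy
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&local_hist[bin_idx[i]], weight[i]);  // on-chip atomics
        __syncthreads();

        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            atomicAdd(&global_hist[i], local_hist[i]);      // one flush per bin
    }

Shared-memory floating-point atomics require compute capability 2.0 (Fermi) or newer, which matches the hardware targeted here.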
Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80%, respectively, and that of the real-world application by 12% and 16%, respectively.

Keywords: Branch divergence, GPGPU, Data parallel programming

BACKGROUND
This section briefly describes the CUDA programming model and the architecture of NVIDIA GPUs [13]. In particular, we describe the SIMD execution model and how divergent branches are executed.
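As a sketch of branch distribution (illustrative device code, not one of the paper's benchmarks), consider a divergent branch whose two paths share a structurally identical multiply-add:

    // Before: when a warp diverges here, the two paths execute
    // serially, each carrying its own copy of the multiply-add.
    __device__ float before(bool cond, float x, float a, float b,
                            float c, float d)
    {
        float t;
        if (cond)
            t = a * x + b;
        else
            t = c * x + d;
        return t;
    }

    // After branch distribution: only the cheap operand selection
    // diverges; the common multiply-add executes once per warp.
    __device__ float after(bool cond, float x, float a, float b,
                           float c, float d)
    {
        float p = cond ? a : c;
        float q = cond ? b : d;
        return p * x + q;
    }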
Recent improvements in the computing power and programmability of graphics processing units (GPUs) have enabled the possibility of using GPUs to accelerate scientific applications, including time-consuming simulations in biomedical optics. This paper describes the acceleration of a standard code for the Monte Carlo (MC) simulation of photon transport on GPUs. A faster means of performing MC simulations would enable the use of MC-based models for light dose computation in iterative optimization problems such as PDT treatment planning. We describe the computation and how it is mapped onto the many parallel computational units now available on the NVIDIA GTX 200 series GPUs. For a 5-layer skin model simulation, a speedup of 277x was achieved on a single GTX280 GPU over the code executed on an Intel Xeon 5160 processor using 1 CPU core. This approach can be scaled by employing multiple GPUs in a single computer: a 1052x speedup was obtained using 4 GPUs for the same simulation.
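A host-side sketch of the multi-GPU scaling idea, written against the modern CUDA runtime API; run_simulation_on_device is a hypothetical wrapper around the kernel launch, not the actual interface of the code described above:

    #include <cuda_runtime.h>

    void run_simulation_on_device(int device, long long num_photons); // hypothetical

    // Split the photon budget evenly across all available GPUs. Kernel
    // launches are asynchronous, so the devices simulate concurrently;
    // partial results are copied back and summed on the host afterwards.
    void run_multi_gpu(long long total_photons)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        if (num_gpus == 0) return;

        for (int dev = 0; dev < num_gpus; ++dev) {
            cudaSetDevice(dev);
            run_simulation_on_device(dev, total_photons / num_gpus);
        }
        for (int dev = 0; dev < num_gpus; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();   // wait for each device to finish
        }
    }

Because the per-GPU simulations are independent, this decomposition is what yields the near-linear scaling reported above.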
Branch divergence can incur a high performance penalty on GPGPU programs. We propose a software optimization, called loop merging, that aims to reduce divergence caused by a loop whose trip count varies across warp threads. This optimization merges the divergent loop with one or more outer surrounding loops into a single loop. In this way, warp threads do not have to wait for each other in each outer loop iteration, improving execution efficiency. We implement loop merging in LLVM. Our evaluation on a Fermi GPU shows that it improves the performance of a synthetic benchmark and five application benchmarks by up to 1.6× and 4.3×, respectively.
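A sketch of the transformation under simplified assumptions (a per-thread view with placeholder names body, trip, and N; the paper's LLVM pass may differ in detail):

    __device__ void body(int i, int j);   // placeholder for real per-thread work

    // Before: trip[i] varies across the warp, so in every outer
    // iteration, threads that finish the inner loop early idle at its
    // exit until the slowest warp thread catches up.
    __device__ void nested(const int *trip, int N)
    {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < trip[i]; ++j)
                body(i, j);
    }

    // After loop merging: one flattened loop. A thread whose inner work
    // is exhausted advances i immediately, so the warp reconverges each
    // iteration instead of once per inner loop.
    __device__ void merged(const int *trip, int N)
    {
        for (int i = 0, j = 0; i < N; ) {
            if (j < trip[i]) {
                body(i, j++);
            } else {
                ++i;
                j = 0;
            }
        }
    }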