A highly optimized Monte Carlo (MC) code package for simulating light transport is developed for the Fermi GPU, NVIDIA's latest graphics processing unit (GPU) built for general-purpose computing. In biomedical optics, the MC method is the gold-standard approach for simulating light transport in biological tissue, owing both to its accuracy and to its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for photodynamic therapy (PDT), is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has proven challenging, particularly because sharing the simulation result matrices among many parallel threads demands frequent use of atomic instructions on the slow GPU global memory. This paper proposes an optimization scheme that uses the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners ().
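A minimal CUDA sketch of the shared-memory scheme described above, assuming a small flattened detection grid; the kernel and variable names (accumulate_photon_weights, NUM_BINS, bin_idx) are illustrative and not taken from GPU-MCML:

    #define NUM_BINS 256  // assumed size of the flattened detection grid

    // Each block accumulates photon weights in fast on-chip shared memory,
    // then flushes once per bin to the global result array, so contended
    // atomicAdd traffic to slow global memory is greatly reduced.
    // bin_idx values are assumed to lie in [0, NUM_BINS).
    __global__ void accumulate_photon_weights(float *global_hist,
                                              const int *bin_idx,
                                              const float *weight, int n)
    {
        __shared__ float local_hist[NUM_BINS];

        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            local_hist[i] = 0.0f;                 // zero the block-local copy
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&local_hist[bin_idx[i]], weight[i]);  // on-chip atomics
        __syncthreads();

        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            atomicAdd(&global_hist[i], local_hist[i]);      // one flush per bin
    }

Shared-memory floating-point atomics require compute capability 2.0 (Fermi) or newer, which matches the hardware targeted here.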
Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80%, respectively, and that of the real-world application by 12% and 16%, respectively.

Keywords: Branch divergence, GPGPU, Data parallel programming

BACKGROUND
This section briefly describes the CUDA programming model and the architecture of NVIDIA GPUs [13]. In particular, we describe the SIMD execution model and how divergent branches are executed.
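As a sketch of branch distribution (illustrative device code, not one of the paper's benchmarks), consider a divergent branch whose two paths share a structurally identical multiply-add:

    // Before: when a warp diverges here, the two paths execute
    // serially, each carrying its own copy of the multiply-add.
    __device__ float before(bool cond, float x, float a, float b,
                            float c, float d)
    {
        float t;
        if (cond)
            t = a * x + b;
        else
            t = c * x + d;
        return t;
    }

    // After branch distribution: only the cheap operand selection
    // diverges; the common multiply-add executes once per warp.
    __device__ float after(bool cond, float x, float a, float b,
                           float c, float d)
    {
        float p = cond ? a : c;
        float q = cond ? b : d;
        return p * x + q;
    }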
Recent improvements in the computing power and programmability of graphics processing units (GPUs) have enabled the possibility of using GPUs to accelerate scientific applications, including time-consuming simulations in biomedical optics. This paper describes the acceleration of a standard code for the Monte Carlo (MC) simulation of photon transport on GPUs. A faster means of performing MC simulations would enable the use of MC-based models for light dose computation in iterative optimization problems such as PDT treatment planning. We describe the computation and how it is mapped onto the many parallel computational units now available on the NVIDIA GTX 200 series GPUs. For a 5-layer skin model simulation, a speedup of 277x was achieved on a single GTX280 GPU over the code executed on an Intel Xeon 5160 processor using 1 CPU core. This approach can be scaled by employing multiple GPUs in a single computer: a 1052x speedup was obtained using 4 GPUs for the same simulation.
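A host-side sketch of the multi-GPU scaling idea, written against the modern CUDA runtime API; run_simulation_on_device is a hypothetical wrapper around the kernel launch, not the actual interface of the code described above:

    #include <cuda_runtime.h>

    void run_simulation_on_device(int device, long long num_photons); // hypothetical

    // Split the photon budget evenly across all available GPUs. Kernel
    // launches are asynchronous, so the devices simulate concurrently;
    // partial results are copied back and summed on the host afterwards.
    void run_multi_gpu(long long total_photons)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        if (num_gpus == 0) return;

        for (int dev = 0; dev < num_gpus; ++dev) {
            cudaSetDevice(dev);
            run_simulation_on_device(dev, total_photons / num_gpus);
        }
        for (int dev = 0; dev < num_gpus; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();   // wait for each device to finish
        }
    }

Because the per-GPU simulations are independent, this decomposition is what yields the near-linear scaling reported above.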
Branch divergence can incur a high performance penalty on GPGPU programs. We propose a software optimization, called loop merging, that aims to reduce divergence caused by a loop whose trip count varies across warp threads. This optimization merges the divergent loop with one or more outer surrounding loops into a single loop. In this way, warp threads do not have to wait for each other in each outer loop iteration, improving execution efficiency. We implement loop merging in LLVM. Our evaluation on a Fermi GPU shows that it improves the performance of a synthetic benchmark and five application benchmarks by up to 1.6× and 4.3×, respectively.
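A sketch of the transformation under simplified assumptions (a per-thread view with placeholder names body, trip, and N; the paper's LLVM pass may differ in detail):

    __device__ void body(int i, int j);   // placeholder for real per-thread work

    // Before: trip[i] varies across the warp, so in every outer
    // iteration, threads that finish the inner loop early idle at its
    // exit until the slowest warp thread catches up.
    __device__ void nested(const int *trip, int N)
    {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < trip[i]; ++j)
                body(i, j);
    }

    // After loop merging: one flattened loop. A thread whose inner work
    // is exhausted advances i immediately, so the warp reconverges each
    // iteration instead of once per inner loop.
    __device__ void merged(const int *trip, int N)
    {
        for (int i = 0, j = 0; i < N; ) {
            if (j < trip[i]) {
                body(i, j++);
            } else {
                ++i;
                j = 0;
            }
        }
    }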