GPU-accelerated Monte Carlo simulation for photodynamic therapy treatment planning

Lo, William Chun Yip; Han, Tianyi David; Rose, Jonathan; Lilge, Lothar

doi:10.1117/12.831944

Cited by 16 publications

(13 citation statements)

References 7 publications


“…We use two synthetic benchmarks (one for each optimization) and one highly-optimized real-world application called Monte Carlo simulation for Multi-Layered media (MCML) [9]. We parameterize the synthetic benchmarks to explore the impact of various kernel characteristics on the benefit of these optimizations.…”

Section: Introduction

mentioning

confidence: 99%

Reducing branch divergence in GPU programs

Han

Abdelrahman

2011

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units

Self Cite

144


Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80% respectively, and that of the real-world application by 12% and 16% respectively.

Keywords: Branch divergence, GPGPU, Data parallel programming

BACKGROUND

This section briefly describes the CUDA programming model and the architecture of NVIDIA GPUs [13]. In particular, we describe the SIMD execution model and how divergent branches are executed.
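The iteration delaying idea in this abstract can be explored with a toy cost model. The sketch below is not the authors' implementation: it assumes a warp is a list of per-thread branch directions, that a divergent iteration costs one pass per branch path, and that the delayed variant picks the majority direction each pass (ties favour `True`). The function names `baseline_passes` and `delayed_passes` are hypothetical, and plain Python only models the scheduling, not real SIMD hardware.

```python
from collections import deque


def baseline_passes(warp):
    """Count branch-path executions when every thread advances each iteration.

    `warp` is a list of per-thread lists of booleans, one branch direction
    per loop iteration. A divergent iteration costs two passes (one per
    path); a uniform iteration costs one.
    """
    queues = [deque(thread) for thread in warp]
    passes = 0
    while any(queues):
        # Directions requested by the threads that still have work.
        directions = {q.popleft() for q in queues if q}
        passes += len(directions)
    return passes


def delayed_passes(warp):
    """Count passes when only the majority direction runs in each pass.

    Threads whose next iteration takes the other direction keep that
    iteration for a later pass (assumed policy: majority vote, ties -> True).
    """
    queues = [deque(thread) for thread in warp]
    passes = 0
    while any(queues):
        heads = [q[0] for q in queues if q]
        taken = heads.count(True) >= heads.count(False)
        for q in queues:
            if q and q[0] == taken:
                q.popleft()  # this thread executes; the others wait
        passes += 1
    return passes
```

For two threads whose directions alternate out of phase (`[True, False, True, False]` and `[False, True, False, True]`), this model counts 8 passes without delaying and 5 with it. The model ignores the bookkeeping cost of delaying, so it only indicates where the technique could help.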


“…Unless each thread uses a unique sequence of random numbers, there is a risk that multiple threads will simply re-calculate one another’s results, which would affect the signal-to-noise ratio in the resulting simulation output. Simply seeding the PRNG state differently for each thread, an approach taken in [11, 16, 18], is not sufficient to ensure against inter-thread correlation of random numbers. The GPU implementation of the Mersenne Twister (MT) PRNG used by Fang and Boas [16] provides unique random numbers for threads within a block but still potentially suffers from correlation between different thread blocks.…”

mentioning

confidence: 99%

Next-generation acceleration and code optimization for light transport in turbid media using GPUs

Alerstam

Han

et al. 2010

Biomed. Opt. Express

Self Cite

157

107


A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA - the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for PDT, is limited by their long computation time. Despite its parallel nature, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as an open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners.
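The shared-memory scheme mentioned in this abstract can be illustrated with a small accounting model. The sketch below is an assumption-laden illustration, not the GPU-MCML code: it supposes that the first `cached_bins` bins of the result grid form a frequently hit region, that each thread block accumulates those bins in a local cache standing in for shared memory, and that the cache is flushed to the global grid once at the end. The function names are hypothetical, and the "atomic operations" are only counted, since single-threaded Python has no contention.

```python
def accumulate_global(deposits, n_bins):
    """Apply every deposit directly to the global grid.

    `deposits` is a list of (bin_index, weight) pairs. Each deposit is
    counted as one simulated atomic add to global memory.
    """
    grid = [0] * n_bins
    global_ops = 0
    for bin_index, weight in deposits:
        grid[bin_index] += weight
        global_ops += 1
    return grid, global_ops


def accumulate_cached(deposits, n_bins, cached_bins):
    """Route deposits for the first `cached_bins` bins through a local cache.

    The cache stands in for per-block shared memory; it is flushed to the
    global grid once, costing one simulated global atomic per non-empty bin.
    """
    grid = [0] * n_bins
    cache = [0] * cached_bins
    global_ops = 0
    for bin_index, weight in deposits:
        if bin_index < cached_bins:
            cache[bin_index] += weight  # cheap local update
        else:
            grid[bin_index] += weight  # falls through to global memory
            global_ops += 1
    for bin_index, total in enumerate(cache):
        if total:
            grid[bin_index] += total  # single flush per cached bin
            global_ops += 1
    return grid, global_ops
```

Both functions produce the same grid. The difference is the number of simulated global-memory updates, which shrinks when many deposits land in the cached region. Integer weights are used here so the two results compare exactly.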


“…There has been a lot of recent activity in adapting Monte Carlo transport algorithms to streaming processors (in the radiation treatment planning community see for instance, Badal and Badano, 2009b; Lo et al, 2009; Jia et al, 2010; Tickner, 2010; and in neutronics, Nelson and Ivanov, 2010; Aiping Ding et al, 2011). Although a Monte Carlo simulation is embarrassingly parallel, it does not map well to this type of architecture, which groups processing threads into a large number of Single Instruction Multiple Data (SIMD) like instruction units.…”

Section: Introduction

mentioning

confidence: 98%

A Coarse Grained Particle Transport Solver Designed Specifically for Graphics Processing Units

Heerden

2012

Transport Theory and Statistical Physics


This article introduces a novel coarse-grained particle transport solver, designed specifically for streaming processor architectures. The coarse particles are transported using a Monte Carlo algorithm with a locally homogenized collision operator. Local errors introduced by the homogenization procedure and the use of (deterministic) quadratures, are described and analyzed. A brief description of how the simulation is mapped to the streaming processor (Graphics Processing Unit) is also given.

