Scalable fast multipole methods on distributed heterogeneous architectures

Hu, Qi; Gumerov, Nail A.; Duraiswami, Ramani

doi:10.1145/2063384.2063432

Cited by 39 publications

(47 citation statements)

References 24 publications

“…The parameter k_max used in D-M2L [9] was set to 18. Parameters M(k), k = 1 to k_max, were set to 6, 8, 12, 16, 20, 26, 30, 34, 38, 44, 48, 52, 56, 60, 60, 52, 4, and 2. Odd numbers were avoided for M(k) to improve the calculation efficiency [16].…”

Section: CPU Codes (mentioning)

confidence: 99%

Performance comparison of three types of GPU‐accelerated indirect boundary element method for voxel model analysis

Hamada

2013

Int J Numerical Modelling

Summary: An indirect boundary element method that is geared to electrostatic field analysis in voxel models is accelerated by graphics processing units (GPUs). The method considers square walls on cubic voxels as boundary surface elements and uses the fast multipole method (FMM) to analyze large-scale models. On the basis of two conventional CPU codes, three GPU codes are programmed in search of higher computing performance. These GPU codes are designed as follows: In GPU code 1, direct and far fields in the FMM are simultaneously calculated on the GPU and the CPU, respectively; in GPU code 2, both fields are calculated on the GPU with a rotation-coaxial translation-rotation decomposition algorithm; and in GPU code 3, both fields are calculated on the GPU with a diagonal translation scheme. The electric fields in human models are generated by applying a 50-Hz magnetic field or by injecting direct-current (DC) current through two electrodes and they were calculated successfully using a personal computer with three GPUs and six CPU cores. An analysis with 3.9 million surface elements took 89.4 s to solve its governing linear system with double-precision floating-point arithmetic. GPU codes 1, 2, and 3 demonstrated the least memory usage, the greatest speed-up ratio, and the fastest calculation time, respectively. These results show an example of the trade-off relationships of computation performances on a heterogeneous CPU-GPU system.

“…Starting from [1], we design a new scalable heterogeneous FMM algorithm, which fully distributes all the translations among nodes and substantially decreases its communication costs. This is a consequence of the new data structures which separate the computation and communication to avoid synchronization during GPU computations.…”

Section: A Present Contribution (mentioning)

confidence: 99%

“…Implementation details for import or export data via LETs are not explicitly described in the well known distributed FMM papers, such as [8], [9], [11], [12]. Recently, [1] developed a distributed FMM algorithms for heterogeneous clusters. However, their algorithm repeated part of translation computations among nodes and required coefficients exchange of all the spatial boxes at the octree's bottom level.…”

Section: Introduction (mentioning)

confidence: 99%

Scalable Distributed Fast Multipole Methods

Gumerov

Duraiswami

2012

2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Confe

Self Cite

Abstract: The Fast Multipole Method (FMM) allows O(N) evaluation to any arbitrary precision of N-body interactions that arise in many scientific contexts. These methods have been parallelized, with a recent set of papers attempting to parallelize them on heterogeneous CPU/GPU architectures [1]. While impressive performance was reported, the algorithms did not demonstrate complete weak or strong scalability. Further, the algorithms were not demonstrated on nonuniform distributions of particles that arise in practice. In this paper, we develop an efficient scalable version of the FMM that can be scaled well on many heterogeneous nodes for nonuniform data. Key contributions of our work are data structures that allow uniform work distribution over multiple computing nodes, and that minimize the communication cost. These new data structures are computed using a parallel algorithm, and only require a small additional computation overhead. Numerical simulations on a heterogeneous cluster empirically demonstrate the performance of our algorithm.
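For context on the O(N) claim in the abstract above, the baseline that the FMM replaces is the direct pairwise sum, which costs O(N^2). The following is a minimal sketch of that baseline only, not of the FMM or of any code from the cited papers. It assumes a Coulomb-type 1/r kernel and omits the self-interaction; the function name `direct_potentials` is hypothetical.

```python
import math


def direct_potentials(points, charges):
    """Direct O(N^2) evaluation of phi_i = sum over j != i of q_j / |x_i - x_j|.

    Assumptions (not taken from the cited papers): a 1/r kernel,
    3D points given as tuples, and no self-interaction term.
    """
    n = len(points)
    phi = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            phi[i] += charges[j] / math.dist(points[i], points[j])
    return phi


if __name__ == "__main__":
    # Two unit charges separated by a distance of 2.
    print(direct_potentials([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0]))
```

The double loop makes the quadratic cost explicit: doubling N roughly quadruples the number of kernel evaluations, which is the growth the FMM is designed to avoid.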

“…Special purpose hardware such as graphics processors or heterogeneous CPU/GPU architectures also allow the fast computation of finite sums, either via brute force summation [18], or via the mapping of the FMM onto these architectures [19,20,21,22]. Yokota et al [22] favorably compare a large scale FMM-based vortex element computations with a direct numerical simulation via periodic pseudospectral methods.…”

Section: Introduction (mentioning)

confidence: 99%

A method to compute periodic sums

Gumerov

Duraiswami

2014

Journal of Computational Physics

Self Cite

In a number of problems in computational physics, a finite sum of kernel functions centered at N particle locations located in a box in three dimensions must be extended by imposing periodic boundary conditions on box boundaries. Even though the finite sum can be efficiently computed via fast summation algorithms, such as the fast multipole method (FMM), the periodized extension is usually treated via a different algorithm, Ewald summation, accelerated via the fast Fourier transform (FFT). A different approach to compute this periodized sum just using a blackbox finite fast summation algorithm is presented in this paper. The method splits the periodized sum into two parts. The first, comprising the contribution of all points outside a large sphere enclosing the box, and some of its neighbors, is approximated inside the box by a collection of kernel functions ("sources") placed on the surface of the sphere or using an expansion in terms of spectrally convergent local basis functions. The second part, comprising the part inside the sphere, and including the box and its immediate neighborhood, is treated via available summation algorithms. The coefficients of the sources are determined by least squares collocation of the periodicity condition of the total potential, imposed on a circumspherical surface for the box. While the method is presented in general, details are worked out for the case of evaluating electrostatic potentials and forces. Results show that when used with the FMM, the periodized sum can be computed to any specified accuracy, at an additional cost of the order of the free-space FMM. Several technical details and efficient algorithms for auxiliary computations are provided, as are numerical comparisons.
