Efficient method of moment simulation based on higher order bases and CPU/GPU parallelization

Kolundžija, Branko M.; Olćan, Dragan I.; Zoric, Dusan

doi:10.1109/aps.2012.6348419

Cited by 1 publication

(7 citation statements)

References 1 publication


“…The authors of [25], [26] used a thread to deal with those integrations caused by one pair of patches, which is ideologically similar to the method in [27]. Note that the method in [27] is for the RWG basis functions rather than the higher order basis functions.…”

Section: Strategy Comparison (mentioning)

confidence: 99%

“…For example, when when . If all the integrations caused by a pair of patches were assigned to one thread as in [25], [26], then at least single-precision or double-precision units in the on-chip memory of this thread should be enabled for the high efficiency, which is not consistent with GPU-oriented programming. In that case, the NVCC compiler will automatically enable the local memory of this thread as a supplement [21], leading to significant access latency because the local memory is on-board like the GPU global memory, which is harmful to the performance of the whole program.…”

Section: Strategy Comparison (mentioning)

confidence: 99%

“…In fact, this speed is several times faster than that of the shared memory of a block [23]. Unlike the method in [25], [26], all the integrations caused by a pair of patches are assigned to one block, and most relevant data are precalculated and stored in the GPU global memory and the shared memory. Each warp in the block can quickly read the shared memory to get the required data by using broadcast.…”

Section: Strategy Comparison (mentioning)

confidence: 99%

“…Finally, the shared memory of a block is already large enough to store the data related to the extractions of singularity and the analytical calculations, so the calculations of all the integrals (whether they are regular or singular) will be assigned to the GPU in our method rather than to the multicore CPU. In Section V, it will be demonstrated that our method is much faster than that in [25], [26].…”

Section: Strategy Comparison (mentioning)

confidence: 99%

“…Early GPUs are mainly for image processing, while modern GPUs are equipped with a common programming interface such as NVIDIA's CUDA [21]-[23] (not requiring programmers to master a lot of graphics knowledge), and are especially suitable for massively parallel numerical computations. Then how to fully utilize GPU to improve the efficiency of the HMoM becomes a concern in recent years [25], [26]. Of course, besides the HMoM, other EM algorithms, such as the MoM and its fast algorithms [27]-[32] and the FDTD [34], [35], can also be accelerated by using GPU.…”

mentioning

confidence: 99%


Higher Order Method of Moments With a Parallel Out-of-Core LU Solver on GPU/CPU Platform

Zhou, Chen, et al. 2014

IEEE Trans. Antennas Propagat.


In this paper, a full realization of the higher order method of moments (HMoM) with a parallel out-of-core LU solver on GPU/CPU platform is presented in detail, in three parts. In the first part, both a global-auxiliary table and a local-auxiliary table are introduced to reduce many tedious and repetitive calculations, and then a realization for GPU-oriented programming is proposed and optimized. In the second part, an overlapped grouping of all the curved quadrilaterals is proposed. With this scheme, all the submatrices can be efficiently generated one by one without wasting any calculations with the help of both the video memory and the host memory. In the third part, a GPU-based out-of-core algorithm for LU decomposition is proposed and further developed into a hybrid GPU/CPU algorithm. Numerical examples are provided to test the robustness of the proposed algorithm by comparison with the measurement and/or the traditional MoM with RWG basis functions, and to demonstrate the overall performance of the proposed algorithm by comparison with the existing algorithm for dealing with similar problems. The speedup ratio of the proposed algorithm for generating the HMoM matrix is about 7 to 12 compared with the GPU-based algorithm in the literature. Also, compared with the 8-threaded CPU-based algorithm, the speedup ratio of the proposed algorithm for LU decomposition can exceed 13 for the single precision case and 7 for the double precision case.

Index Terms: CUDA, GPU, high-order basis function, method of moments (MoM), OpenMP, out-of-core LU solver, parallel algorithm, speedup ratio.

I. INTRODUCTION

The method of moments (MoM) has gained wide applications in electromagnetic (EM) computations since Harrington published his monograph [1], especially after Rao-Wilton-Glisson (RWG) basis functions [2] (a type of local basis functions) were constructed. The early MoM mainly adopted such lower order basis functions as RWG basis functions, and hence, the number of unknowns is naturally larger, especially for electrically large problems. The MoM matrix is a dense matrix, leading to higher computational complexity and storage complexity. For this problem, many fast algorithms based on the MoM have emerged, such as the FMM [3], [4], the MLFMA [5], [6], the AIM [7], the P-FFT [8], the IE-FFT [9], the FG-FFT [10], the FGG-FG-FFT [11], and so on. These
