This paper presents in detail a full implementation of the higher order method of moments (HMoM) with a parallel out-of-core LU solver on a GPU/CPU platform. The work consists of three parts. First, a global auxiliary table and a local auxiliary table are introduced to eliminate many tedious and repetitive calculations, and a GPU-oriented implementation is proposed and optimized. Second, an overlapped grouping of all the curved quadrilaterals is proposed; with this scheme, all the submatrices can be generated efficiently one by one, without any wasted calculations, using both the video memory and the host memory. Third, a GPU-based out-of-core algorithm for LU decomposition is proposed and further developed into a hybrid GPU/CPU algorithm. Numerical examples test the robustness of the proposed algorithm by comparison with measurements and/or the traditional MoM with RWG basis functions, and demonstrate its overall performance by comparison with the existing algorithm for similar problems. For generating the HMoM matrix, the proposed algorithm achieves a speedup ratio of about 7 to 12 over the GPU-based algorithm in the literature. For LU decomposition, its speedup ratio over the 8-threaded CPU-based algorithm can exceed 13 in single precision and 7 in double precision.
Index Terms: CUDA, GPU, high-order basis function, method of moments (MoM), OpenMP, out-of-core LU solver, parallel algorithm, speedup ratio.

I. INTRODUCTION

The method of moments (MoM) has been widely applied in electromagnetic (EM) computations since Harrington published his monograph [1], especially after the Rao-Wilton-Glisson (RWG) basis functions [2] (a type of local basis function) were constructed. The early MoM mainly adopted lower order basis functions such as the RWG basis functions, and hence the number of unknowns is naturally larger, especially for electrically large problems. The MoM matrix is dense, which leads to high computational and storage complexity. To address this problem, many fast algorithms based on the MoM have emerged, such as the FMM [3], [4], the MLFMA [5], [6], the AIM [7], the P-FFT [8], the IE-FFT [9], the FG-FFT [10], the FGG-FG-FFT [11], and so on. These