Performance analysis of SSE and AVX instructions in multi-core CPUs and GPU computing on FDTD scheme for solid and fluid vibration problems

Francés, Jorge; Bleda, Sergio; Márquez, Andrés; Neipp, Cristian; Otero, Beatriz

doi:10.1007/s11227-013-1065-x

Cited by 8 publications

(6 citation statements)

References 18 publications


“…It is worth to note that the scaling factor and the relationship between spatial and time resolutions have been carefully chosen in order to avoid rounding and finite precision errors [15]. The usage of double precision was experimentally proven not to improve significantly the accuracy of the results obtained but it implied a dramatic downside effect in terms of computational resources.…”

Section: Multi-cpu Approach Of the Fdtd Methods (mentioning)

confidence: 99%

“…Regarding this aspect, some works related with GPU computing and FDTD in the field of Electromagnetics have been developed [11][12][13]. For FDTD and GPU computing applied to vibration problems there are some contributions related with seismology [14] and also for vibroacoustics [15]. The application of multi-GPU has been applied to FDTD and Electromagnetics in [16,17] but an accurate performance analysis of multi-CPU FDTD code that uses SSE and AVX instructions compared to a multi-GPU version with Peer-to-Peer communication has not been carried out to the best of our knowledge.…”

Section: Introduction (mentioning)

confidence: 99%


Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

Francés, Otero, Bleda et al. 2015

Computer Physics Communications

Abstract: The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bidimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes.
However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.


“…In order to achieve this reduction parallel strategies have been considered. 11,20,21 Basically, the auto-vectorisation performed by the compiler and OpenMP have been considered. For enabling properly the auto-vectorisation an efficient memory alignment, a correct loop count and proper structures were considered.…”

Section: One-dimensional Nonlinear Coated Binary Grating (mentioning)

confidence: 99%
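The statement above names three conditions for auto-vectorisation: aligned memory, a loop count known on entry, and simple data structures. The C sketch below is only an illustration of those conditions and is not taken from the cited paper. The helper names, the 32-byte alignment (one AVX register) and the one-dimensional update formula are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 32u /* assumed: 32 bytes, the width of one AVX register */

/* Hypothetical helper: allocate n floats on an ALIGNMENT-byte boundary. */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    /* aligned_alloc requires the size to be a multiple of the alignment. */
    bytes = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    return aligned_alloc(ALIGNMENT, bytes);
}

/* Hypothetical update: e[i] += c * (h[i + 1] - h[i]); h holds n + 1 values.
 * The trip count n is fixed before the loop starts, the arrays are plain
 * contiguous floats, and restrict tells the compiler they do not overlap. */
void update_field(float *restrict e, const float *restrict h, float c, size_t n)
{
    assert(e != NULL && h != NULL);
    /* Threads split the range; each thread's chunk may be vectorised. */
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i) {
        e[i] += c * (h[i + 1] - h[i]);
    }
}
```

Compiled without OpenMP support (for example, without `-fopenmp` on GCC or Clang), the pragma is ignored and the loop runs serially with the same result. Memory returned by `alloc_aligned_floats` is released with `free`.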

Efficient split field FDTD analysis of third-order nonlinear materials in two-dimensionally periodic media

et al. 2016

Self Cite


In this work the split-field finite-difference time-domain method (SF-FDTD) has been extended for the analysis of two-dimensionally periodic structures with third-order nonlinear media. The accuracy of the method is verified by comparisons with the nonlinear Fourier Modal Method (FMM). Once the formalism has been validated, examples of one- and two-dimensional nonlinear gratings are analysed. Regarding the 2D case, the shifting in resonant waveguides is corroborated. Here, not only the scalar Kerr effect is considered, the tensorial nature of the third-order nonlinear susceptibility is also included. The consideration of nonlinear materials in this kind of devices permits to design tunable devices such as variable band filters. However, the third-order nonlinear susceptibility is usually small and high intensities are needed in order to trigger the nonlinear effect. Here, a one-dimensional CBG is analysed in both linear and nonlinear regime and the shifting of the resonance peaks in both TE and TM are achieved numerically. The application of a numerical method based on the finite-difference time-domain method permits to analyse this issue from the time domain, thus bistability curves are also computed by means of the numerical method. These curves show how the nonlinear effect modifies the properties of the structure as a function of variable input pump field. When taking the nonlinear behaviour into account, the estimation of the electric field components becomes more challenging. In this paper, we present a set of acceleration strategies based on parallel software and hardware solutions.


“…Vectorization is the process by which the implementation of an algorithm is converted from scalar to vectorial such that one single operation is executed over a group of contiguous values, all at the same time. In our particular case, the vectorization only applies to large floating point operations (inner loop) [23]. Thus, when the loops are collapsed, the granularity is reduced and the vectorization could not be applied; that is, by collapsing the loops, computing is insufficient for vectorization.…”

Section: Experiments On Cpu the Configuration Of The Workstation Is (mentioning)

confidence: 99%
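The statement above describes vectorisation being applied to the inner loop only, with the loops left uncollapsed. The C sketch below shows one common way to express that layout with OpenMP directives. It is an illustration under stated assumptions, not the cited authors' code: the kernel `axpy_rows` and its arithmetic are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: out = alpha * a + b on a row-major rows x cols grid. */
void axpy_rows(float *restrict out, const float *restrict a,
               const float *restrict b, float alpha, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    /* Outer loop: whole rows are shared among threads (coarse grain). */
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        /* Inner loop: contiguous, unit-stride accesses, so one instruction
         * can process a group of neighbouring values at once. */
        #pragma omp simd
        for (int j = 0; j < cols; ++j) {
            size_t k = (size_t)i * (size_t)cols + (size_t)j;
            out[k] = alpha * a[k] + b[k];
        }
    }
}
```

Keeping the two loops separate leaves each thread with full rows of contiguous data for the inner loop. Whether collapsing the loops actually prevents vectorisation depends on the compiler and the loop body; the statement reports that it did in the citing authors' case.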

Performance of a Code Migration for the Simulation of Supersonic Ejector Flow to SMP, MIC, and GPU Using OpenMP, OpenMP+LEO, and OpenACC Directives

Couder-Castañeda, Barrios-Piña, Gitler 2015

Scientific Programming

A serial source code for simulating a supersonic ejector flow is accelerated using parallelization based on OpenMP and OpenACC directives. The purpose is to reduce the development costs and to simplify the maintenance of the application due to the complexity of the FORTRAN source code. This research follows well-proven strategies in order to obtain the best performance in both OpenMP and OpenACC. OpenMP has become the programming standard for scientific multicore software and OpenACC is one true alternative for graphics accelerators without the need of programming low level kernels. The strategies using OpenMP are oriented towards reducing the creation of parallel regions, tasks creation to handle boundary conditions, and a nested control of the loop time for the programming in offload mode specifically for the Xeon Phi. In OpenACC, the strategy focuses on maintaining the data regions among the executions of the kernels. Experiments for performance and validation are conducted here on a 12-core Xeon CPU, Xeon Phi 5110p, and Tesla C2070, obtaining the best performance from the latter. The Tesla C2070 presented an acceleration factor of 9.86X, 1.6X, and 4.5X compared against the serial version on CPU, 12-core Xeon CPU, and Xeon Phi, respectively.
