Kepler GPU vs. Xeon Phi: Performance case study with a high-order CFD application

Deng, Liang; Bai, Hanli; Zhao, Dan; Wang, Fang

doi:10.1109/compcomm.2015.7387546

Cited by 5 publications

(5 citation statements)

References 12 publications


“…Loop optimizations: unrolling [9, 17, 23, 24, 29, 50, 84, 90], collapsing [4, 6, 7, 13, 20, 21, 44, 54], splitting [22, 28]. Blocking (tiling): in cache [14, 15, 18, 20, 21, 22, 27, 39, 44, 52, 54, 69], in registers [68, 69]. Compile-time optimizations: using pre-computed values [35, 52], specifying array and loop bounds at compile time [6, 54]. Compute-related optimizations: reusing intermediate variables [22, 35], using conflict-detection instruction of AVX-512 [52, 85], performing redundant computation to avoid data-communication or atomic operations [52, 82]. Array transpose [6, 79]…”

Section: Table 3, Optimization Strategies (mentioning)

“…Overall, Phi does not provide comparable performance to CPU as a stand-alone shared memory processor. Thread-affinity strategy: balanced [4, 14, 20, 21, 23, 26, 36], scatter [37], compact [92], no single winner [13, 84, 94]. Memory mode: cache [60, 62], flat [55, 90, 96], hybrid (none), no single winner [10, 11, 12, 52, 54, 57, 97]. Interconnect clustering mode: all-to-all (none), quadrant [11, 62], sub-NUMA [10, 55, 57, 96, 97], no single winner [52, 96]…”

Section: Gaining Insights Into Phi Architecture (mentioning)

“…, 12, 13, 14, 16, 17, 18, 20, 21, 24, 27, 28, 29, 30, 33, 34, 35, 36, 37, 39, 42, 43, 44, 45, 46, 47, 48, 50, 52, 55, 57, 59, 63, 66, 84, 86, 90, 92, 93, 95, 99; Intel MKL [2, 17, 19, 31, 32, 40, 93, 99]…”

mentioning


A survey on evaluating and optimizing performance of Intel Xeon Phi

Mittal

2020

Concurrency and Computation


Summary Intel's Xeon Phi combines the parallel processing power of a many‐core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer‐architecture and high‐performance computing.


“…Because each accelerator has its advantages and disadvantages for certain classes of problems [22,3,17], selecting the best option for a given application is key when searching for maximum performance. To provide some guidelines for such selection, this article presents a comparative analysis between two different HPC architectures (Intel Xeon Phi KNL vs. NVIDIA Pascal).…”

Section: Introduction (mentioning)

Comparison of HPC Architectures for Computing All-Pairs Shortest Paths. Intel Xeon Phi KNL vs NVIDIA Pascal

Costanzo; Rucci; Costi; et al.

2021

Communications in Computer and Information Science


Today, one of the main challenges for high-performance computing systems is to improve their performance while keeping energy consumption at acceptable levels. In this context, a well-established strategy is to use accelerators such as GPUs or many-core Intel Xeon Phi processors. In this work, devices of the NVIDIA Pascal and Intel Xeon Phi Knights Landing architectures are described and compared. Selecting the Floyd-Warshall algorithm as a representative case of graph-based, memory-bound applications, optimized implementations were developed to analyze and compare performance and energy efficiency on both devices. As expected, Xeon Phi showed superior performance with double-precision data. However, contrary to our preliminary analysis, the performance and energy efficiency of both devices were found to be comparable with single-precision data.


“…As a case study, Ref. [8] compared the performance of high-order weighted essentially non-oscillatory scheme CFD application on both K20c GPU and Xeon Phi 31SP MIC, and the result showed that when vector processing units are fully utilized the MIC can achieve equivalent performance to that of GPUs. Ref.…”

Section: Introduction (mentioning)

Performance optimizations for scalable CFD applications on hybrid CPU+MIC heterogeneous computing system with millions of cores

et al. 2018


For computational fluid dynamics (CFD) applications with a large number of grid points/cells, parallel computing is a common and efficient strategy for reducing computational time. Achieving the best performance on modern supercomputers, especially those with heterogeneous computing resources such as hybrid CPU+GPU or CPU+Intel Xeon Phi (MIC) coprocessor systems, remains a great challenge. An in-house parallel CFD code capable of simulating three-dimensional structured-grid applications is developed and tested in this study. Several methods of parallelization, performance optimization, and code tuning, both in the CPU-only homogeneous system and in the heterogeneous system, are proposed. These methods are based on identifying potential parallelism in applications, balancing the workload among all kinds of computing devices, tuning the multithreaded code for better performance within a node with hundreds of CPU/MIC cores, and optimizing communication between nodes, between cores, and between CPUs and MICs. Benchmark cases from model and/or industrial CFD applications are tested on the Tianhe-1A and Tianhe-2 supercomputers to evaluate performance. Among these CFD cases, the maximum number of grid cells reached 780 billion. The tuned solver successfully scales up to half of the entire Tianhe-2 supercomputer with over 1.376 million heterogeneous cores. The test results and performance analysis are discussed in detail.
