Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2013
DOI: 10.1145/2503210.2503272
An early performance evaluation of many integrated core architecture based SGI rackable computing system

Abstract: Intel recently introduced the Xeon Phi coprocessor, based on the Many Integrated Core architecture, featuring 60 cores with a peak performance of 1.0 Tflop/s. NASA has deployed a 128-node SGI Rackable system where each node has two Intel Xeon E5-2670 8-core Sandy Bridge processors along with two Xeon Phi 5110P coprocessors. We have conducted an early performance evaluation of the Xeon Phi. We used microbenchmarks to measure the latency and bandwidth of memory and interconnect, I/O rates, and the performance of Ope…

Cited by 16 publications (14 citation statements). References 14 publications.
“…It should be noted that performance of MPI functions in native MIC mode is 3 to 20 times worse than in native host mode as reported by Saini et al [13]. Poor scalability for BT and SP on MIC is because of load imbalance using the pure MPI paradigm.…”
Section: A. NAS Parallel Benchmarks, 1) MPI Version
Confidence: 84%
“…Applications with significant amounts of MPI communication, especially collective communication, perform very poorly on MIC because the performance of MPI functions is 3 to 20 times slower for intra-MIC and 10 to 60 times slower for inter-MIC communication as compared to host [13]. To reduce MPI communication time, we performed optimization by packing and unpacking the MPI messages.…”
Section: Discussion
Confidence: 99%