For many years GPUs have been components of HPC clusters (e.g., Titan and Piz Daint), while only in recent years has the Intel® Xeon Phi™ been included (e.g., Tianhe-2 and Stampede). For example, GPUs appear in 14% of the systems on the November 2015 Top500 list, while the Xeon Phi™ appears in 6%. Intel® introduced the Xeon Phi™ to compete with NVIDIA GPUs by offering a unified environment that supports OpenMP and MPI, and by providing competitive, easier-to-utilize processing power with lower energy consumption. Achieving maximum Xeon Phi™ execution-time performance requires programs with high data parallelism and good scalability that use parallel algorithms. Moreover, improved Xeon Phi™ power performance and throughput can be achieved by reducing the number of cores employed for application execution. Accordingly, the objectives of this paper are to: (1) Demonstrate that some applications can be executed with fewer cores than are available to users with a negligible impact on execution time: for 59.3% of the 27 application instances studied, doing so results in better performance, and for 37%, using fewer than half of the available cores results in a performance degradation of not more than 10% in the worst case. (2) Develop a tool that provides the user with the optimal number of cores to employ: we designed an algorithm and developed a plugin for the Periscope Tuning Framework, an automatic performance tuner, that for a given application provides the user with an estimate of this number.
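The core-count selection underlying objective (2) can be illustrated with a minimal sketch. This is not the Periscope plugin's actual algorithm; `measure` is a hypothetical callback that runs the application on a given number of cores and returns its execution time, and the 10% tolerance mirrors the worst-case degradation bound cited above:

```python
# Illustrative sketch (assumed logic, not the paper's plugin): choose the
# smallest core count whose measured runtime stays within `tolerance` of the
# best runtime observed across all tested configurations.

def smallest_good_core_count(measure, core_counts, tolerance=0.10):
    # `measure(n)` is a hypothetical callback returning runtime on n cores.
    times = {n: measure(n) for n in core_counts}
    best = min(times.values())
    # Keep configurations within the slowdown tolerance; prefer fewest cores.
    acceptable = [n for n, t in times.items() if t <= best * (1 + tolerance)]
    return min(acceptable)

# Synthetic timings: runtime flattens out beyond 30 cores, so 30 is chosen.
fake_times = {10: 20.0, 20: 12.0, 30: 10.0, 40: 9.8, 50: 9.9, 60: 9.7}
print(smallest_good_core_count(fake_times.get, sorted(fake_times)))  # → 30
```

A real tuner would of course measure runs of the application itself rather than look up synthetic timings, and would likely prune the search rather than test every core count.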
(3) Understand whether performance metrics can be used to identify applications that can be executed with fewer cores with a negligible impact on execution time: via statistical analysis, we identified the following three metrics as indicative of this, at least for the application instances studied: a low L1 compute-to-data-access ratio, i.e., the average number of computations performed per byte of data loaded from or stored to the L1 cache; high use of data bandwidth; and, to a lesser extent, low vectorization intensity.
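The three indicators above can be combined into a simple screening rule. The following sketch is purely illustrative: the paper identifies the metrics, but the threshold values and weighting below are invented assumptions, with vectorization intensity weighted less because it is the weaker indicator:

```python
# Hedged illustration: flag an application as a likely candidate for core
# reduction from three profile metrics. All cutoffs (cda_max, bw_min,
# vec_max) are invented for illustration, not taken from the paper.

def core_reduction_candidate(l1_cda, bandwidth_util, vec_intensity,
                             cda_max=2.0, bw_min=0.7, vec_max=4.0):
    low_cda = l1_cda < cda_max           # few computations per byte through L1
    high_bw = bandwidth_util > bw_min    # high fraction of peak data bandwidth
    low_vec = vec_intensity < vec_max    # few elements per vector instruction
    # Weight vectorization intensity less, reflecting its weaker indication.
    score = 2 * low_cda + 2 * high_bw + 1 * low_vec
    return score >= 4

# A bandwidth-bound, poorly vectorized instance is flagged; a compute-bound,
# well-vectorized one is not.
print(core_reduction_candidate(1.2, 0.85, 3.0))   # → True
print(core_reduction_candidate(8.0, 0.30, 12.0))  # → False
```

In practice such thresholds would be fitted to the measured application instances rather than fixed a priori.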