SUMMARY
Intel's Xeon Phi is a highly parallel x86 architecture chip. It has a number of novel features that make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler (VPC) to this architecture and assesses its performance by comparing the Xeon Phi with three other machines running the same algorithms. Copyright © 0000 John Wiley & Sons, Ltd.
CONTEXT
This work was done as part of the EU funded CLOPEMA project, whose aim is to develop a cloth folding robot using real time stereo vision. At the start of the project we used a legacy Java software package, C3D [1], capable of performing the necessary ranging calculations. When processing the robot's modern high resolution images it was prohibitively slow for real time applications, taking about 20 minutes to process a single pair of images. To improve performance, a new Parallel Pyramid Matcher (PPM) was written in Vector Pascal [2] † , using the legacy software as a design basis. The new PPM exploits both SIMD and multi-core parallelism [3]. On commodity PC chips such as the Intel Sandybridge it runs about 20 times faster than the legacy software. With the forthcoming release of the Xeon Phi we anticipated further acceleration by running the same PPM code on that machine, taking advantage of more cores and wider SIMD registers whilst relying on the automatic parallelisation feature of the language. The key step in this would be to modify the compiler to produce Xeon Phi code. However, the Xeon Phi turned out to be considerably more complex than previous Intel platforms, and porting the Glasgow Vector Pascal compiler became an entirely new challenge, requiring a different approach from that used for earlier architectures.
PREVIOUS RELATED WORK
Vector Pascal [4,2] is an array language and as such shares features with other array languages such as APL [5], ZPL [6,7,8] and Single Assignment C [11,12]. The original APL and its descendant J were interpretive languages in which each application of a function to array arguments produced an array result. Whilst it is possible for a compiler to naively use the same approach, it is inefficient, as it leads to the formation of an unnecessary number of array temporaries. This reduces locality of reference and thus cache performance. The key innovation in efficient array language compiler development was Budd's [13] principle of creating a single loop nest for each array assignment and holding intermediate results in scalar temporaries. This principle was subsequently rediscovered by other implementers of data parallel languages or sub-languages [14]. It has been used in the Saarbrucken [15]

Note that the # notation is not supported. Instead, index sets are usually elided, provided that the corresponding positions in the arrays are intended. If offsets are intended, the index sets can be explicitly referred to using the predeclared array of index sets iota. iota[0] ...
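Budd's principle can be illustrated with a small sketch in C (a hypothetical illustration of the two evaluation strategies, not code taken from VPC itself). For an array assignment such as a := b + c*d, the interpreter-style strategy materialises a full array temporary for each operator, whereas the fused strategy generates one loop nest for the whole assignment and keeps intermediate results in scalar temporaries:

```c
#include <assert.h>
#include <stdlib.h>

/* Interpreter-style evaluation of a := b + c*d:
   each operator application produces a full array temporary. */
static void eval_naive(double *a, const double *b, const double *c,
                       const double *d, int n) {
    double *t = malloc((size_t)n * sizeof *t); /* array temporary for c*d */
    assert(t != NULL);
    for (int i = 0; i < n; i++)
        t[i] = c[i] * d[i];
    for (int i = 0; i < n; i++)
        a[i] = b[i] + t[i];
    free(t);
}

/* Budd-style evaluation: a single loop nest for the assignment,
   with the intermediate result held in a scalar temporary. */
static void eval_fused(double *a, const double *b, const double *c,
                       const double *d, int n) {
    for (int i = 0; i < n; i++) {
        double t = c[i] * d[i];   /* scalar temporary, stays in a register */
        a[i] = b[i] + t;
    }
}
```

Both routines compute the same result, but the fused form touches each element of a, b, c and d exactly once, which preserves locality of reference and avoids the heap traffic of the array temporary.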