SUMMARY
Intel's Xeon Phi is a highly parallel x86 architecture chip. It has a number of novel features that make it a particularly challenging target for the compiler writer. This paper describes the techniques used to port the Glasgow Vector Pascal Compiler (VPC) to this architecture and assesses its performance by comparing the Xeon Phi with three other machines running the same algorithms. Copyright © 0000 John Wiley & Sons, Ltd.
CONTEXT
This work was done as part of the EU funded CLOPEMA project, whose aim is to develop a cloth folding robot using real time stereo vision. At the start of the project we used a legacy Java software package, C3D [1], capable of performing the necessary ranging calculations. When processing the robot's modern high resolution images it was prohibitively slow for real time applications, taking about 20 minutes to process a single pair of images. To improve performance, a new Parallel Pyramid Matcher (PPM) was written in Vector Pascal [2] † , using the legacy software as a design basis. The new PPM exploits both SIMD and multi-core parallelism [3]. On commodity PC chips such as the Intel Sandybridge it runs about 20 times faster than the legacy software. With the forthcoming release of the Xeon Phi we anticipated further acceleration by running the same PPM code on that machine, taking advantage of more cores and wider SIMD registers whilst relying on the automatic parallelisation feature of the language. The key step in this would be to modify the compiler to produce Xeon Phi code. However, the Xeon Phi turned out to be considerably more complex than previous Intel platforms, and porting the Glasgow Vector Pascal compiler became an entirely new challenge, requiring a different approach from that used for earlier architectures.
PREVIOUS RELATED WORK
Vector Pascal [4,2] is an array language and as such shares features with other array languages such as APL [5], ZPL [6,7,8] and Single Assignment C [11,12]. The original APL and its descendant J were interpretive languages in which each application of a function to array arguments produced an array result. Whilst it is possible for a compiler to naively use the same approach, it is inefficient, as it leads to the formation of an unnecessary number of array temporaries. This reduces locality of reference and thus cache performance. The key innovation in efficient array language compiler development was Budd's [13] principle of creating a single loop nest for each array assignment and holding intermediate results in scalar temporaries. This principle was subsequently rediscovered by other implementers of data parallel languages or sub-languages [14]. It has been used in the Saarbrucken [15]

Note that the # notation is not supported. Instead, index sets are usually elided, provided that the corresponding positions in the arrays are intended. If offsets are intended, the index sets can be explicitly referred to using the predeclared array of index sets iota. iota[0] ...
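Budd's principle can be illustrated with a small sketch in C (a hypothetical illustration of the two evaluation strategies, not code taken from VPC itself). For an array assignment such as a := b + c*d, the interpreter-style strategy materialises a full array temporary for each operator, whereas the fused strategy generates one loop nest for the whole assignment and keeps intermediate results in scalar temporaries:

```c
#include <assert.h>
#include <stdlib.h>

/* Interpreter-style evaluation of a := b + c*d:
   each operator application produces a full array temporary. */
static void eval_naive(double *a, const double *b, const double *c,
                       const double *d, int n) {
    double *t = malloc((size_t)n * sizeof *t); /* array temporary for c*d */
    assert(t != NULL);
    for (int i = 0; i < n; i++)
        t[i] = c[i] * d[i];
    for (int i = 0; i < n; i++)
        a[i] = b[i] + t[i];
    free(t);
}

/* Budd-style evaluation: a single loop nest for the assignment,
   with the intermediate result held in a scalar temporary. */
static void eval_fused(double *a, const double *b, const double *c,
                       const double *d, int n) {
    for (int i = 0; i < n; i++) {
        double t = c[i] * d[i];   /* scalar temporary, stays in a register */
        a[i] = b[i] + t;
    }
}
```

Both routines compute the same result, but the fused form touches each element of a, b, c and d exactly once, which preserves locality of reference and avoids the heap traffic of the array temporary.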