A SIMD optimization framework for retargetable compilers

Hohenauer, Manuel; Engel, Felix; Leupers, Rainer; Ascheid, Gerd; Meyr, H.

doi:10.1145/1509864.1509866

Cited by 18 publications

(8 citation statements)

References 38 publications

“…They employ a two-phase source-to-source optimization strategy. In [20], the authors have proposed a retargetable SIMD code optimization framework that is integrated into an industrial retargetable C compiler. for AMD processors, VIS for SUN SPARC, AltiVec/VSX for POWER, and NEON for ARM.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Evaluating vector data type usage in OpenCL kernels

Fang

Vărbănescu

Liao

et al. 2014

Concurrency and Computation

Open Computing Language (OpenCL) is an open, functionally portable programming model for a large range of highly parallel processors. To provide users with access to the underlying platforms, OpenCL has explicit support for features such as local memory and vector data types (VDTs). However, these are often low-level, hardware-specific features, which can be detrimental to performance on different platforms. In this paper, we focus on VDTs and investigate their usage in a systematic way. First, we propose two different approaches (inter-vdt and intra-vdt) to use VDTs in OpenCL kernels, and show how to translate scalar OpenCL kernels to vectorized ones. After obtaining vectorized code, we evaluate the performance effects of using VDTs with two types of benchmarks: micro-benchmarks and macro-benchmarks. With micro-benchmarks, we study the execution model of VDTs and the role of the compiler-aided vectorizer on five devices. With macro-benchmarks, we explore the changes of memory access patterns before and after using VDTs, and the resulting performance impact. Our evaluation not only provides insights into how OpenCL's VDTs are mapped on different processors, but also indicates that using such data types introduces changes in both computation and memory accesses. Based on the lessons learned, we discuss how to deal with performance portability in the presence of VDTs.

“…There has been significant recent work in generating effective code for SIMD vector instruction sets in the presence of hardware alignment and stride constraints as described in [12,44,45,31,13]. The difficulties of optimizing for a wide range of SIMD vector architectures are discussed in [29,14]. In addition, several other works have addressed the exploitation of SIMD instruction sets [22,24,23,30,32,31,28].…”

Section: Related Work (mentioning)

confidence: 99%

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures

Henretty

Stock

Pouchet

et al. 2011

Lecture Notes in Computer Science

Stencil computations are at the core of applications in many domains such as computational electromagnetics, image processing, and partial differential equation solvers used in a variety of scientific and engineering applications. Short-vector SIMD instruction sets such as SSE and VMX provide a promising and widely available avenue for enhancing performance on modern processors. However, a fundamental memory stream alignment issue limits achieved performance with stencil computations on modern short SIMD architectures. In this paper, we propose a novel data layout transformation that avoids the stream alignment conflict, along with a static analysis technique for determining where this transformation is applicable. Significant performance increases are demonstrated for a variety of stencil codes on several modern processors with SIMD capabilities.

“…However, only small kernels (max size: 64-point FFT) are investigated and the overall scalability of their solution to larger vector widths and larger kernels is not addressed. The difficulties of optimizing for a wide range of SIMD vector architectures are well explored in [27,16].…”

Section: Related Work (mentioning)

confidence: 99%

Automatic SIMD vectorization of fast Fourier transforms for the Larrabee and AVX instruction sets

McFarlin

Arbatov

Franchetti

et al. 2011

Proceedings of the International Conference on Supercomputing

The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible with AltiVec) both offer 4-way (single precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and often the achievable speed-up by using vectorizing compilers is small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks such as small transpositions and perfect shuffles that are needed in a variety of high performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speedup of 5.5-6.5 for 8-way AVX and 10-12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high performance FFTs for current and future vector architectures.
