Auto-vectorization of interleaved data for SIMD

Nuzman, Dorit; Rosen, Ira; Zaks, Ayal

doi:10.1145/1133981.1133997

Cited by 115 publications

(27 citation statements)

References 31 publications

“…There has been a rich body of compiler research and development on exploiting SIMD parallelism for modern CPU and GPU cores with rich and powerful SIMD hardware support [1,2,3,4,5,7,10,14] through compiler auto-vectorization. However, the modern SIMD architecture poses new constraints such as data alignment, masking for control flow, non-unit stride access to memory, fixed-length nature of SIMD vectors, even though, there has been a huge effort being made in the past decade to address these challenges to certain extent [1,5,7,13,14], often times, the compiler automatic vectorization would still fail to vectorize application programs or fail to generate optimized SIMD code due to various reasons such as compile-time unknown loop trip count, memory access stride and patterns, alignment and control flow complexity.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, the modern SIMD architecture poses new constraints such as data alignment, masking for control flow, non-unit stride access to memory, fixed-length nature of SIMD vectors, even though, there has been a huge effort being made in the past decade to address these challenges to certain extent [1,5,7,13,14], often times, the compiler automatic vectorization would still fail to vectorize application programs or fail to generate optimized SIMD code due to various reasons such as compile-time unknown loop trip count, memory access stride and patterns, alignment and control flow complexity. When that happens, the programmer would have to perform low-level SIMD intrinsic programming or write inline ASM code in order to utilize SIMD hardware resources effectively [6].…”

Section: Related Work (mentioning)

confidence: 99%

“…To make efficient use of the underlying SIMD hardware, utilizing its wide vector registers and SIMD instructions --Single Instructions operating on Multiple Data elements packed in wide registers such as AltiVec [2], SSE, AVX and MIC, SIMD vectorization plays a key role of converting plain scalar C/C++ code into SIMD code that operating on vectors of data each holding one or more elements [1,2,5,14].…”

Section: Introduction (mentioning)

confidence: 99%

Compiling C/C++ SIMD Extensions for Function and Loop Vectorizaion on Multicore-SIMD Processors

Tian, Saito, Girkar, et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

SIMD vectorization has received significant attention in the past decade as an important method to accelerate scientific applications, media and embedded applications on SIMD architectures such as Intel® SSE, AVX, and IBM AltiVec. However, most of the focus has been directed at loops, effectively executing their iterations on multiple SIMD lanes concurrently relying upon program hints and compiler analysis. This paper presents a set of new C/C++ high-level vector extensions for SIMD programming, and the Intel® C++ product compiler that is extended to translate these vector extensions and produce optimized SIMD instruction sequences of vectorized functions and loops. For a function, our main idea is to vectorize the entire function for callers instead of just vectorizing loops (if any) inside the function. It poses the challenge of dealing with complicated control-flow in the function body, and matching caller and callee for SIMD vector calls while vectorizing caller functions (or loops) and callee functions. Our compilation methods for automatically compiling vector extensions are described. We present performance results of several non-trivial visual computing, computational, and simulation workloads, utilizing SIMD units through the vector extensions on Intel® Multicore 128-bit SIMD processors, and we show that significant SIMD speedups (3.07x to 4.69x) are achieved over the serial execution.

“…Most modern computing platforms have incorporated single-instruction multiple-data (SIMD) extensions into their processors [22] to exploit the natural parallelism of applications if the data can be SIMDized (i.e., if a single instruction can simultaneously operate on a vector of consecutive data). On Cell BE, each vector contains four floating-point numbers that are operated concurrently, and thus the ideal speedup is 4.…”

Section: Single-instruction Multiple-data Parallelism (mentioning)

confidence: 99%

High-order stencil computations on multicore clusters

Liu, Seymour, Nomura, et al. 2009

2009 IEEE International Symposium on Parallel & Distributed Processing

“…It is particularly severe for SIMD units when they need to collect data from multiple strided addresses for parallel processing. Data permutations improving the data layout are an active area of research [15,13]. Intel Array Building Blocks [6] also perform dynamic layout transformations of user data for better parallel processing speed.…”

Section: Introduction (mentioning)

confidence: 99%

Data layout optimization for multi-valued containers in OpenCL

Strzodka 2012

Journal of Parallel and Distributed Computing
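
The strided-access problem named in the statement above can be illustrated with a minimal C sketch. It is not taken from any of the cited papers: the function names `deinterleave2` and `scale_add` and the two-field layout are hypothetical, chosen only to show the idea. The first loop reads an interleaved array with stride 2; the second works on the resulting contiguous arrays with unit stride, which is the access pattern SIMD loads and stores handle most directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split an interleaved array {a0, b0, a1, b1, ...}
   into two contiguous arrays. n is the number of (a, b) pairs. */
void deinterleave2(const float *src, float *a, float *b, size_t n)
{
    assert(src != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        a[i] = src[2 * i];      /* stride-2 read of the even elements */
        b[i] = src[2 * i + 1];  /* stride-2 read of the odd elements  */
    }
}

/* Hypothetical kernel: every access is unit-stride, so consecutive
   iterations touch consecutive memory locations. */
void scale_add(const float *a, const float *b, float *out, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = k * a[i] + b[i];
    }
}
```

Whether a given compiler vectorizes either loop depends on the target, the flags, and what it can prove about aliasing, so this should be read as a picture of the two access patterns and not as a performance claim.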
