Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation 2006
DOI: 10.1145/1133981.1133997
Auto-vectorization of interleaved data for SIMD

Abstract: Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization manipulations. Computations on non-contiguous and especially interleaved data appear in important applications, which can greatly benefit from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an am…
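The abstract's "explicit data reorganization" can be illustrated with a small sketch (names and layout are illustrative, not from the paper): an array of interleaved (re, im) pairs is split into two contiguous arrays, after which ordinary packed SIMD operations apply to each.

```c
#include <stddef.h>

/* Illustrative sketch: deinterleave an array of (re, im) pairs into two
   contiguous arrays -- the kind of data reorganization the abstract
   describes as a prerequisite for packed SIMD operations. */
void deinterleave(const float *ab, float *re, float *im, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        re[i] = ab[2 * i];      /* stride-2 access: even elements */
        im[i] = ab[2 * i + 1];  /* stride-2 access: odd elements  */
    }
}
```

In real SIMD code this reorganization is done in-register with shuffle/permute instructions rather than through memory, which is precisely what makes vectorizing it efficiently nontrivial.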

Cited by 115 publications (27 citation statements) | References 31 publications
“…There has been a rich body of compiler research and development on exploiting SIMD parallelism for modern CPU and GPU cores with rich and powerful SIMD hardware support [1,2,3,4,5,7,10,14] through compiler auto-vectorization. However, modern SIMD architectures pose constraints such as data alignment, masking for control flow, non-unit-stride memory access, and the fixed-length nature of SIMD vectors. Although a great deal of effort over the past decade has addressed these challenges to a certain extent [1,5,7,13,14], compiler auto-vectorization still often fails to vectorize application programs, or fails to generate optimized SIMD code, for reasons such as compile-time-unknown loop trip counts, memory access strides and patterns, alignment, and control-flow complexity.…”
Section: Related Work (mentioning)
confidence: 99%
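The failure modes the excerpt lists can be made concrete with a hedged sketch (function names are illustrative): a loop whose stride is unknown at compile time usually defeats packed vector loads, so a common workaround is to dispatch to a unit-stride specialization the vectorizer can handle.

```c
#include <stddef.h>

/* A runtime `stride` blocks contiguous vector loads, so compilers
   typically leave this loop scalar (or emit slow gathers). */
float sum_strided(const float *x, size_t n, size_t stride)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i * stride];
    return s;
}

/* Unit-stride specialization: consecutive loads, which an
   auto-vectorizer can turn into packed SIMD adds. */
float sum_contiguous(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

Callers that know the stride is 1 would route to `sum_contiguous`, recovering the vectorizable case.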
“…When that happens, the programmer would have to perform low-level SIMD intrinsic programming or write inline assembly code in order to utilize SIMD hardware resources effectively [6].…”
Section: Related Work (mentioning)
confidence: 99%
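The "low-level SIMD intrinsic programming" mentioned above looks roughly like the following sketch, written with x86 SSE intrinsics (this assumes an x86 target; the function name is illustrative, not from any cited paper):

```c
#include <immintrin.h>  /* x86 SSE intrinsics; assumes an x86 target */
#include <stddef.h>

/* Hand-vectorized elementwise add: four floats per instruction in the
   main loop, with a scalar epilogue for the remainder. */
void add_sse(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* unaligned 4-float load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* packed add, store */
    }
    for (; i < n; i++)                            /* scalar remainder */
        c[i] = a[i] + b[i];
}
```

Writing this by hand sidesteps the vectorizer but sacrifices portability, which is exactly the trade-off the excerpt describes.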
“…Most modern computing platforms have incorporated single-instruction multiple-data (SIMD) extensions into their processors [22] to exploit the natural parallelism of applications when the data can be SIMDized (i.e., when a single instruction can operate simultaneously on a vector of consecutive data elements). On the Cell BE, each vector contains four floating-point numbers that are operated on concurrently, so the ideal speedup is 4.…”
Section: Single-Instruction Multiple-Data Parallelism (mentioning)
confidence: 99%
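A portable sketch of the 4-wide execution model described above (names are illustrative): each iteration of the main loop updates four lanes, mirroring the four-float vectors of the Cell BE, and a scalar epilogue handles the leftover elements that cap the real speedup below the ideal factor of 4.

```c
#include <stddef.h>

/* SAXPY (y += a*x) written four lanes per iteration.  On a 4-wide SIMD
   unit the body would map to vector multiply-add instructions. */
void saxpy4(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     /* "vector" body: 4 lanes at once */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)               /* scalar epilogue */
        y[i] += a * x[i];
}
```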
“…It is particularly severe for SIMD units when they need to collect data from multiple strided addresses for parallel processing. Data permutations that improve the data layout are an active area of research [15,13]. Intel Array Building Blocks [6] also performs dynamic layout transformations of user data for better parallel processing speed.…”
Section: Introduction (mentioning)
confidence: 99%
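A common instance of the layout transformations this excerpt refers to (the example and names are illustrative): converting RGB-interleaved pixels into planar per-channel arrays, so that each channel becomes contiguous and amenable to packed SIMD processing.

```c
#include <stddef.h>

/* Layout-improving permutation: gather stride-3 interleaved RGB data
   into three contiguous channel arrays (array-of-structs -> 
   struct-of-arrays). */
void rgb_to_planar(const unsigned char *rgb,
                   unsigned char *r, unsigned char *g,
                   unsigned char *b, size_t npix)
{
    for (size_t i = 0; i < npix; i++) {
        r[i] = rgb[3 * i];      /* stride-3 gather, channel R */
        g[i] = rgb[3 * i + 1];  /* stride-3 gather, channel G */
        b[i] = rgb[3 * i + 2];  /* stride-3 gather, channel B */
    }
}
```

Whether the permutation is done once in memory (as here) or repeatedly in registers is exactly the cost trade-off such layout-transformation work studies.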