“…There is a rich body of compiler research and development on exploiting SIMD parallelism through auto-vectorization for modern CPU and GPU cores with powerful SIMD hardware support [1,2,3,4,5,7,10,14]. However, modern SIMD architectures impose constraints such as data alignment, masking for control flow, non-unit-stride memory access, and the fixed-length nature of SIMD vectors. Although considerable effort over the past decade has addressed these challenges to some extent [1,5,7,13,14], compiler auto-vectorization still often fails to vectorize application programs, or fails to generate optimized SIMD code, for reasons such as loop trip counts unknown at compile time, memory access strides and patterns, alignment, and control-flow complexity.…”
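
As an illustration that is not taken from the quoted work, the hypothetical C kernel below combines several of the obstacles the passage lists. The function name `scale_positive` and its parameters are assumptions made up for this sketch. Whether a given compiler vectorizes it depends on the compiler, its flags, and the target, so treat it as an example of the pattern rather than a guaranteed failure case.

```c
#include <stddef.h>

/* Hypothetical kernel (illustrative only, not from the quoted paper).
 * It shows four properties that commonly hinder auto-vectorization:
 *   - n is known only at run time (unknown trip count);
 *   - stride is known only at run time (possibly non-unit-stride loads);
 *   - the store is conditional (needs masking or a blend in SIMD form);
 *   - dst and src carry no alignment or no-alias guarantees.
 */
void scale_positive(float *dst, const float *src,
                    size_t n, size_t stride, float k)
{
    for (size_t i = 0; i < n; i++) {
        float v = src[i * stride];   /* strided load */
        if (v > 0.0f)                /* control flow inside the loop body */
            dst[i] = v * k;          /* store happens only on some iterations */
    }
}
```

A compiler can only emit SIMD code for this loop if it can handle the remainder iterations, gather or otherwise realize the strided loads, mask the conditional store, and either prove that `dst` and `src` do not overlap or insert a run-time check.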