To use or not to use the SIMD gather instruction?

Habich, Dirk; Pietrzyk, Johannes; Krause, Alexander; Hildebrandt, Juliana; Lehner, Wolfgang

doi:10.1145/3533737.3535089

Cited by 5 publications

(5 citation statements)

References 24 publications


“…However, this guideline does not always hold, as we have experimentally shown in [21]. The outcome of our comprehensive evaluation was that SIMD registers can be populated with data elements from non-consecutive memory locations using GATHER with (almost) the same performance as with data elements from consecutive memory location using LOAD in single-threaded as well as multi-threaded environments.…”

Section: Bounce: Block Concurrent Simd Concept

mentioning

confidence: 93%

“…Thus, the performance will probably be worse compared to the state-of-the-art scaling SIMD approach. To overcome that, there are enough optimization knobs, hence we have taken a closer look at one knob as an example, which we already evaluated in more detail in [21]. To be self-contained, we include a specific evaluation result in this article.…”

Section: Performance Of the Data Access Pattern

mentioning

confidence: 99%


BOUNCE: Memory-Efficient SIMD Approach for Lightweight Integer Compression

Hildebrandt

Habich

Lehner

2022

Preprint

Self Cite


Integer compression plays an important role in columnar database systems to reduce the main memory footprint as well as to speedup query processing. To keep the additional computational effort of (de)compression as low as possible, the powerful Single Instruction Multiple Data (SIMD) extensions of modern CPUs are heavily applied. While a scalar compression algorithm usually compresses a block of N consecutive integers, the state-of-the-art SIMDified implementation scales the block size to k · N with k as the number of elements which could be simultaneously processed in an SIMD register. On the one hand, this scaling SIMD approach improves the performance of (de)compression but can lead to a degradation of the compression ratio compared to the scalar variant on the other hand. Within this article, we analyze this degradation effect for various integer compression algorithms and present a novel SIMD concept to overcome that effect. The core idea of our novel SIMD concept called BOUNCE is to concurrently compress k different blocks of size N within SIMD registers, guaranteeing the same compression ratio as scalar variant. As we are going to show, our proposed SIMD idea works well on various Intel CPUs and may offer a new generalized SIMD concept to optimize further algorithms.
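The trade-off described in this abstract can be illustrated with a small sketch. It assumes a simple per-block fixed bit-width packing scheme, which the abstract does not name, and it ignores per-block headers; the helper names `packed_bits` and `lane_indices` are hypothetical. It compares the payload size for scalar blocks of N values, for scaled blocks of k · N values, and for k blocks of N values handled concurrently, one block per SIMD lane.

```python
def bit_width(value):
    """Bits needed to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def packed_bits(values, block_size):
    """Payload bits when each block is packed with its own maximum bit width."""
    total = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total += len(block) * max(bit_width(v) for v in block)
    return total


def lane_indices(step, n, k):
    """Positions read at one step when lane j walks block j (stride n)."""
    return [j * n + step for j in range(k)]


N, k = 4, 4  # assumed block size and number of SIMD lanes
values = [1, 2, 3, 1,  2, 1, 3, 2,  1000, 3, 2, 1,  1, 1, 2, 3]

scalar_bits = packed_bits(values, N)      # blocks of N values
scaled_bits = packed_bits(values, k * N)  # blocks of k * N values

# Concurrent variant: each lane tracks the bit width of its own block.
lane_widths = [1] * k
for step in range(N):
    for lane, idx in enumerate(lane_indices(step, N, k)):
        lane_widths[lane] = max(lane_widths[lane], bit_width(values[idx]))
concurrent_bits = N * sum(lane_widths)

print(scalar_bits, scaled_bits, concurrent_bits)
```

In this toy input, the single outlier 1000 widens only its own block of N values in the scalar and concurrent variants, whereas it widens all k · N values in the scaled variant. The strided positions produced by `lane_indices` are non-consecutive, which is where a GATHER-style access would come in.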


“…Armejach et al [60] optimized stencil applications for SVE. Habich et al [194] proposed a block-striped data access pattern heavily depending on the Gather operation on GPUs to optimize the overhead of accessing non-consecutive memory locations. Parallelizing stencil applications on GPUs is strongly correlated to SIMD extensions [136], [173], [195]-[198].…”

Section: Stencil Applications

mentioning

confidence: 99%

“…[283]-[285] Xeon Phi [19], [57], [72], [193], [202]- [204], [256], [267], [282], [286], [287], [287] Intel SSE family, AVX family [59], [115], [115], [151], [152], [174], [192], [194], [242], [255], [267], [282], [288] [201], [291], [292] metrics include speedup, scalability, and efficiency (ratio of achieved throughput to peak performance). Furthermore, some evaluations used bandwidth and cache-related performance counters, especially for memory-bounded applications.…”

Section: Target Platform

mentioning

confidence: 99%

MIMD Programs Execution Support on SIMD Machines: A Holistic Survey

Mustafa

Alkhasawneh

Obeidat

et al. 2024

IEEE Access


The Single Instruction Multiple Data (SIMD) architecture, supported by various high-performance computing platforms, efficiently utilizes data-level parallelism. The SIMD model is used in traditional CPUs, dedicated vector systems, and accelerators such as GPUs, vector extensions, and Xeon Phi. It provides performance throughput in computation-intensive and data-parallel applications. Despite the similarity of data-processing principles between these architectures, porting various programming models between the reviewed platforms is challenging. Furthermore, enhancing the programmability of these architectures is an important feature for utilizing their emerging computing power and simplifying programming complexity. This paper reviews the basic principles of optimization techniques to run asynchronous Multiple Instruction Multiple Data (MIMD) on SIMD accelerators. It also surveys several GPU programming paradigms and application programming interfaces (APIs) and classifies these frameworks into different groups based on their criteria. In addition, a review of studies that performed a comparison of the collaborative execution of GPUs with CPUs and Xeon Phi is presented in this paper. This study will be beneficial for developers and researchers in the field of computer architecture and parallel computing of intensive scientific applications, specifically for early-stage high-performance computing researchers, to obtain a brief overview of performance optimization opportunities as well as the challenges of existing SIMD platforms.


“…This article is an extended version of [14]. In particular, this article includes an extensive GATHER evaluation and an additional representative example from columnar database systems compared to [14].…”

mentioning

confidence: 99%

Partition-based SIMD Processing and its Application to Columnar Database Systems

et al. 2022

Self Cite


The Single Instruction Multiple Data (SIMD) paradigm became a core principle for optimizing query processing in columnar database systems. Until now, only the LOAD/STORE instructions are considered to be efficient enough to achieve the expected speedups, while avoiding GATHER/SCATTER is considered almost imperative. However, the GATHER instruction offers a very flexible way to populate SIMD registers with data elements coming from non-consecutive memory locations. As we will discuss within this article, the GATHER instruction can achieve the same performance as the LOAD instruction, if applied properly. To enable the proper usage, we outline a novel access pattern allowing fine-grained, partition-based SIMD implementations. Then, we apply this partition-based SIMD processing to two representative examples from columnar database systems to experimentally demonstrate the applicability and efficiency of our new access pattern.
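To make the idea of a partition-based access pattern more concrete, the following sketch emulates SIMD registers with Python lists. It is only an illustration under stated assumptions, not the authors' implementation: the helpers `load` and `gather`, the lane count `k`, and the choice of a column sum as the workload are all hypothetical, and the column length is assumed to be divisible by `k`. The linear variant fills a register from k consecutive elements, while the partitioned variant splits the column into k partitions and lets each lane walk its own partition through gathered, non-consecutive positions.

```python
def load(data, offset, k):
    """Emulated LOAD: k consecutive elements starting at offset."""
    return data[offset:offset + k]


def gather(data, indices):
    """Emulated GATHER: one element per lane from arbitrary positions."""
    return [data[i] for i in indices]


def sum_linear(data, k):
    """Column sum using consecutive loads; returns the per-lane accumulators."""
    acc = [0] * k
    for offset in range(0, len(data), k):
        reg = load(data, offset, k)
        acc = [a + x for a, x in zip(acc, reg)]
    return acc


def sum_partitioned(data, k):
    """Column sum where lane p processes only partition p."""
    part_len = len(data) // k                 # assumes len(data) % k == 0
    base = [p * part_len for p in range(k)]   # start of each partition
    acc = [0] * k
    for step in range(part_len):
        reg = gather(data, [b + step for b in base])
        acc = [a + x for a, x in zip(acc, reg)]
    return acc


column = list(range(16))
k = 4  # assumed number of SIMD lanes

linear_acc = sum_linear(column, k)
partition_acc = sum_partitioned(column, k)
print(sum(linear_acc), sum(partition_acc))
print(partition_acc)
```

Both variants produce the same total, but in the partitioned variant each accumulator lane holds the result for one contiguous partition of the column, which is what makes fine-grained, per-partition processing possible in this sketch.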
