SIMD divergence optimization through intra-warp compaction

Vaidya, Aniruddha S.; Shayesteh, Anahita; Woo, Dong Hyuk; Saharoy, Roy; Azimi, Mani

doi:10.1145/2485922.2485954

Cited by 27 publications

(8 citation statements)

References 22 publications


“…The main issue with predicated execution in SIMD architectures is its low energy efficiency. Measured mask density 2 is between 18-20% on typical benchmarks [23], [24], [25]. This means that sparse predicated masks are common on modern codes.…”

Section: The Divergence Control Flow Problem (mentioning)

confidence: 99%

“…The hard timeout policy is implicit in every scenario. In the x-axis the number of cycles for each timeout policy control flow divergence [23], [24] does. Moreover, Vaidya et al [24] also demonstrate that the true-value position inside the mask register leads to no variability in performance.…”

Section: Benchmarks (mentioning)

confidence: 99%

“…In the x-axis the number of cycles for each timeout policy control flow divergence [23], [24] does. Moreover, Vaidya et al [24] also demonstrate that the true-value position inside the mask register leads to no variability in performance. For the sake of clarity we will omit the combination possibilities of the true-value positions inside the mask.…”

Section: Benchmarks (mentioning)

confidence: 99%

“…Vaidya et al [24] propose two micro-architectural techniques to improve the performance of predicated instructions in GPUs. They rely on the fact that the VL is usually multiple of the number of hardware execution units (or ALU-width).…”

Section: Related Work (mentioning)

confidence: 99%
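
The citation statements above describe predicated vector instructions with sparse masks running on hardware whose vector length (VL) is a multiple of the ALU width. A minimal cycle-count sketch in Python can make that concrete. It is a simplified illustrative model, not the mechanism of the cited paper; the function names and the example mask are hypothetical.

```python
import math

def cycles_baseline(mask, alu_width):
    # Every group of alu_width lanes is issued, whatever the mask says.
    return math.ceil(len(mask) / alu_width)

def cycles_skip_empty_groups(mask, alu_width):
    # A group is skipped only when all of its lanes are inactive.
    groups = [mask[i:i + alu_width] for i in range(0, len(mask), alu_width)]
    return sum(1 for group in groups if any(group))

def cycles_fully_compacted(mask, alu_width):
    # Idealized bound: active lanes are packed densely before issue.
    return math.ceil(sum(mask) / alu_width)

# Hypothetical 16-lane instruction on a 4-wide ALU with 3 active lanes.
mask = [1, 0, 0, 0,  0, 0, 0, 0,  1, 0, 0, 0,  0, 1, 0, 0]
density = sum(mask) / len(mask)
print(density,
      cycles_baseline(mask, 4),
      cycles_skip_empty_groups(mask, 4),
      cycles_fully_compacted(mask, 4))
```

In this model the position of the active lanes matters only for the group-skipping variant; the baseline and the idealized compacted count depend on VL and on the number of active lanes, respectively.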


Compiler-Assisted Compaction/Restoration of SIMD Instructions

Cebrian, Balem, Barredo, et al. 2022

IEEE Trans. Parallel Distrib. Syst.


Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This paper proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements on scenarios when it would fail to form dense instructions. 
Simulated results (gem5) show that CACR improves performance by up to 29% and reduces dynamic energy by up to 24.2% on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6% performance and 14% energy improvements for the same configuration and applications.
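
The compaction/restoration idea described in this abstract can be sketched in software. The Python below is a functional illustration only, assuming a fixed vector width and a per-element operation `op`; the function name `compact_and_restore` and its arguments are hypothetical and do not come from the paper, which describes a hardware mechanism.

```python
def compact_and_restore(values, masks, op, width):
    """values, masks: one list of `width` lanes per loop iteration."""
    # Compaction: gather every active element with its origin (iteration, lane).
    active = [(it, lane, values[it][lane])
              for it in range(len(values))
              for lane in range(width)
              if masks[it][lane]]

    results = [list(row) for row in values]  # inactive lanes keep their values
    dense_issued = 0
    for start in range(0, len(active), width):
        chunk = active[start:start + width]
        dense = [op(v) for _, _, v in chunk]  # stands in for one dense instruction
        dense_issued += 1
        # Restoration: scatter results back to the original iteration and lane.
        for (it, lane, _), result in zip(chunk, dense):
            results[it][lane] = result
    return results, dense_issued

# Hypothetical input: four iterations, each with a single active lane.
values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
masks = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
restored, issued = compact_and_restore(values, masks, lambda x: x * 10, 4)
print(restored, issued)
```

Here four sparse predicated instructions collapse into a single dense one. The sketch ignores the cases the abstract identifies as costly, such as waiting for active elements that never arrive because of limited unrolling resources or inter-loop dependencies.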


“…To solve the problem of low performance of parallel data calculations [12] in industrial power applications, it is proposed to add a Vector processor hardware implementation named VPU in the TS800, which can support adaptive controllers Reinforcement learning and learning-based [13] underlying algorithm requirements. This design can support FFT and IFT of 64~4096 point [14].…”

Section: Introduction Risc-v (mentioning)

confidence: 99%

Design of a High Performance Vector Processor Based on RISIC-V Architecture

Han, Liu, Zhang, et al. 2023

J. Phys.: Conf. Ser.


This paper proposes a high performance Vector processor based on the high performance Embedded Core which is named TS800. The TS800 is a 4-core processor based on RISC-V architecture, implements IMAFDV instruction set, supports L2 Cache, branch prediction, sequential pipeline, and dual-issue structure. The traditional CPU mainly supports Scalar calculations, or only supports Vector calculations. For applications such as image and signal processing, there are a large number of data parallel computing operations. To solve the problem of low performance of parallel data calculations in industrial power applications, it is proposed to add VPU hardware implementation in the TS800. The TS800 can support FFT algorithm, adaptive controllers Reinforcement learning and learning-based underlying algorithm requirements. In this paper, the module and data flow between each processing unit and the control circuit, that is, the hardware realization of VPU module are proposed. Large-area units such as float arithmetic, multiplication and division are multiplexed with the Scalar operator in the CPU, while the control circuit is placed in the VPU-ALU, and the area is small. Units such as arithmetic and logic operation instructions, shift operation instructions, comparison operation instructions, and permutation instruction are implemented through the VPU-ALU, which makes the overall design area smaller and the performance better. At the same time, through the fir, fft, conv, matrix, Signal Converge and variance test, it is proved that while executing the same program, the running time of the cpu only with Scalar is 1.44 to 9.55 times that of the CPU with Vector module, which can support the underlying algorithm of the adaptive controller.
