Single-Instruction Multiple-Data Execution

Hughes, Christopher J.

doi:10.2200/s00647ed1v01y201505cac032

Cited by 24 publications

(12 citation statements)

References 68 publications

“…For example, the lengths of SSE, AVX/AVX2 and AVX512 are 128 bits (4 float elements), 256 bits (8 float elements) and 512 bits (16 float elements), respectively. Second, several instructions have been added, notably, gather and scatter instructions [42]. These instructions load/store data of discontinuous positions in memory.…”

Section: CPU Microarchitectures and SIMD Instruction Sets (mentioning)

confidence: 99%
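
The lane counts quoted above follow from dividing the register width by the 32-bit width of a float, and gather/scatter can be read as per-lane indexed loads and stores. The following is a minimal scalar sketch of those semantics, not real SIMD code; the names `lanes`, `gather`, `scatter`, and `REGISTER_BITS` are hypothetical helpers introduced only for illustration.

```python
# Illustrative scalar model; these names are hypothetical, not a real SIMD API.
FLOAT_BITS = 32  # width of one single-precision element

# Register widths as stated in the citation statement above.
REGISTER_BITS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}


def lanes(register_bits, element_bits=FLOAT_BITS):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


def gather(memory, indices):
    """Load one element per lane from arbitrary, non-contiguous positions."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store one element per lane to arbitrary, non-contiguous positions."""
    for i, v in zip(indices, values):
        memory[i] = v


lane_counts = {name: lanes(bits) for name, bits in REGISTER_BITS.items()}

memory = [float(x) for x in range(32)]
indices = [0, 5, 9, 14, 18, 23, 27, 31]  # 8 lanes, as in a 256-bit register
loaded = gather(memory, indices)
scatter(memory, indices, [-v for v in loaded])  # negate only the gathered slots
```

On hardware, each of these list operations would correspond to a single vector instruction; the Python loops only model the result, not the performance.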

“…Single-instruction, multiple-data (SIMD) [41] instruction sets in vector processing units are especially changed. The evolution of the SIMD instructions has taken the form of the increased vector length [42], increased number of types of instructions and decreased latency of instructions. Therefore, it is essential to use SIMD instructions effectively for extracting CPU performance.…”

Section: Introduction (mentioning)

confidence: 99%

Effective Implementation of Edge-Preserving Filtering on CPU Microarchitectures

2018

In this paper, we propose acceleration methods for edge-preserving filtering. The filters natively include denormalized numbers, which are defined in IEEE Standard 754. The processing of the denormalized numbers has a higher computational cost than normal numbers; thus, the computational performance of edge-preserving filtering is severely diminished. We propose approaches to prevent the occurrence of the denormalized numbers for acceleration. Moreover, we verify an effective vectorization of the edge-preserving filtering based on changes in microarchitectures of central processing units by carefully treating kernel weights. The experimental results show that the proposed methods are up to five-times faster than the straightforward implementation of bilateral filtering and non-local means filtering, while the filters maintain the high accuracy. In addition, we showed effective vectorization for each central processing unit microarchitecture. The implementation of the bilateral filter is up to 14-times faster than that of OpenCV. The proposed methods and the vectorization are practical for real-time tasks such as image editing.
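
To make the denormalized-number issue concrete, the sketch below shows how a Gaussian kernel weight can underflow into the subnormal range and how flushing such weights to zero avoids it. This is an assumption-laden illustration, not the method of the paper: the abstract does not say how the authors prevent denormals, the function names are hypothetical, and Python floats are double precision, whereas an image filter would more likely use single precision.

```python
import math
import sys

# Smallest positive *normal* double; nonzero magnitudes below it are subnormal.
MIN_NORMAL = sys.float_info.min


def is_subnormal(x):
    """True if x is a nonzero value below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def gaussian_weight(distance, sigma):
    """Plain Gaussian kernel weight exp(-d^2 / (2 sigma^2))."""
    return math.exp(-(distance * distance) / (2.0 * sigma * sigma))


def safe_gaussian_weight(distance, sigma):
    """Same weight, but flushed to zero if it would be subnormal (assumed fix)."""
    w = gaussian_weight(distance, sigma)
    return 0.0 if w < MIN_NORMAL else w


far = gaussian_weight(38.0, 1.0)        # exp(-722): tiny but nonzero
near = gaussian_weight(1.0, 1.0)        # exp(-0.5): an ordinary normal number
flushed = safe_gaussian_weight(38.0, 1.0)
```

Flushing changes each affected weight by less than the smallest normal number, which is why this kind of fix can remove the slow path without visibly changing the filter output.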

“…This paper proposes a complex variable DSSE solver that supports all features of industrial estimation, including PMU measurements. The implementation is in vectorized code, i.e., it employs loop unrolling and exploits the power of modern processors that possess single instruction multiple data (SIMD) extensions [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…The compactness of the complex matrix expressions is translated into a computer code that is easily readable, and therefore, more compliant to maintenance and upgrades. More importantly, the framework of complex variable solution is naturally suited to the implementation on modern processors that support single instruction multiple data (SIMD) operations [9], e.g., the fused multiply-accumulate complex variable operations. The DSSE solver in this article is implemented using advanced vector extensions (AVX-2) [30], which benefits from the latest version of code vectorization.…”

Section: Introduction (mentioning)

confidence: 99%

Complex Variable Multi-phase Distribution System State Estimation Using Vectorized Code

Džafić, Jabr, Hrnjic

2020

Journal of Modern Power Systems and Clean Energy

With the advent of advanced energy management systems in distribution systems, there is a growing interest in rapid and reliable code for distribution system state estimation (DSSE) in large-scale systems. Fast DSSE methods employed in the industry are based on load scaling as they are well suited to the abundance of pseudo-measurements. Due to the paucity of real-time measurements in DSSE, phasor measurement units (PMUs) have been proposed as a potential solution to increase the estimation accuracy. However, load scaling methodologies are not extendable for exploiting PMUs. This paper proposes a high-performance DSSE method that can handle the PMUs together with all common measurement types in industrial DSSE. By using Wirtinger calculus, the method operates entirely in complex variables and employs the latest version of advanced vector extensions (AVX-2) to reap the maximum potential of computer processing units. The paper highlights the derivation of complex DSSE in matrix form, from which one can infer the implications on code reliability and maintenance. Numerical results are reported on large-scale multi-phase distribution systems, and they are contrasted with a publicly available code for DSSE in real variables. The simulation results show that loop unrolling in AVX-2 contributes about a twofold increase in the solving speed.
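
The loop unrolling mentioned in the abstract can be sketched in general terms as follows. This is a structural illustration only, under the assumption of a simple dot-product kernel; it is not the authors' solver, the function name `dot_unrolled4` is hypothetical, and Python itself gains no speed from this form. The point is the shape: four independent accumulators in the main loop, then a remainder loop for leftover elements.

```python
def dot_unrolled4(a, b):
    """Dot product with the loop body unrolled four times (illustrative)."""
    n = len(a)
    main = n - n % 4  # largest multiple of 4 not exceeding n

    # Independent accumulators remove the dependency between iterations,
    # which is what lets a compiler or CPU map them onto vector lanes.
    s0 = s1 = s2 = s3 = 0
    for i in range(0, main, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]

    total = s0 + s1 + s2 + s3

    # Remainder loop handles the last n % 4 elements.
    for i in range(main, n):
        total += a[i] * b[i]
    return total


a = list(range(10))
b = list(range(10, 20))
result = dot_unrolled4(a, b)
```

With floating-point inputs, the unrolled version sums in a different order than a plain loop, so results can differ in the last bits; the integer inputs here avoid that.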

“…Vectorized programming, however, requires harder constraints than parallel programming in data structures. Vendor's short SIMD architectures, such as MMX, Streaming SIMD Extensions (SSE), Advanced Vector Extensions (AVX)/AVX2, AVX-512, AltiVec, and NEON, are expected to develop rapidly, and vector lengths will become longer [4]. SIMD instruction sets are changed by the microarchitecture of the CPU.…”

mentioning

confidence: 99%

Taxonomy of Vectorization Patterns of Programming for FIR Image Filters Using Kernel Subsampling and New One

2018

This study examines vectorized programming for finite impulse response image filtering. Finite impulse response image filtering occupies a fundamental place in image processing, and has several approximated acceleration algorithms. However, no sophisticated method of acceleration exists for parameter adaptive filters or any other complex filter. For this case, simple subsampling with code optimization is a unique solution. Under the current Moore’s law, increases in central processing unit frequency have stopped. Moreover, the usage of more and more transistors is becoming insuperably complex due to power and thermal constraints. Most central processing units have multi-core architectures, complicated cache memories, and short vector processing units. This change has complicated vectorized programming. Therefore, we first organize vectorization patterns of vectorized programming to highlight the computing performance of central processing units by revisiting the general finite impulse response filtering. Furthermore, we propose a new vectorization pattern of vectorized programming and term it as loop vectorization. Moreover, these vectorization patterns mesh well with the acceleration method of subsampling of kernels for general finite impulse response filters. Experimental results reveal that the vectorization patterns are appropriate for general finite impulse response filtering. A new vectorization pattern with kernel subsampling is found to be effective for various filters. These include Gaussian range filtering, bilateral filtering, adaptive Gaussian filtering, randomly-kernel-subsampled Gaussian range filtering, randomly-kernel-subsampled bilateral filtering, and randomly-kernel-subsampled adaptive Gaussian filtering.
