Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing
DOI: 10.1145/2568058.2568068

Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips

Abstract: Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation in computed tomography applications. The underlying algorithm is a challenge for vectorization because it consists, a…
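For readers unfamiliar with the kernel under study, the following is a minimal, hypothetical sketch of a back projection inner loop. It is not the paper's RabbitCT code; the matrix layout, names, and the 1/w² distance weighting are assumptions based on the standard algorithm. The irregular, index-driven detector accesses in the bilinear interpolation step are what make SIMD vectorization of this kernel hard.

```cpp
// Hypothetical scalar back projection sketch: each voxel (x, y, z) is
// projected onto the detector through a 3x4 matrix A, the detector image
// is sampled by bilinear interpolation, and the distance-weighted value
// is accumulated into the volume. Bounds checks are omitted for brevity.
void backproject_slice(float* volume, const float* detector, int det_width,
                       const float A[12], int nx, int ny, float z) {
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            float w = A[8] * x + A[9] * y + A[10] * z + A[11];
            float u = (A[0] * x + A[1] * y + A[2] * z + A[3]) / w;
            float v = (A[4] * x + A[5] * y + A[6] * z + A[7]) / w;
            int   iu = static_cast<int>(u), iv = static_cast<int>(v);
            float fu = u - iu, fv = v - iv;
            const float* p = detector + iv * det_width + iu;  // irregular access
            float val = (1 - fv) * ((1 - fu) * p[0]         + fu * p[1])
                      +      fv  * ((1 - fu) * p[det_width] + fu * p[det_width + 1]);
            volume[y * nx + x] += val / (w * w);              // distance weighting
        }
    }
}
```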

Cited by 32 publications (30 citation statements) · References 8 publications

“…This factor is reduced to around 1.55 in the case of WBP due to the much lower computational load (see Supplementary Material, Section S3.2). This is a good achievement since full exploitation of AVX instructions proves to be difficult (Hofmann et al, 2014;Treibig et al, 2013), and this factor is of similar magnitude to the highest reported so far (Mehrotra, 2012).…”
Section: Discussion (supporting)
confidence: 77%
“…Among other improvements, a main distinctive feature of AVX compared to SSE is the fact that they provide support for wider vector data (256-bit), thereby doubling the number of operations at the same time (eight 32-bit single precision floating point operations) and thus pointing to a potential twofold gain in performance. However, exploitation of AVX instructions to approach this optimal speedup factor is not trivial at all and the performance most often falls short of expectations (Hofmann et al, 2014;Treibig et al, 2013).…”
Section: Introduction (mentioning)
confidence: 99%
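To make the width argument in the quotation above concrete, here is a minimal sketch (function and variable names are my own, not from the cited works): a single 256-bit AVX instruction operates on eight single-precision floats, twice the four handled by a 128-bit SSE instruction.

```cpp
#include <immintrin.h>

// One _mm256_add_ps processes eight 32-bit floats per instruction,
// versus four for the 128-bit SSE equivalent _mm_add_ps.
void add_arrays_avx(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));   // 8 adds at once
    }
    for (; i < n; ++i)                                    // scalar remainder
        c[i] = a[i] + b[i];
}
```

Doubling the nominal throughput in this way only translates into a measured 2x when the rest of the pipeline and the memory hierarchy keep up, which is one reason the gains reported for AVX often fall short of expectations.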
“…The instruction was first implemented in Intel multicore CPUs with AVX2 on HSW. The first implementation offered a poor latency (i.e., the time until all data was placed in the vector register) and using hand-written assembly to manually load distributed data into vector registers proved to be faster than using the gather instruction in some cases [10]. Table 3 shows the gather instruction latency for both HSW and BDW.…”
Section: Gather (mentioning)
confidence: 99%
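For reference, the instruction discussed above is exposed to C/C++ through the AVX2 gather intrinsics introduced with Haswell. The sketch below is illustrative only (the names are invented): it loads eight floats from arbitrary, index-driven positions in a single intrinsic, the access pattern produced by back projection's interpolation step.

```cpp
#include <immintrin.h>

// Illustrative AVX2 gather (compile with -mavx2): fetch eight floats from
// table[idx[0..7]] into one 256-bit register. Scale 4 = sizeof(float).
__m256 gather_eight(const float* table, const int* idx) {
    __m256i vindex = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
    return _mm256_i32gather_ps(table, vindex, 4);
}
```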
“…Unfortunately, gather instructions still require a base address from a GPR and do not yet support all data types. Moreover, the current implementation is slower than a simple sequence of several loads [31]. Nonetheless, we can expect that future AVX implementations will provide better support for gathers so that they can be successfully exploited in ELZAR.…”
Section: B. Proposed AVX Instructions (mentioning)
confidence: 98%
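A hypothetical sketch of the alternative this citation alludes to: assembling the vector from eight ordinary scalar loads instead of issuing a hardware gather. On early AVX2 implementations such a plain load sequence could match or beat the dedicated gather instruction.

```cpp
#include <immintrin.h>

// Build the same eight-element vector with plain scalar loads and a set
// intrinsic; the compiler lowers this to a series of loads and inserts.
__m256 gather_by_scalar_loads(const float* table, const int* idx) {
    return _mm256_setr_ps(table[idx[0]], table[idx[1]],
                          table[idx[2]], table[idx[3]],
                          table[idx[4]], table[idx[5]],
                          table[idx[6]], table[idx[7]]);
}
```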