Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing
DOI: 10.1145/2568058.2568068

Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips

Abstract: Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the efficiency of different SIMD-vectorized implementations of the RabbitCT benchmark. RabbitCT performs 3D image reconstruction by back projection, a vital operation in computed tomography applications. The underlying algorithm is a challenge for vectorization because it consists, a…
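For readers unfamiliar with the kernel under study, the following is a minimal, hypothetical sketch of a back projection inner loop. It is not the paper's RabbitCT code; the matrix layout, names, and the 1/w² distance weighting are assumptions based on the standard algorithm. The irregular, index-driven detector accesses in the bilinear interpolation step are what make SIMD vectorization of this kernel hard.

```cpp
// Hypothetical scalar back projection sketch: each voxel (x, y, z) is
// projected onto the detector through a 3x4 matrix A, the detector image
// is sampled by bilinear interpolation, and the distance-weighted value
// is accumulated into the volume. Bounds checks are omitted for brevity.
void backproject_slice(float* volume, const float* detector, int det_width,
                       const float A[12], int nx, int ny, float z) {
    for (int y = 0; y < ny; ++y) {
        for (int x = 0; x < nx; ++x) {
            float w = A[8] * x + A[9] * y + A[10] * z + A[11];
            float u = (A[0] * x + A[1] * y + A[2] * z + A[3]) / w;
            float v = (A[4] * x + A[5] * y + A[6] * z + A[7]) / w;
            int   iu = static_cast<int>(u), iv = static_cast<int>(v);
            float fu = u - iu, fv = v - iv;
            const float* p = detector + iv * det_width + iu;  // irregular access
            float val = (1 - fv) * ((1 - fu) * p[0]         + fu * p[1])
                      +      fv  * ((1 - fu) * p[det_width] + fu * p[det_width + 1]);
            volume[y * nx + x] += val / (w * w);              // distance weighting
        }
    }
}
```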

Cited by 32 publications (30 citation statements) · References 8 publications

“…This factor is reduced to around 1.55 in the case of WBP due to the much lower computational load (see Supplementary Material, Section S3.2). This is a good achievement since full exploitation of AVX instructions proves to be difficult (Hofmann et al, 2014;Treibig et al, 2013), and this factor is of similar magnitude to the highest reported so far (Mehrotra, 2012).…”
Section: Discussion (supporting)
confidence: 77%
“…Among other improvements, a main distinctive feature of AVX compared to SSE is the fact that they provide support for wider vector data (256-bit), thereby doubling the number of operations at the same time (eight 32-bit single precision floating point operations) and thus pointing to a potential twofold gain in performance. However, exploitation of AVX instructions to approach this optimal speedup factor is not trivial at all and the performance most often falls short of expectations (Hofmann et al, 2014;Treibig et al, 2013).…”
Section: Introduction (mentioning)
confidence: 99%
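To make the width argument in the quotation above concrete, here is a minimal sketch (function and variable names are my own, not from the cited works): a single 256-bit AVX instruction operates on eight single-precision floats, twice the four handled by a 128-bit SSE instruction.

```cpp
#include <immintrin.h>

// One _mm256_add_ps processes eight 32-bit floats per instruction,
// versus four for the 128-bit SSE equivalent _mm_add_ps.
void add_arrays_avx(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));   // 8 adds at once
    }
    for (; i < n; ++i)                                    // scalar remainder
        c[i] = a[i] + b[i];
}
```

Doubling the nominal throughput in this way only translates into a measured 2x when the rest of the pipeline and the memory hierarchy keep up, which is one reason the gains reported for AVX often fall short of expectations.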
“…The instruction was first implemented in Intel multicore CPUs with AVX2 on HSW. The first implementation offered a poor latency (i.e., the time until all data was placed in the vector register) and using hand-written assembly to manually load distributed data into vector registers proved to be faster than using the gather instruction in some cases [10]. Table 3 shows the gather instruction latency for both HSW and BDW.…”
Section: Gather (mentioning)
confidence: 99%
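For reference, the instruction discussed above is exposed to C/C++ through the AVX2 gather intrinsics introduced with Haswell. The sketch below is illustrative only (the names are invented): it loads eight floats from arbitrary, index-driven positions in a single intrinsic, the access pattern produced by back projection's interpolation step.

```cpp
#include <immintrin.h>

// Illustrative AVX2 gather (compile with -mavx2): fetch eight floats from
// table[idx[0..7]] into one 256-bit register. Scale 4 = sizeof(float).
__m256 gather_eight(const float* table, const int* idx) {
    __m256i vindex = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
    return _mm256_i32gather_ps(table, vindex, 4);
}
```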
“…Unfortunately, gather instructions still require a base address from a GPR and do not yet support all data types. Moreover, the current implementation is slower than a simple sequence of several loads [31]. Nonetheless, we can expect that future AVX implementations will provide better support for gathers so that they can be successfully exploited in ELZAR.…”
Section: B. Proposed AVX Instructions (mentioning)
confidence: 98%
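A hypothetical sketch of the alternative this citation alludes to: assembling the vector from eight ordinary scalar loads instead of issuing a hardware gather. On early AVX2 implementations such a plain load sequence could match or beat the dedicated gather instruction.

```cpp
#include <immintrin.h>

// Build the same eight-element vector with plain scalar loads and a set
// intrinsic; the compiler lowers this to a series of loads and inserts.
__m256 gather_by_scalar_loads(const float* table, const int* idx) {
    return _mm256_setr_ps(table[idx[0]], table[idx[1]],
                          table[idx[2]], table[idx[3]],
                          table[idx[4]], table[idx[5]],
                          table[idx[6]], table[idx[7]]);
}
```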