2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum 2013
DOI: 10.1109/ipdpsw.2013.231

Compiler-Based Data Prefetching and Streaming Non-temporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor

Cited by 35 publications (29 citation statements)
References 18 publications
“…Furthermore, to exploit the Phi's peak performance, applications must fully utilize all cores and their VPUs by keeping them busy throughout the execution cycle. The data has to be ready at the cores' disposal without delivery delays [29].…”
Section: Accepted Manuscript (mentioning)
confidence: 99%
“…The delay can be introduced with the _mm_delay_32 intrinsic, as shown in Listing 1.2; the parameter specifies the number of idle cycles. Streaming Stores: Streaming stores can reduce barrier overhead [11] when storing notification values to flags in the "Notification Phase" or reinitializing counters in the combining tree or sense-reversing centralized barrier. Listing 1.3 details the implementation of the mic_store function.…”
Section: Hybrid Barrier (mentioning)
confidence: 99%
“…These instructions are hints to the cache hardware, and will not trigger page faults or exceptions if the address supplied is non resident or protected. In [28] a sophisticated strategy is given for making use of these.…”
Section: Pre-fetching (mentioning)
confidence: 99%
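
The statement above describes prefetch instructions as non-faulting hints. A minimal sketch of that idea in C follows, assuming a GCC- or Clang-compatible compiler that provides __builtin_prefetch. The function name sum_with_prefetch and the PREFETCH_DISTANCE value are hypothetical and chosen only for illustration; this is not the strategy of the cited paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance in elements; a real value must be tuned. */
#define PREFETCH_DISTANCE 64

/* Sums n floats while hinting the cache about data further ahead. */
double sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        /* Near the end of the loop the hinted address lies past the array.
           It is formed with integer arithmetic and never dereferenced;
           the prefetch is only a hint and does not fault on such addresses. */
        uintptr_t ahead = (uintptr_t)data
                        + (i + PREFETCH_DISTANCE) * sizeof(float);
        __builtin_prefetch((const void *)ahead, 0 /* read */, 3 /* keep in cache */);
#endif
        total += data[i];
    }
    return total;
}
```

Because the hint never changes program results, the function returns the same sum with or without the prefetch line; only memory latency may differ.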
“…In image processing it cannot be assumed that memory fetches for vector operations will be aligned on 64 byte boundaries as expected by the Intel compilers described in [28]. Image processing routinely involves operations being performed between sub-windows, placed at arbitrary origins within an image.…”
Section: Pre-fetching (mentioning)
confidence: 99%
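
The alignment concern in the statement above can be illustrated with a small sketch, assuming an 8-bit, row-major image and a platform where converting a pointer to uintptr_t yields its linear address. The helper names is_vector_aligned, bytes_to_alignment, and window_origin are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 64-byte boundary mentioned in the statement above. */
#define VECTOR_ALIGNMENT 64u

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % VECTOR_ALIGNMENT) == 0;
}

/* Number of bytes from p up to the next 64-byte boundary (0 if aligned). */
size_t bytes_to_alignment(const void *p)
{
    uintptr_t rem = (uintptr_t)p % VECTOR_ALIGNMENT;
    return rem == 0 ? 0 : (size_t)(VECTOR_ALIGNMENT - rem);
}

/* Address of pixel (x, y) in a row-major 8-bit image with the given stride. */
const uint8_t *window_origin(const uint8_t *image, size_t stride,
                             size_t x, size_t y)
{
    return image + y * stride + x;
}
```

For an image buffer that itself starts on a 64-byte boundary with a 64-byte stride, a sub-window at origin (3, 1) begins 67 bytes into the buffer, so it is not aligned and 61 leading bytes would need scalar or unaligned handling before the next boundary.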