Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach

Naishlos, Dorit; Nuzman, Joseph; Tseng, Chau‐Wen; Vishkin, Uzi

doi:10.1007/s00224-003-1086-6

Cited by 22 publications

(38 citation statements)

References 27 publications


“…To evaluate the RAP algorithm, we are using XMT, a general-purpose manycore architecture [23]. A recent study showed that when configured to use the same chip area, XMT can outperform both an Intel Core 2 (speedups up to 13.83x [6]), AMD Opteron (speedups up to 8.56x [36]) and also an NVIDIA GTX280 GPU (speedups of up to 8.10x [5] on irregular workloads).…”

Section: Fig. 1, Miss Handling Architecture (MHA) for a Banked Cache (mentioning)

confidence: 99%

“…The compiler can insert several short independent work units (or tasks) in a loop within a coarser task, effectively enabling the use of loop prefetching, at the possible cost of a less load-balanced execution. This compiler technique, called thread clustering [23], allowed us to evaluate the loop prefetching algorithm on all our benchmarks.…”

Section: Additional Optimizations (mentioning)

confidence: 99%

“…[23]) is improving single-task performance through parallelism. XMT was designed from the ground up to capitalize on the huge on-chip resources becoming available with new fabrication technologies.…”

Section: The XMT Framework (mentioning)

confidence: 99%


Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

Caragea, Tzannes, Keceli, et al. 2011. Int J Parallel Prog

Self Cite


Abstract. Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs), where the per-core power and area budget favors numerous lighter cores with fewer resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported and is optimized for that number. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks, and the state-of-the-art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.


“…We assume uniform traffic pattern, which is expected for the memory architecture described in [16], due to the use of a hashing mechanism [2,4,10,15].…”

Section: Cycle-accurate Validation (mentioning)

confidence: 99%

“…The XMT architecture eliminates local private caches in order to avoid cache coherence issues and uses hashing mechanism to avoid hot spots [16]. This dramatically increases the load on the interconnection network and makes the network traffic reasonably uniform, rendering the current interconnection networks ineffective.…”

Section: Impact on Single-chip Parallel Processing (mentioning)

confidence: 99%