Efficient programming of general purpose many-core accelerators poses several challenging problems. The high number of cores available, the peculiarity of the interconnection network, and the complex memory hierarchy organization, all contribute to make efficient programming of such devices difficult. We propose to use parallel design patterns, implemented using algorithmic skeletons, to abstract and hide most of the difficulties related to the efficient programming of many-core accelerators. In particular, we discuss the porting of the FastFlow framework on the Tilera TilePro64 architecture and the results obtained running synthetic benchmarks as well as true application kernels. These results demonstrate the efficiency achieved while using patterns on the TilePro64 both to program stand-alone skeleton-based parallel applications and to accelerate existing sequential code
Composition of Map and Reduce algorithmic skeletons have been widely studied at the end of the last century and it has demonstrated effective on a wide class of problems. We recall the theoretical results motivating the introduction of these skeletons, then we discuss an experiment implementing three algorithmic skeletons, a map, a reduce and an optimized composition of a map followed by a reduce skeleton (map+reduce). The map+reduce skeleton implemented computes the same kind of problems computed by Google MapReduce, but the data flow through the skeleton is streamed rather than relying on already distributed (and possibly quite large) data items. We discuss the implementation of the three skeletons on top of ProActive/GCM in the MareMare prototype and we present some experimental obtained on a COTS cluster.
On the road to computer systems able to support the requirements of exascale applications, Chip Multi-Processors (CMPs) are equipped with an ever increasing number of cores interconnected through fast on-chip networks. To exploit such new architectures, the parallel software must be able to scale almost linearly with the number of cores available. To this end, the overhead introduced by the run-time system of parallel programming frameworks and by the architecture itself must be small enough in order to enable high scalability also for very fine-grained parallel programs. An approach to reduce this overhead is to use non-conventional architectural mechanisms revealing useful when certain concurrency patterns in the running application are statically or dynamically recognized. Following this idea, this paper proposes a run-time support able to reduce the effective latency of inter-thread cooperation primitives by lowering the contention on individual caches. To achieve this goal, the new home-forwarding hardware mechanism is proposed and used by our runtime in order to reduce the amount of cache-to-cache interactions generated by the cache coherence protocol. Our ideas have been emulated on the Tilera TILEPro64 CMP, showing a significant speedup improvement in some first benchmarks
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.