Heterogeneous systems provide new opportunities to increase the performance of parallel applications on clusters with CPU and GPU architectures. Currently, applications that utilize GPU devices run their device-executable code on local devices in their respective hosting-nodes. This paper presents a package for running OpenMP, C++ and unmodified OpenCL applications on clusters with many GPU devices. This Many GPUs Package (MGP) includes an implementation of the OpenCL specification and extensions of the OpenMP API that allow applications on one hosting-node to transparently utilize cluster-wide devices (CPUs and/or GPUs). MGP reduces the complexity of programming and running parallel applications on clusters, providing scheduling based on task dependencies as well as buffer management. The paper presents MGP and evaluates the performance of its internals.
Management of forthcoming exascale clusters requires frequent collection of run-time information about the nodes and the running applications. This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. We describe the details of resilient gossip algorithms for sharing local information within subsets of nodes and for sending global information to a master, which holds information on all the nodes. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol. The paper gives formal expressions for approximating the average ages of the local information at each node and of the information collected by the master. It then shows that these results closely match the results of simulations and measurements on a real cluster. The paper also investigates the resilience of the algorithms and the impact on the average age when nodes or masters fail. The main outcome of this paper is that partitioning of large clusters can improve the quality of information available to the management system without increasing the number of messages per node.

The push algorithm

In the following algorithm, colony nodes share information and push (send) global windows of information to the master.

Push algorithm (colonies send information to the master). At a fixed point every unit of time, each colony node:
1. Updates its vector and immediately sends a local window with all its vector entries whose current age does not exceed T to another node in its colony, chosen randomly with a uniform distribution.
2. With probability k/n (where k is the intended average update rate), updates its vector and sends a global window to the master.

When a colony node receives a local window, it:
1. Adjusts the window for network latency.
2. Replaces each vector entry with the received window entry, if the latter is newer.
3. Registers the arrival time in the replaced vector entries, using its local clock.

When the master receives a global window, it:
1. Adjusts the window for network latency.
2. Registers the window's arrival time on all the received entries, using its local clock.
3. Updates each of its entries with the corresponding received window entry, if the latter is newer.

The pull algorithm

In this algorithm, colony nodes share information, while the master regularly pulls (requests) global windows of information from one or a few randomly selected nodes in each colony.

Pull algorithm (master requests information from each colony). At a fixed point every unit of time, each colony node:
1. Updates its vector and immediately sends a local window with all its vector entries whose current age does not exceed T to another node in its colony, chosen randomly with a uniform distribution.

When a colony node receives a local window, it:
1. Adjusts the window for network latency.
2. Replaces each vector entry with the received window entry, if the latter is newer.
3. Registers the arrival time in the replaced vector entries…
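The push algorithm above can be sketched as a small discrete-time simulation of a single colony. This is a minimal illustration, not the paper's implementation: the node count, the window threshold T, the update rate k, the synchronous round structure, and all function names are assumptions, and the network-latency adjustment step is omitted for brevity.

```python
import random

def gossip_round(vectors, master, T, k, now):
    """One synchronous round of the push algorithm over a single colony.

    vectors[i][j] holds the timestamp of node i's newest information
    about node j; master[j] holds the master's newest timestamp for j.
    """
    n = len(vectors)
    for i in range(n):
        vectors[i][i] = now                       # node updates its own entry
        # local window: all entries whose current age does not exceed T
        window = {j: t for j, t in vectors[i].items() if now - t <= T}
        peer = random.choice([p for p in range(n) if p != i])
        for j, t in window.items():               # receiver keeps the newer entry
            if t > vectors[peer][j]:
                vectors[peer][j] = t
        if random.random() < k / n:               # push a global window to master
            for j, t in vectors[i].items():
                if t > master.get(j, -1):
                    master[j] = t

def simulate(n=32, T=4, k=2, rounds=50, seed=0):
    """Run the push algorithm and return the average age at the master."""
    random.seed(seed)
    vectors = [{j: 0 for j in range(n)} for _ in range(n)]
    master = {}
    for now in range(1, rounds + 1):
        gossip_round(vectors, master, T, k, now)
    return sum(rounds - t for t in master.values()) / n
```

With k set to the intended average update rate, each round roughly k colony nodes push a global window, so the master's expected message load per time unit stays constant as the colony grows, which is the scalability property the abstract describes.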
GPUs play an increasingly important role in high-performance computing. While developing naive code is straightforward, optimizing massively parallel applications requires deep understanding of the underlying architecture. The developer must struggle with complex index calculations and manual memory transfers. This article classifies memory access patterns used in most parallel algorithms, based on Berkeley's Parallel "Dwarfs." It then proposes the MAPS framework, a device-level memory abstraction that facilitates memory access on GPUs, alleviating complex indexing using on-device containers and iterators. This article presents an implementation of MAPS and shows that its performance is comparable to carefully optimized implementations of real-world applications.