SUMMARY Clustered systems have become a dominant architecture for scalable high-performance supercomputers. In these large-scale computers, network performance and scalability are as critical as the speed of the compute nodes. InfiniBand™ has become a commodity networking solution that supports the stringent latency, bandwidth, and scalability requirements of these clusters. Network performance is also affected by the topology, the packet routing, and the communication patterns that the distributed application exercises. Fat-trees are the topology used to construct most large clusters, as they are scalable, maintain cross-bisectional bandwidth (CBB), and are practical to build from fixed-arity switches. In this paper, we propose a fat-tree routing algorithm that provides a congestion-free all-to-all shift pattern by leveraging InfiniBand's static routing capability. The algorithm supports partially populated fat-trees built with switches of an arbitrary number of ports and arbitrary CBB ratios. To evaluate the proposed algorithm, we developed detailed switch and host simulation models and ran them over multiple fabric topologies. The results of these simulations, as well as measurements on real clusters, show an improvement in all-to-all delay achieved by avoiding congestion on the fabric.
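To make the idea concrete, below is a minimal sketch of destination-based up-port selection on a k-ary fat-tree, in the d-mod-k style that this family of congestion-free static routing schemes builds on. The radix k, the level numbering, and the node naming are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of destination-based up-port selection for a k-ary
# fat-tree (d-mod-k style), illustrating how static routing can give
# each destination a dedicated upward path. The radix `k`, the level
# numbering, and the destination encoding are assumptions for this
# sketch, not the paper's exact algorithm.

def up_port(dest: int, level: int, k: int) -> int:
    """Choose the upward output port for traffic to `dest` at a switch
    on `level` (0 = leaf). Every switch picks the same port for the
    same destination, so flows toward distinct destinations climb
    disjoint up-paths."""
    return (dest // (k ** level)) % k

if __name__ == "__main__":
    # Example: in a k=4 fat-tree, traffic to destination 13 climbs
    # port 1 at the leaf level and port 3 one level up.
    for level in range(3):
        print(level, up_port(13, level, k=4))
```

Because every switch selects the same upward port for a given destination, flows addressed to distinct destinations take disjoint upward paths, which is what allows a shift pattern to remain congestion-free under static routing.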
Abstract: Load balancing techniques (e.g., work stealing) are important for obtaining the best performance from distributed task scheduling systems in which multiple schedulers make scheduling decisions. In work stealing, tasks are randomly migrated from heavily loaded schedulers to idle ones. However, for data-intensive applications, where tasks are dependent and task execution involves processing large amounts of data, migrating tasks blindly yields poor data locality and incurs significant data-transfer overhead. This work improves work stealing by using both dedicated and shared queues: tasks are organized into queues based on their data size and location. We implement our technique in MATRIX, a distributed task scheduler for many-task computing, and leverage a distributed key-value store to organize and scale the task metadata, task dependencies, and data locality. We evaluate the improved work stealing technique with both applications and micro-benchmarks structured as directed acyclic graphs. The results show that the proposed data-aware work stealing technique performs well.
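A minimal sketch of the dedicated/shared queue split is shown below, assuming a simple size threshold and per-task location metadata. The class names, fields, and cutoff are hypothetical, not MATRIX's actual data structures.

```python
# Minimal sketch of data-aware work stealing with a dedicated queue
# (never stolen from) and a shared queue (steal-eligible). The Task
# fields and the 1 MiB threshold are illustrative assumptions.
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    task_id: str
    data_size: int      # bytes of input data the task consumes
    data_location: str  # id of the scheduler holding the largest input

THRESHOLD = 1 << 20     # assumed cutoff: tasks moving >1 MiB stay local

class Scheduler:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.dedicated = deque()  # executed locally only, never stolen
        self.shared = deque()     # eligible for work stealing

    def enqueue(self, task: Task) -> None:
        # Large inputs already on this node: pin the task locally so a
        # steal cannot destroy its data locality.
        if task.data_location == self.node_id and task.data_size > THRESHOLD:
            self.dedicated.append(task)
        else:
            self.shared.append(task)

    def steal_from(self, victim: "Scheduler") -> Optional[Task]:
        # Idle schedulers only take tasks whose migration is cheap.
        return victim.shared.popleft() if victim.shared else None
```

The point of the split is that a steal can only remove tasks whose migration cost is small, so locality for heavyweight tasks is preserved even under aggressive stealing.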
The jellyfish topology, in which switches are connected by a random graph, has recently been proposed for large-scale data-center networks. It has been shown to offer higher bisection bandwidth and better permutation throughput than a fat-tree topology of similar cost. In this work, we propose a new routing scheme for jellyfish that outperforms existing schemes by exploiting path diversity more effectively, and we comprehensively compare the performance of jellyfish and fat-tree topologies on HPC workloads. The results indicate that both topologies offer comparably high performance for HPC workloads on systems that can be realized by 3-level fat-trees with current technology and by the corresponding jellyfish topologies of similar cost. Fat-trees are more effective for smaller systems, while jellyfish is more scalable.
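As a rough illustration of the path diversity such a routing scheme can exploit, the sketch below builds a jellyfish-like random regular graph with networkx and enumerates the first k loop-free shortest paths between two switches. The parameters are illustrative assumptions, and the paper's actual routing scheme is not reproduced here.

```python
# Minimal sketch of path diversity on a jellyfish (random regular
# graph) topology: generate the switch graph, then list the first k
# loop-free paths in increasing length (Yen-style enumeration via
# networkx). Switch count, degree, and endpoints are assumptions.
from itertools import islice
import networkx as nx

def jellyfish(n_switches: int, degree: int, seed: int = 0) -> nx.Graph:
    """Switch-to-switch links form a random regular graph."""
    return nx.random_regular_graph(degree, n_switches, seed=seed)

def k_shortest_paths(g: nx.Graph, src: int, dst: int, k: int):
    """First k loop-free paths between src and dst, shortest first."""
    return list(islice(nx.shortest_simple_paths(g, src, dst), k))

if __name__ == "__main__":
    g = jellyfish(n_switches=32, degree=6)
    for path in k_shortest_paths(g, 0, 17, k=4):
        print(len(path) - 1, path)  # hop count and the route taken
```

Spreading flows over several near-shortest paths, rather than a single shortest one, is what lets routing on a random graph approach its high bisection bandwidth in practice.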
Non-volatile, byte-addressable memory (NVM) has been introduced by Intel in the form of NVDIMMs named Intel® Optane™ DC PMM. This memory module can persist the data stored in it without power. Because its access latency and memory bandwidth differ from those of DRAM, which has been the predominant byte-addressable main memory technology, it expands the memory hierarchy into a hybrid memory system. Optane DC memory modules have up to 8x the capacity of DDR4 DRAM modules, which can expand the byte-addressable space to as much as 6 TB per node, letting many applications scale up their problem sizes. We evaluate the capabilities of this DRAM-NVM hybrid memory system and its impact on High Performance Computing (HPC) applications. We characterize Optane DC against DDR4 DRAM with a STREAM-like custom benchmark and measure the performance of HPC mini-apps such as VPIC, SNAP, LULESH, and AMG under different configurations of Optane DC PMMs. We find that Optane-only executions are slower than DRAM-only and Memory-mode executions: by as little as 2-16% for VPIC and by as much as 6x for LULESH.
CCS Concepts: • Computer systems organization → Heterogeneous (hybrid) systems; • Hardware → Memory and dense storage; • Computing methodologies → Massively parallel and high-performance simulations.
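For context, below is a minimal sketch of a STREAM-like triad kernel that times a = b + scalar * c and reports sustained bandwidth. A real characterization would pin allocations to DRAM or to Optane DC (for instance with numactl or a library such as memkind), which this sketch does not do; the array size and repeat count are illustrative assumptions.

```python
# Minimal sketch of a STREAM-like triad bandwidth measurement. Placement
# on DRAM vs. Optane DC is not controlled here; in practice one would
# pin the process or the allocations to the target memory tier.
import time
import numpy as np

def triad_bandwidth(n: int, repeats: int = 10) -> float:
    """Return best-case GB/s for a = b + scalar * c over n doubles."""
    a = np.zeros(n)
    b = np.random.rand(n)
    c = np.random.rand(n)
    scalar = 3.0
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.multiply(c, scalar, out=a)  # a = scalar * c (no temporary)
        np.add(a, b, out=a)            # a = b + scalar * c
        best = min(best, time.perf_counter() - t0)
    # Actual traffic of the two passes: read c, write a, then
    # read a and b, write a -> five n-element streams of 8 bytes.
    bytes_moved = 5 * n * 8
    return bytes_moved / best / 1e9

if __name__ == "__main__":
    print(f"{triad_bandwidth(n=20_000_000):.1f} GB/s")
```

Running the same kernel with the working set placed on DRAM and then on Optane DC exposes the bandwidth gap that drives the application-level slowdowns reported above.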