The Blue Gene/Q machine is the next generation in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and sixteen million threads. With each BG/Q node having 68 hardware threads, hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, are ideal and will enable applications to achieve high throughput on BG/Q. Given such unprecedented parallelism and scale, this paper explores the challenges of designing a communication library that can match and exploit that parallelism. In particular, we present the Parallel Active Messaging Interface (PAMI) library as our BG/Q solution to the many challenges that come with a machine at such scale. PAMI provides (1) novel techniques to partition the application communication overhead into many contexts that can be accelerated by communication threads; (2) client and context objects to support multiple, distinct programming paradigms; (3) lockless algorithms to speed up the MPI message rate; and (4) novel techniques that leverage new BG/Q architectural features, such as the scalable atomic primitives implemented in the L2 cache, the highly parallel hardware messaging unit that supports both point-to-point and collective operations, and the hardware acceleration for collective operations such as broadcast, reduce, and allreduce. We ran PAMI experiments on 2,048 BG/Q nodes; the results show high message rates as well as low latencies and high throughput for collective communication operations.
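The message-rate gains from the lockless algorithms come from replacing locks on shared injection state with scalable atomic operations such as the L2 fetch-and-increment. The sketch below is illustrative only: it uses portable C11 atomics as a stand-in for the BG/Q L2 atomic unit, and all names (inject_queue_t, inject, advance_one) are hypothetical rather than part of the PAMI API.

/* Illustrative lockless injection queue: application threads claim slots
 * with an atomic fetch-and-increment (standing in for the BG/Q L2 atomic
 * unit) instead of taking a lock; a communication thread drains the queue.
 * Queue-full handling is omitted for brevity; the structure must be
 * zero-initialized. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SLOTS 1024               /* must be a power of two */

typedef struct {
    _Atomic uint64_t ready;            /* sequence number marking the slot valid */
    void            *descriptor;       /* message descriptor to inject           */
} slot_t;

typedef struct {
    _Atomic uint64_t tail;             /* next slot claimed by producers          */
    uint64_t         head;             /* next slot drained by the comm thread    */
    slot_t           slots[QUEUE_SLOTS];
} inject_queue_t;

/* Called by any application thread: reserve a slot without locking. */
static void inject(inject_queue_t *q, void *desc)
{
    uint64_t ticket = atomic_fetch_add(&q->tail, 1);     /* fetch-and-increment */
    slot_t *s = &q->slots[ticket & (QUEUE_SLOTS - 1)];
    s->descriptor = desc;
    atomic_store_explicit(&s->ready, ticket + 1, memory_order_release);
}

/* Called by the communication thread advancing a context: process one slot. */
static bool advance_one(inject_queue_t *q, void (*send)(void *))
{
    slot_t *s = &q->slots[q->head & (QUEUE_SLOTS - 1)];
    if (atomic_load_explicit(&s->ready, memory_order_acquire) != q->head + 1)
        return false;                  /* nothing ready yet */
    send(s->descriptor);               /* hand the descriptor to the messaging unit */
    q->head++;
    return true;
}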
The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q node has 68 hardware threads. Hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, enable applications to achieve high throughput on BG/Q. In this paper, we present scalable algorithms that optimize MPI collective operations by taking advantage of the various features of the BG/Q torus and collective networks. We achieve an 8-byte double-sum MPI_Allreduce latency of 10.25 μs on 1,572,864 MPI ranks. We accelerate the summing of network packets with local buffers by using the Quad Processing SIMD (QPX) unit in the BG/Q cores and by executing the sums on multiple communication threads supported by the optimized communication libraries. The net gain is a peak throughput of 6.3 GB/s for the double-sum allreduce. We also achieve over 90% of network peak for MPI_Alltoall with 65,536 MPI ranks.
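For reference, the measured allreduce is the standard MPI call sketched below, in which each rank contributes a single double (8 bytes) and receives the global sum. This is an ordinary MPI usage example, not code from the optimized BG/Q stack; the benchmark's timing loop and job sizes are omitted.

/* The operation being measured: one double (8 bytes) summed across all
 * ranks. Compile with any MPI wrapper (on BG/Q, e.g., the IBM XL wrapper). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double local = 1.0, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("sum over all ranks = %f\n", global);

    MPI_Finalize();
    return 0;
}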
The IBM Blue Gene/Q supercomputer has a 5D torus network in which each node is connected to ten bi-directional links. In this paper we present techniques to optimize the MPI_Allreduce collective operation by building ten edge-disjoint spanning trees over the ten torus links. We accelerate the summing of network packets with local buffers by using the Quad Processing SIMD unit in the BG/Q cores and by executing the sums on multiple communication threads created by the PAMI libraries. The net gain is a peak throughput of 6.3 GB/s for the double-precision floating-point sum allreduce, a 3.75x speedup over the collective-network-based algorithm in the product MPI stack on BG/Q.
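The per-node reduction step behind these numbers can be pictured as follows: the arriving packet payload is split into chunks, and each communication thread accumulates its chunk into the local buffer. The code below is a minimal portable sketch under that assumption, using OpenMP threads and scalar doubles in place of the PAMI communication threads and the QPX (quad-wide double-precision) vector instructions used on BG/Q.

/* Minimal sketch: sum an arriving packet payload into a local buffer,
 * splitting the work across threads. On BG/Q the same chunking would be
 * driven by communication threads using QPX vector loads and adds; here
 * OpenMP and scalar doubles stand in for both. */
#include <omp.h>
#include <stddef.h>

void sum_packet_into_local(double *local, const double *packet, size_t n)
{
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();

        /* Contiguous chunk per thread; a QPX version would step four
         * doubles at a time within each chunk. */
        size_t chunk = (n + nthreads - 1) / nthreads;
        size_t begin = (size_t)tid * chunk;
        size_t end   = (begin + chunk < n) ? begin + chunk : n;

        for (size_t i = begin; i < end; i++)
            local[i] += packet[i];
    }
}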