Writing Parallel Libraries with MPI - Common Practice, Issues, and Extensions

Hoefler, Torsten; Snir, Marc

doi:10.1007/978-3-642-24449-0_45

Cited by 7 publications

(3 citation statements)

References 19 publications

“…In the contrary, we show that the full potential of partitioning and advanced topology mapping can be provided "under the hood". Our library follows the guidelines for good MPI library design [36] and completely hides all communication and data-distribution functions from the user. Thus, it enables highest performance portability across a wide variety of architectures and arbitrary network topologies.…”

Section: Discussion (mentioning)

confidence: 99%

Productive Parallel Linear Algebra Programming with Unstructured Topology Adaption

Gottschling, Hoefler

2012

2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012)

Self Cite


Abstract: Sparse linear algebra is a key component of many scientific computations such as computational fluid dynamics, mechanical engineering or the design of new materials, to mention only a few. The discretization of complex geometries in unstructured meshes leads to sparse matrices with irregular patterns. Their distribution in turn results in irregular communication patterns within parallel operations. In this paper, we show how sparse linear algebra can be implemented effortlessly on distributed memory architectures. We demonstrate how simple it is to incorporate advanced partitioning, network topology mapping, and data migration techniques into parallel HPC programs by establishing novel abstractions. For this purpose, we developed a linear algebra library, Parallel Matrix Template Library 4, based on generic and meta-programming, introducing a new paradigm: meta-tuning. The library establishes its own domain-specific language embedded in C++. The simplicity of software development is not paid for with lower performance. Moreover, the incorporation of topology mapping demonstrated performance improvements of up to 29%.

I. MOTIVATION

Many scientific simulations, such as computational fluid dynamics, mechanical engineering or the design of new materials, use computations on unstructured grids as their core method (§II-A). The operations are expressed as linear algebra (LA) with sparse matrices. These matrices are very often unstructured, that is, the distribution of non-zero values and the data dependencies of typical operations, such as matrix-vector multiplication, are irregular.

Many large-scale scientific HPC applications can highly benefit from specialized data structures and domain-specific algorithms operating on them. On the other hand, strongly specialized implementations are very expensive to expand for new algorithms and new data structures.

The introduction of PETSc [1] in the 90s provided reusable algorithms and data structures for many applications, leading to a significant increase of productivity in scientific software development. We aim to raise the productivity further with techniques that did not exist yet at the time PETSc was created.

The goal is that the linear algebra library adapts itself to the scientific application instead of applications designed around libraries. Such adaption can be achieved thanks to the expressiveness and efficiency of the template system of C++ [2] [4]. In this work, we focus on the last two, distributing the unstructured matrices and mapping the resulting communication graph to the network topology. Ideally, these tasks are performed without user assistance, leading to convenient libraries that allow developers to program with intuitive abstractions but without sacrificing performance (§II-B-II-F).

Domain-decomposition techniques for structured and unstructured grids have been intensively analyzed and libraries that provide good decompositions are ready for use. In contrast to it, mapping those unstructured grids and their according irregular communication topologies onto sta...


“…MPI works on the principle that nothing is shared between processes unless it is explicitly transported by the programmer. These semantics simplify reasoning about the program's state (Hoefler & Snir, 2011) and avoid complex problems that are often encountered in shared-memory programming models (Lee, 2006) where automatic memory synchronization becomes a significant bottleneck.…”

Section: Related Work (mentioning)

confidence: 99%

Implementing generalized deep-copy in MPI

Whittle, Borgo, Jones

2016

PeerJ Computer Science


In this paper, we introduce a framework for implementing deep copy on top of MPI. The process is initiated by passing just the root object of the dynamic data structure. Our framework takes care of all pointer traversal, communication, copying and reconstruction on receiving nodes. The benefit of our approach is that MPI users can deep copy complex dynamic data structures without the need to write bespoke communication or serialize/deserialize methods for each object. These methods can present a challenging implementation problem that can quickly become unwieldy to maintain when working with complex structured data. This paper demonstrates our generic implementation, which encapsulates both approaches. We analyze the approach with a variety of structures (trees, graphs (including complete graphs) and rings) and demonstrate that it performs comparably to hand written implementations, using a vastly simplified programming interface. We make the source code available completely as a convenient header file.
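The abstract above describes starting from a root object, following its pointers, and reconstructing the structure on the receiver. A minimal sketch of that general idea follows, in Python rather than on top of MPI. It is not the authors' framework: `Node`, `flatten`, and `rebuild` are hypothetical names, JSON stands in for the wire format, and the byte buffer stands in for a message that would be sent between processes.

```python
import json


class Node:
    """Hypothetical pointer-linked object; links may form cycles."""

    def __init__(self, value):
        self.value = value
        self.links = []


def flatten(root):
    """Traverse from root and encode every reachable node into one byte buffer.

    Object references are replaced by integer positions, so shared nodes
    and cycles survive the round trip. The root always gets position 0.
    """
    index = {}  # id(node) -> position in `order`
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in index:
            continue  # already visited (shared node or cycle)
        index[id(node)] = len(order)
        order.append(node)
        stack.extend(node.links)
    records = [
        {"value": n.value, "links": [index[id(m)] for m in n.links]}
        for n in order
    ]
    return json.dumps(records).encode("utf-8")


def rebuild(buffer):
    """Recreate the structure from the buffer and return the new root."""
    records = json.loads(buffer.decode("utf-8"))
    nodes = [Node(r["value"]) for r in records]
    for node, record in zip(nodes, records):
        node.links = [nodes[i] for i in record["links"]]
    return nodes[0]
```

In this sketch the caller passes only the root, for example `copy = rebuild(flatten(root))`, and the traversal finds every other node. A ring of three nodes comes back as a new ring whose last node links to the new root, not to the original one.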


“…Its shared nothing semantics and the SPMD programming simplify reasoning about the program's state and avoid complex problems that are often encountered in shared memory programming models [10]. Composition is achieved through communication contexts (called communicators in MPI) that enable multiple parallel libraries or objects to be combined into a single program without interference [8]. Those features have made MPI the predominant programming model for parallel scientific applications.…”

Section: Introduction (mentioning)

confidence: 99%

Ownership passing

et al. 2013

Self Cite


The number of cores in multi- and many-core high-performance processors is steadily increasing. MPI, the de-facto standard for programming high-performance computing systems, offers a distributed memory programming model. MPI's semantics force a copy from one process' send buffer to another process' receive buffer. This makes it difficult to achieve the same performance on modern hardware as shared-memory programs, which are arguably harder to maintain and debug. We propose generalizing MPI's communication model to include ownership passing, which makes it possible to fully leverage the shared memory hardware of multi- and many-core CPUs to stream communicated data concurrently with the receiver's computations on it. The benefits and simplicity of message passing are retained by extending MPI with calls to send (pass) ownership of memory regions, instead of their contents, between processes. Ownership passing is achieved with a hybrid MPI implementation that runs MPI processes as threads and is mostly transparent to the user. We propose an API and a static analysis technique to transform legacy MPI codes automatically and transparently to the programmer, demonstrating that this scheme is easy to use in practice. Using the ownership passing technique, we see up to 51% communication speedups over a standard message passing implementation on state-of-the-art multicore systems. Our analysis and interface will lay the groundwork for future development of MPI-aware optimizing compilers and multi-core specific optimizations, which will be key for success in current and next-generation computing platforms.
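The contrast drawn in the abstract above, sending a buffer's contents versus passing ownership of the buffer itself, can be illustrated with a small thread-based analogy. This is only a sketch under stated assumptions: it uses Python threads and `queue.Queue`, not MPI, and `transfer` is a hypothetical helper, not the API proposed in the paper.

```python
import queue
import threading


def transfer(pass_ownership):
    """Send one buffer from a sender to a receiver thread.

    With pass_ownership=False the sender enqueues a copy of the data
    (analogous to copy semantics). With pass_ownership=True it enqueues
    the buffer object itself and must not touch it afterwards.
    Returns (receiver_got_same_object, received_bytes).
    """
    channel = queue.Queue()
    received = {}

    def receiver():
        # Blocks until the sender has put a buffer on the channel.
        received["buf"] = channel.get()

    worker = threading.Thread(target=receiver)
    worker.start()

    payload = bytearray(b"halo data")
    if pass_ownership:
        channel.put(payload)             # hand over the buffer, no copy
    else:
        channel.put(bytearray(payload))  # send a copy of the contents

    worker.join()
    same_object = received["buf"] is payload
    return same_object, bytes(received["buf"])
```

In both modes the receiver sees the same bytes. Only in the ownership-passing mode does it work on the sender's original memory, which is what avoids the copy when both ends share an address space.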
