Tobias Hilbrich scite author profile

The widely used Message Passing Interface (MPI) is complex and rich. As a result, application developers require automated tools to avoid and to detect MPI programming errors. We present the Marmot Umpire Scalable Tool (MUST) that detects such errors with significantly increased scalability. We present improvements to our graph-based deadlock detection approach for MPI, which cover complex MPI constructs, as well as future MPI extensions. Further, our enhancements check complex MPI constructs that no previous graph-based detection approach handled correctly. Finally, we present optimizations for the processing of MPI operations that reduce runtime deadlock detection overheads. Existing approaches could require O(p) analysis time per MPI operation, for p processes, whereas our improvements lead to an O(log p) complexity or better for real world applications. We present overhead measurements for two major benchmark suites with up to 1024 cores to demonstrate our improvements for real world scenarios.

MPI Runtime Error Detection with MUST: Advances in Deadlock Detection

Hilbrich, Protze, Schulz, et al. 2013

Scientific Programming

The widely used Message Passing Interface (MPI) is complex and rich. As a result, application developers require automated tools to avoid and to detect MPI programming errors. We present the Marmot Umpire Scalable Tool (MUST) that detects such errors with significantly increased scalability. We present improvements to our graph-based deadlock detection approach for MPI, which cover future MPI extensions. Our enhancements also check complex MPI constructs that no previous graph-based detection approach handled correctly. Finally, we present optimizations for the processing of MPI operations that reduce runtime deadlock detection overheads. Existing approaches often require 𝒪(p) analysis time per MPI operation, for p processes. We empirically observe that our improvements lead to sub-linear or better analysis time per operation for a wide range of real world applications.
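To make the idea of graph-based deadlock detection concrete, here is a minimal sketch of a wait-for graph reduction. It is not MUST's algorithm: it assumes a simplified model in which every blocked process needs all of its targets to make progress (AND semantics only), and it ignores wildcard receives, collectives, and non-blocking operations. The function name `find_deadlocked` and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Set


def find_deadlocked(wait_for: Dict[int, Set[int]]) -> List[int]:
    """Return the sorted ranks that can never be released.

    wait_for maps each *blocked* rank to the set of ranks it waits for.
    Ranks that do not appear as keys are assumed to be running.
    (Hypothetical data layout, AND semantics only.)
    """
    blocked = {rank: set(targets) for rank, targets in wait_for.items()}
    changed = True
    while changed:
        changed = False
        for rank in list(blocked):
            # A rank is released once none of its targets is still blocked.
            if not any(target in blocked for target in blocked[rank]):
                del blocked[rank]
                changed = True
    # Whatever remains lies on a wait-for cycle or depends on one.
    return sorted(blocked)


if __name__ == "__main__":
    # Ranks 0 and 1 wait for each other, rank 4 waits for rank 0,
    # rank 2 waits for rank 3, which is still running.
    print(find_deadlocked({0: {1}, 1: {0}, 2: {3}, 4: {0}}))
```

Note that this naive reduction rescans all blocked ranks on every pass, which is the kind of per-operation cost that grows with the process count p.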

MUST: A Scalable Approach to Runtime Error Detection in MPI Programs

Hilbrich, Schulz, Supinski, et al. 2010

The Message-Passing Interface (MPI) is large and complex. Therefore, programming MPI is error prone. Several MPI runtime correctness tools address classes of usage errors, such as deadlocks or nonportable constructs. To our knowledge none of these tools scales to more than about 100 processes. However, some of the current HPC systems use more than 100,000 cores and future systems are expected to use far more. Since errors often depend on the task count used, we need correctness tools that scale to the full system size. We present a novel framework for scalable MPI correctness tools to address this need. Our fine-grained, module-based approach supports rapid prototyping and allows correctness tools built upon it to adapt to different architectures and use cases. The design uses PnMPI to instantiate a tool from a set of individual modules. We present an overview of our design, along with first performance results for a proof of concept implementation.

GTI: A Generic Tools Infrastructure for Event-Based Tools in Parallel Systems

Hilbrich, Müller, Supinski, et al. 2012

Runtime detection of semantic errors in MPI applications supports efficient and correct large-scale application development. However, current approaches scale to at most one thousand processes and design limitations prevent increased scalability. The need for global knowledge for analyses such as type matching and deadlock detection presents a major challenge. We present a scalable tool infrastructure, the Generic Tool Infrastructure (GTI), that we will use to implement MPI runtime error detection tools and that applies to other use cases. GTI supports simple offloading of tool processing onto extra processes or threads and provides a tree-based overlay network (TBON) for creating scalable tools that analyze global knowledge. We present its abstractions and code generation facilities that ease many hurdles in tool development, including wrapper generation, tool communication, trace reductions, and filters. GTI ultimately allows tool developers to focus on implementing tool functionality instead of the surrounding infrastructure. Further, we demonstrate that GTI supports scalable tool development through a lost message detector and a phase profiler. The former provides a more scalable implementation of important base functionality for MPI correctness checking, while the latter tool demonstrates that GTI can serve as the basis of further types of tools. Experiments with up to 2048 cores show that GTI's scalability features apply to both tools.
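The appeal of a tree-based overlay network can be illustrated with a small reduction sketch. This is not GTI code and uses none of its APIs: it only assumes that each tool node combines the values of up to `fan_in` children before forwarding one result upward, so a global value is reached in a number of levels that grows logarithmically with the process count. The function name `tree_reduce` and its parameters are hypothetical.

```python
from typing import List, Tuple


def tree_reduce(values: List[int], fan_in: int = 2) -> Tuple[int, int]:
    """Sum per-process values level by level through a reduction tree.

    Returns (total, number_of_tree_levels). Each pass of the loop models
    one layer of tool nodes, each combining up to fan_in child values.
    (Hypothetical model of a TBON reduction, not the GTI interface.)
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    layer = list(values)
    levels = 0
    while len(layer) > 1:
        # Every node in the next layer aggregates one group of children.
        layer = [sum(layer[i:i + fan_in]) for i in range(0, len(layer), fan_in)]
        levels += 1
    total = layer[0] if layer else 0
    return total, levels


if __name__ == "__main__":
    # One value per process for 2048 processes, binary tree.
    print(tree_reduce([1] * 2048, fan_in=2))
```

A larger fan-in shortens the tree at the cost of more work per node, which is the basic trade-off when sizing such an overlay.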

A graph based approach for MPI deadlock detection

MPI runtime error detection with MUST: Advances in deadlock detection
