FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking

Chen, Zhezhe; Gao, Qi; Zhang, Wenbin; Qin, Feng

doi:10.1109/sc.2010.27

Cited by 22 publications

(10 citation statements)

References 49 publications


“…Instead of instrumenting each memory access in the MPI library, Profiler tracks data movement operations such as memory copy and network send/receive. This will not affect the detection capability of SyncChecker because the underlying MPI libraries often exploit such coarse-grained operations for transferring messages, i.e., copying out message to an intermediate memory location or directly sending message over the network [24], [62].…”

Section: B Profiler: Collecting Runtime Information (mentioning)

confidence: 99%

“…If no intersection is found, Analyzer simply discards the events of data movements since they are irrelevant to nonblocking communication. Similar technique has been used in our prior work [24], [62]. Otherwise, Analyzer performs the state transition for the identified message buffer based on the error detection state machine in Figure 3.…”

Section: Memory Access Instructions and Memory Management Routines (mentioning)

confidence: 99%
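
The intersection test described in the statement above can be illustrated with a minimal sketch. It assumes that both a data movement event and a pending nonblocking message buffer are described by a start address and a byte length; the names `Region`, `overlaps`, and `relevant_buffers` are hypothetical and do not come from SyncChecker or FlowChecker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous memory range (hypothetical representation)."""
    start: int  # address of the first byte
    size: int   # length in bytes

    @property
    def end(self) -> int:
        # Exclusive end address of the range.
        return self.start + self.size


def overlaps(a: Region, b: Region) -> bool:
    """True if the two half-open ranges share at least one byte."""
    return a.start < b.end and b.start < a.end


def relevant_buffers(event: Region, pending: list[Region]) -> list[Region]:
    """Return the pending message buffers touched by a data movement event.

    An empty result corresponds to the "no intersection" case, in which
    the event would simply be discarded.
    """
    return [buf for buf in pending if overlaps(event, buf)]


pending = [Region(0x1000, 256), Region(0x2000, 64)]
hit = relevant_buffers(Region(0x10F0, 32), pending)   # straddles the first buffer
miss = relevant_buffers(Region(0x1100, 16), pending)  # starts just past its end
```

Only events with a non-empty result would go on to drive a state transition for the matched buffer; the state machine itself is not reproduced here.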


SyncChecker: Detecting Synchronization Errors between MPI Applications and Libraries

Chen et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium

Self Cite


We have implemented a prototype of SyncChecker on Linux and evaluated it with seven bug cases, i.e., five introduced by the original developers and two injected, in four different MPI applications. Our experiments show that SyncChecker detects all the evaluated synchronization errors and provides helpful diagnostic information. Moreover, our experiments with seven NAS Parallel Benchmarks demonstrate that SyncChecker incurs moderate runtime overhead, 1.3-9.5 times with an average of 5.2 times, making it suitable for software testing.


“…As the bug degrades the performance of Allgather but no deadlock is produced, those techniques targeted at temporal progress [5] will not work either. Finally, since there is no break in the message flow of Allgather as all messages are delivered eventually but with a suboptimal algorithm, FlowChecker [12] will not be able to detect this bug. Therefore, Vrisha is a good complement to these existing techniques for detecting subtle scale-dependent bugs in parallel programs.…”

Section: Comparison With Previous Techniques (mentioning)

confidence: 99%
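
The reasoning in the statement above can be made concrete with a deliberately simplified sketch. It assumes a message is identified by a `(source, destination, tag)` tuple and that a "break" means a message recorded as sent but never observed as delivered; the function `undelivered` is hypothetical and is not FlowChecker's actual algorithm.

```python
from collections import Counter


def undelivered(sent, delivered):
    """Return messages that were sent but never observed as delivered.

    Each message is a (source, destination, tag) tuple. Multiset
    subtraction keeps duplicates, so two identical sends need two
    matching deliveries.
    """
    missing = Counter(sent) - Counter(delivered)
    return sorted(missing.elements())


sent = [(0, 1, 7), (0, 2, 7), (1, 2, 9)]

# A lost message shows up as a break in the flow.
broken = undelivered(sent, [(0, 1, 7), (1, 2, 9)])

# If every message eventually arrives, nothing is reported, no matter
# how inefficient the route or algorithm that delivered it.
complete = undelivered(sent, [(1, 2, 9), (0, 2, 7), (0, 1, 7)])
```

A check of this kind is blind to a slow but complete exchange, which is the gap the citing authors describe for the Allgather performance bug.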

“…With respect to bug localization, the requirement is to localize the bug to as small a portion of the code as possible so that the developer can correct the bug. These two motivations have spurred a significant volume of work in the HPC community, with a spurt being observable in the last five years [5,21,23,11,10,17,12]. Unlike prior work, we focus on bugs that manifest as software is scaled up.…”

Section: Introduction (mentioning)

confidence: 99%

Vrisha

Zhou, Kulkarni, Bagchi 2011

Proceedings of the 20th International Symposium on High Performance Distributed Computing


Detecting and isolating bugs that arise in parallel programs is a tedious and a challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.


“…Thus, debuggers are typically restricted to techniques that can be executed sequentially on the front-end node in a reasonable time. Recently, there are notable works, which focus on formal and semi-formal verification of MPI concurrency and message flow checking [23] [24]. However, we only focus on the challenges addressed above.…”

Section: Introduction (mentioning)

confidence: 99%

Assertion Based Parallel Debugging

Dinh, Abramson, Kurniawan et al. 2011

2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing


Programming languages have advanced tremendously over the years, but program debuggers have hardly changed. Sequential debuggers do little more than allow a user to control the flow of a program and examine its state. Parallel ones support the same operations on multiple processes, which are adequate with a small number of processors, but become unwieldy and ineffective on very large machines. Typical scientific codes have enormous multidimensional data structures and it is impractical to expect a user to view the data using traditional display techniques. In this paper we discuss the use of debug-time assertions, and show that these can be used to debug parallel programs. The techniques reduce the debugging complexity because they reason about the state of large arrays without requiring the user to know the expected value of every element. Assertions can be expensive to evaluate, but their performance can be improved by running them in parallel. We demonstrate the system with a case study finding errors in a parallel version of the Shallow Water Equations, and evaluate the performance of the tool on a 4,096 cores Cray XE6.
