Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques 2012
DOI: 10.1145/2370816.2370848

Probabilistic diagnosis of performance faults in large-scale parallel applications

Abstract: Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control but little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. Our novel, highly scalable tool helps developers to understand and to fix performance failures and correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and…

Citations: cited by 18 publications (15 citation statements)
References: 33 publications
“…To overcome these challenges, Laguna et al [29] used Markov models (MMs) as a compact, scalable summary of the dynamic execution history. They create states in the MM by intercepting each MPI function call, and by capturing the call stack before and after the actual call to the underlying MPI runtime (through an PMPI function call).…”
Section: Markov Models As a Scalable Summary Of Execution (citation type: mentioning)
confidence: 99%
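
The state-and-transition idea in the statement above can be illustrated with a minimal sketch. It assumes each state is just a text label standing in for "call stack before or after an MPI call"; the function name `build_markov_model` and the sample trace are hypothetical and are not taken from the cited tool, which intercepts real MPI calls rather than reading a Python list.

```python
from collections import Counter, defaultdict


def build_markov_model(trace):
    """Estimate transition probabilities from an ordered list of state labels.

    Assumption: `trace` is one task's execution history, already reduced to
    hashable state labels (for example, a call-stack signature).
    """
    counts = defaultdict(Counter)
    # Count every observed transition between consecutive states.
    for src, dst in zip(trace, trace[1:]):
        counts[src][dst] += 1

    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        # Normalize counts so outgoing probabilities of each state sum to 1.
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


# Hypothetical trace of one task; labels are illustrative only.
trace = [
    "init",
    "before:MPI_Send", "after:MPI_Send",
    "before:MPI_Recv", "after:MPI_Recv",
    "before:MPI_Send", "after:MPI_Send",
    "before:MPI_Barrier",
]
model = build_markov_model(trace)
```

The model grows with the number of distinct states and transitions rather than with the length of the trace, which is one plausible reading of why the citing authors call it a compact summary.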
“…However, they largely suffer fundamental shortcomings when they are applied to HPC applications. The most relevant dynamic technique is AUTOMADED introduced by Laguna et al [29]. It draws probabilistic inference about progress based on a coarse control-flow graph, captured as a Markov model, that is generated through dynamic traces.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
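
As a rough illustration of progress inference over such a graph, the sketch below treats a task as least progressed when no other task's current state can reach its state. This is a deliberately simplified, non-probabilistic stand-in and not the cited tool's algorithm: it assumes an acyclic graph, and the names `reachable`, `least_progressed`, the graph, and the task states are all hypothetical.

```python
def reachable(graph, start):
    """Return every state reachable from `start` by following transitions."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, {}):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def least_progressed(graph, task_states):
    """Return tasks whose state is not reachable from any other task's state.

    Assumption: the graph is acyclic. With loops, two states can reach each
    other and a probabilistic ordering would be needed instead.
    """
    result = []
    for task, state in task_states.items():
        others = {s for t, s in task_states.items() if t != task and s != state}
        if not any(state in reachable(graph, other) for other in others):
            result.append(task)
    return result


# Hypothetical transition graph (state -> {next state: probability}).
graph = {
    "A": {"B": 1.0},
    "B": {"C": 0.5, "D": 0.5},
    "C": {"E": 1.0},
    "D": {"E": 1.0},
}
# Hypothetical current state of each task (task id -> state).
task_states = {0: "E", 1: "E", 2: "B", 3: "C"}
stragglers = least_progressed(graph, task_states)
```

Here task 2 sits in state `B`, from which every other task's state can still be reached, so it is the one flagged as lagging.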
“…In [41], logging enhancements in software code helped improve the diagnosis of software errors. In [4], a tool called ProgressDependence Inference that provided insights into the software code segments which caused a failure was presented. However, access to source codes may be restricted by the data centre policies.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Both local component failures and the interaction protocols among system components can, particularly under heavy load, induce system faults and failures. Several studies of large clusters have shown that system failures [1], [2] and the manifestation of faults [3], [4] can be significant problems. These problems occur on single and multiple sources (the nodes) and during the times during which the cluster system is in production.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Currently, programmers rely on arduous manual processes and spend days trying to reproduce rarely occurring concurrency bugs at large scale [11]. These processes often involve ad hoc techniques, such as changing process counts or configurations, choosing different compiler optimizations, or modulating the absolute times at which concurrent events are issued, and are largely ineffective.…”
Section: The Pruner Toolset (citation type: mentioning)
confidence: 99%