Probabilistic diagnosis of performance faults in large-scale parallel applications

Laguna, Ignacio; Ahn, Dong H.; Supinski, Bronis R. de; Bagchi, Saurabh; Gamblin, Todd

doi:10.1145/2370816.2370848

Cited by 18 publications

(15 citation statements)

References 33 publications


“…To overcome these challenges, Laguna et al. [29] used Markov models (MMs) as a compact, scalable summary of the dynamic execution history. They create states in the MM by intercepting each MPI function call, and by capturing the call stack before and after the actual call to the underlying MPI runtime (through a PMPI function call).…”
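
As a rough illustration of the idea in the statement above, the sketch below builds a Markov model from an ordered list of states, where each state pairs a call stack with a "before" or "after" marker around an intercepted call. It is a minimal sketch and not the AutomaDeD implementation: no real MPI or PMPI calls are made, and the `trace` contents, the `state` helper, and `build_markov_model` are hypothetical names chosen for this example.

```python
from collections import Counter, defaultdict


def state(stack, phase):
    """Hypothetical state: a call stack plus 'before' or 'after' the intercepted call."""
    return (tuple(stack), phase)


def build_markov_model(states):
    """Estimate transition probabilities from an ordered list of observed states."""
    counts = defaultdict(Counter)
    # Count each observed transition between consecutive states.
    for src, dst in zip(states, states[1:]):
        counts[src][dst] += 1
    # Normalize the counts per source state into probabilities.
    model = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        model[src] = {dst: n / total for dst, n in dsts.items()}
    return model


# Assumed example states; a real tool would record these inside wrapper functions.
send_before = state(["main", "exchange", "MPI_Send"], "before")
send_after = state(["main", "exchange", "MPI_Send"], "after")
recv_before = state(["main", "exchange", "MPI_Recv"], "before")
recv_after = state(["main", "exchange", "MPI_Recv"], "after")
barrier_before = state(["main", "MPI_Barrier"], "before")
barrier_after = state(["main", "MPI_Barrier"], "after")

# Made-up trace for one task: two send/receive iterations, then a barrier.
trace = [
    send_before, send_after, recv_before, recv_after,
    send_before, send_after, recv_before, recv_after,
    barrier_before, barrier_after,
]

model = build_markov_model(trace)
for src, dsts in model.items():
    print(src, "->", dsts)
```

In this made-up trace, the state after the receive is followed once by another send and once by the barrier, so the model assigns each of those transitions a probability of 0.5. The model's size depends on the number of distinct states rather than on the length of the trace, which is the sense in which it acts as a compact summary.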

Section: Markov Models As a Scalable Summary Of Execution (mentioning)

confidence: 99%


Accurate application progress analysis for large-scale parallel debugging

Mitra, Laguna, Ahn, et al. 2014

Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation

Self Cite


Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making it difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads; either they use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, PRODOMETER, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI, which only manifested at large scale.


“…However, they largely suffer fundamental shortcomings when they are applied to HPC applications. The most relevant dynamic technique is AUTOMADED introduced by Laguna et al. [29]. It draws probabilistic inference about progress based on a coarse control-flow graph, captured as a Markov model, that is generated through dynamic traces.…”

Section: Introduction (mentioning)

confidence: 99%

Accurate application progress analysis for large-scale parallel debugging

Mitra, Laguna, Ahn, et al. 2014

Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation

Self Cite


“…In [41], logging enhancements in software code helped improve the diagnosis of software errors. In [4], a tool called Progress Dependence Inference that provided insights into the software code segments which caused a failure was presented. However, access to source codes may be restricted by the data centre policies.…”

Section: Related Work (mentioning)

confidence: 99%

“…Both local component failures and the interaction protocols among system components can, particularly under heavy load, induce system faults and failures. Several studies of large clusters have shown that system failures [1], [2] and the manifestation of faults [3], [4] can be significant problems. These problems occur on single and multiple sources (the nodes) and during the times during which the cluster system is in production.…”

Section: Introduction (mentioning)

confidence: 99%

Linking Resource Usage Anomalies with System Failures from Cluster Log Data

Chuah, Jhumka, Narasimhamurthy, et al. 2013

2013 IEEE 32nd International Symposium on Reliable Distributed Systems


Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage on system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system that applies TACC Stats data to identify resource use anomalies and applies log analysis to link resource use anomalies with system failures. Application of ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs and diagnosing the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.


“…Currently, programmers rely on arduous manual processes and spend days trying to reproduce rarely-occurring concurrency bugs at large scale [11]. These processes often involve ad hoc techniques, such as changing process counts or configurations, choosing different compiler optimizations, or modulating the absolute times at which concurrent events are issued, and are largely ineffective.…”

Section: The Pruner Toolset (mentioning)

confidence: 99%

Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset

Ahn, Lee, Gopalakrishnan, et al. 2013

Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science A

Self Cite


Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers achieving this ability on high performance computing (HPC) systems. To aid in managing the adverse effects of non-determinism, prior work has provided techniques to achieve bit-precise reproducibility, but most of them focus only on small-scale parallelism. While scalable techniques recently emerged, they are disparate and target special purposes, e.g., single-schedule domains. On current systems with O(10^6) compute cores and future ones with O(10^9), any technique that does not embrace a unified, targeted, and multilevel approach will fall short of providing reproducibility. In this paper, we argue for a common toolset that embodies this approach, where programmers select and compose complementary tools and can effectively, yet scalably, analyze, control, and eliminate sources of non-determinism at scale. This allows users to gain reproducibility only to the levels demanded by specific code development needs. We present our research agenda and ongoing work toward this goal.
