SReplay

Qian, Xuehai; Sen, Koushik; Hargrove, Paul; Iancu, Costin

doi:10.1145/2925426.2926264

Cited by 4 publications

(2 citation statements)

References 60 publications

(60 reference statements)

“…State-of-the-art record-and-replay tools such as ReMPI (Sato et al, 2015) target production-scale runs and prioritize scalability in terms of runtime and record size. Other record-and-replay tools target hybrid MPI + OpenMP executions (Budanur et al, 2012), MPI applications using one-sided communication (Qian et al, 2016b,a), replay of isolated subgroups of processes (Xue et al, 2009), and probabilistic replay (Park et al, 2009). In addition, tools such as NINJA (Sato et al, 2017) are used in conjunction with record-and-replay tools to improve the chances of capturing nondeterministic bugs.…”

Section: Existing Graph Algorithms In HPC and Unaddressed Needs In No... (mentioning)

confidence: 99%

A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing

Bhowmick, Bell, Taufer (2023)

The International Journal of High Performance Computing Applica

The convergence of extremely high levels of hardware concurrency and the effective overlap of computation and communication in asynchronous executions has resulted in increasing nondeterminism in High-Performance Computing (HPC) applications. Nondeterminism can manifest at multiple levels: from low-level communication primitives to libraries to application-level functions. No matter its source, nondeterminism can drastically increase the cost of result reproducibility, debugging workflows, testing parallel programs, or ensuring fault-tolerance. Nondeterministic executions of HPC applications can be modeled as event graphs, and the applications’ nondeterministic behavior can be understood and, in some cases, mitigated using graph comparison algorithms. However, a connection between graph comparison algorithms and approaches to understanding nondeterminism in HPC still needs to be established. This survey article moves the first steps toward establishing a connection between graph comparison algorithms and nondeterminism in HPC with its three contributions: it provides a survey of different graph comparison algorithms and a timeline for each category’s significant works; it discusses how existing graph comparison methods do not fully support properties needed to understand nondeterministic patterns in HPC applications; and it presents the open challenges that should be addressed to leverage the power of graph comparisons for the study of nondeterminism in HPC applications.

“…MPI's one-sided communication routines in particular pose unique challenges to R&R. Qian et al. proposed two techniques for addressing this challenge: OPR [42] and its successor SReplay [43]. SReplay proposes a hybrid-replay scheme which permits replay of subgroups of processes.…”

Section: Debugging-centric Techniques (mentioning)

confidence: 99%

Record-and-Replay Techniques for HPC Systems: A Survey

Chapp, Sato, Ahn, et al. (2018)

JSFI

Record-and-replay techniques provide the ability to record executions of nondeterministic applications and re-execute them identically. These techniques find use in the contexts of debugging, reproducibility, and fault-tolerance, especially in the presence of nondeterministic factors such as message races. Record-and-replay techniques are highly diverse in terms of the fidelity of replay they provide, the assumptions they make about the recorded application, the programming models they target, and the runtime overheads they impose. In the high performance computing (HPC) environment, all the above factors must be considered in concert, thus presenting additional implementation challenges. In this manuscript, we survey record-and-replay techniques in terms of the programming models they target and the workloads on which they were evaluated, providing a categorization of these techniques benefiting application developers and researchers targeting exascale challenges. This manuscript answers three questions through this survey: What are the gaps in the existing space of record-and-replay techniques? What is the roadmap to widespread use of record-and-replay on production-scale HPC workloads? And, what are the critical open problems that must be addressed to make record-and-replay viable at exascale?

Debugging MPI Implementations via Reduction-to-Primitives

Cooperman, Zhao (2022)

2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)
