Exposing errors related to weak memory in GPU applications

Sorensen, Tyler; Donaldson, Alastair F.

doi:10.1145/2908080.2908114

Cited by 30 publications

(11 citation statements)

References 28 publications

“…The Fence-SC order can prevent weak memory ordering behaviors that acquire/release alone cannot prevent, such as the well-known store buffering (SB) pattern of Figure 6. The introduction of fence.sc in the newest generation of PTX corrects the weak SB behavior seen with membar in previous NVIDIA GPU architectures [51] (§9.7.12.3).…”

Section: Fence-sc Order (mentioning)

confidence: 76%
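
The store buffering pattern named in the statement above can be sketched on the host side in C++. This is only an analogy under stated assumptions: it uses C++11 atomics and `std::atomic_thread_fence(std::memory_order_seq_cst)` rather than PTX `fence.sc`, and the function names `run_sb_once` and `count_weak_outcomes` are hypothetical, not taken from any of the cited papers. Each thread stores to one location and then loads the other; without a sequentially consistent fence between the store and the load, both loads may return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Runs one store-buffering (SB) round and returns the two observed loads.
// With use_fence = true, a sequentially consistent fence sits between each
// thread's store and load, which forbids the outcome (0, 0) in C++11.
// Hypothetical helper for illustration only.
std::pair<int, int> run_sb_once(bool use_fence) {
    std::atomic<int> x{0}, y{0};
    int r0 = -1, r1 = -1;

    std::thread t0([&] {
        x.store(1, std::memory_order_relaxed);
        if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
        r0 = y.load(std::memory_order_relaxed);
    });
    std::thread t1([&] {
        y.store(1, std::memory_order_relaxed);
        if (use_fence) std::atomic_thread_fence(std::memory_order_seq_cst);
        r1 = x.load(std::memory_order_relaxed);
    });
    t0.join();
    t1.join();

    // Both threads have finished, so both registers must have been written.
    assert(r0 != -1 && r1 != -1);
    return std::make_pair(r0, r1);
}

// Counts how many of `rounds` runs show the weak outcome r0 == 0 && r1 == 0.
int count_weak_outcomes(int rounds, bool use_fence) {
    int weak = 0;
    for (int i = 0; i < rounds; ++i) {
        std::pair<int, int> r = run_sb_once(use_fence);
        if (r.first == 0 && r.second == 0) ++weak;
    }
    return weak;
}
```

Calling `count_weak_outcomes(n, true)` should always return 0, because the fenced variant rules out the (0, 0) outcome. Calling it with `false` may or may not return a nonzero count, depending on the hardware and on timing, which is the kind of incompleteness of empirical testing that the citing papers point out.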

“…NVIDIA GPUs since Kepler have occasionally had memory model issues [6,51,58]. The analysis in this paper aims to place NVIDIA GPU architectures starting from Volta and the PTX ISA from version 6.0 on a solid and more reliable theoretical foundation.…”

Section: Gpu Programming and Memory Models (mentioning)

confidence: 99%

“…
• Empirical testing often runs into tractability limits and is inherently incomplete [2,35].
• Memory models change regularly, either intentionally [12] or because bugs are found [11,32,51].
• Writing proofs can be hard and/or tedious, especially because the use of rigorous but pedantic theorem provers such as Coq [4] or HOL [5] is the accepted standard.…”

Section: Introduction (mentioning)

confidence: 99%

A Formal Analysis of the NVIDIA PTX Memory Consistency Model

Lustig

Sahasrabuddhe

Giroux

2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

This paper presents the first formal analysis of the official memory consistency model for the NVIDIA PTX virtual ISA. Like other GPU memory models, the PTX memory model is weakly ordered but provides scoped synchronization primitives that enable GPU program threads to communicate through memory. However, unlike some competing GPU memory models, PTX does not require data race freedom, and this results in PTX using a fundamentally different (and more complicated) set of rules in its memory model. As such, PTX has a clear need for a rigorous and reliable memory model testing and analysis infrastructure. We break our formal analysis of the PTX memory model into multiple steps that collectively demonstrate its rigor and validity. First, we adapt the English language specification from the public PTX documentation into a formal axiomatic model. Second, we derive an up-to-date presentation of an OpenCL-like scoped C++ model and develop a mapping from the synchronization primitives of that scoped C++ model onto PTX. Third, we use the Alloy relational modeling tool to empirically test the correctness of the mapping. Finally, we compile the model and mapping into Coq and build a full machine-checked proof that the mapping is sound for programs of any size. Our analysis demonstrates that in spite of issues in previous generations, the new NVIDIA PTX memory model is suitable as a sound compilation target for GPU programming languages such as CUDA. CCS Concepts • Hardware → Theorem proving and SAT solving; • Software and its engineering → Consistency.

“…Our experiments show that our implementation, an extension to tsan, can detect races that are beyond the scope of the original tool, and that our extended instrumentation still enables analysis of large applications: the Firefox and Chromium web browsers. Avenues for future work include: developing more advanced heuristics for exploring captured weak behaviours; devising further instrumentation techniques to capture a larger fragment of the memory model; conducting a larger-scale experimental study of data race defects in C++11 software, to understand the extent to which weak memory-related bugs, vs. bugs that can already manifest under SC semantics, are a problem in practice; and designing extensions of our technique to cater for the OpenCL memory model [8], facilitating weak-memory aware data race detection for software running on GPU architectures, which are known to have weak memory models [4] that can lead to subtle defects in practical applications [44].…”

Section: Results (mentioning)

confidence: 99%

Dynamic race detection for C++11

Lidbury

Donaldson

2017

SIGPLAN Not.

Self Cite

The intricate rules for memory ordering and synchronisation associated with the C/C++11 memory model mean that data races can be difficult to eliminate from concurrent programs. Dynamic data race analysis can pinpoint races in large and complex applications, but the state-of-the-art ThreadSanitizer (tsan) tool for C/C++ considers only sequentially consistent program executions, and does not correctly model synchronisation between C/C++11 atomic operations. We present a scalable dynamic data race analysis for C/C++11 that correctly captures C/C++11 synchronisation, and uses instrumentation to support exploration of a class of non sequentially consistent executions. We concisely define the memory model fragment captured by our instrumentation via a restricted axiomatic semantics, and show that the axiomatic semantics permits exactly those executions explored by our instrumentation. We have implemented our analysis in tsan, and evaluate its effectiveness on benchmark programs, enabling a comparison with the CDSChecker tool, and on two large and highly concurrent applications: the Firefox and Chromium web browsers. Our results show that our method can detect races that are beyond the scope of the original tsan tool, and that the overhead associated with applying our enhanced instrumentation to large applications is tolerable.

“…Our method is based on random differential testing [20], though we emphasize that this is not a general purpose approach and is tailored specifically for our use case. For example, we anticipate a false positive rate for kernels with subtle sources of non-determinism which more thorough methods may expose [21][22][23], however we deemed such methods unnecessary for our purpose of performance modeling.…”

Section: Assert (mentioning)

confidence: 99%

Synthesizing benchmarks for predictive modeling

Cummins

Petoumenos

Wang

et al.

2017

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from handwritten code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27×. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30×.
