Optimistic parallelism requires abstractions

Kulkarni, Milind; Pingali, Keshav; Walter, Bruce; Ramanarayanan, Ganesh; Bala, Kavita; Chew, L. Paul

doi:10.1145/1562164.1562188

Cited by 76 publications

(115 citation statements)

References 19 publications

Mentioning: 113


“…The sequential implementation is written in C++ and compiled with gcc and the -O3 flag. The reference implementation is written in Java on top of the Galois framework [21]. The Java Virtual Machine used is the 64-bit Sun HotSpot server version 1.6.0_24.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

A GPU implementation of inclusion-based points-to analysis

Méndez-Lojo, Burtscher, Pingali. 2012. Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Self-citation.


Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components.

In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores.

Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.
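For readers unfamiliar with the algorithm named in this abstract, the following is a minimal sequential sketch of textbook Andersen-style inclusion-based points-to analysis, written as a worklist over a constraint graph. It is not the GPU implementation the paper describes. The function name `andersen` and the tuple encoding of the four constraint kinds are assumptions made for illustration. Note how load and store constraints add new edges while the analysis runs, which is the graph modification the abstract refers to.

```python
from collections import defaultdict, deque


def andersen(addr_of, copies, loads, stores):
    """Textbook worklist solver (illustrative encoding, not the paper's).

    addr_of: (p, a) pairs for  p = &a
    copies:  (p, q) pairs for  p = q
    loads:   (p, q) pairs for  p = *q
    stores:  (p, q) pairs for  *p = q
    Returns a dict mapping each variable to its non-empty points-to set.
    """
    pts = defaultdict(set)    # points-to sets
    succ = defaultdict(set)   # edge q -> p means pts(q) is a subset of pts(p)

    for p, a in addr_of:
        pts[p].add(a)
    for p, q in copies:
        succ[q].add(p)

    loads_by_ptr = defaultdict(list)   # q -> [p, ...] for p = *q
    for p, q in loads:
        loads_by_ptr[q].append(p)
    stores_by_ptr = defaultdict(list)  # p -> [q, ...] for *p = q
    for p, q in stores:
        stores_by_ptr[p].append(q)

    worklist = deque(pts.keys())
    while worklist:
        n = worklist.popleft()
        # Complex constraints: every pointee of n may add a new graph edge.
        for a in list(pts[n]):
            for p in loads_by_ptr[n]:      # p = *n  adds edge a -> p
                if p not in succ[a]:
                    succ[a].add(p)
                    worklist.append(a)
            for q in stores_by_ptr[n]:     # *n = q  adds edge q -> a
                if a not in succ[q]:
                    succ[q].add(a)
                    worklist.append(q)
        # Propagate pts(n) along outgoing inclusion edges.
        for m in list(succ[n]):
            before = len(pts[m])
            pts[m] |= pts[n]
            if len(pts[m]) != before:
                worklist.append(m)

    return {v: s for v, s in pts.items() if s}
```

For example, the program `p = &a; q = &b; r = p; *r = q; s = *p` can be passed as `andersen([("p", "a"), ("q", "b")], [("r", "p")], [("s", "p")], [("r", "q")])`.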


“…Isolation types are also similar to the idea of transactional boosting, coarse-grained transactions, and semantic commutativity [16,19,20], which eliminate false conflicts by raising the abstraction level. Isolation types go farther though: for example, the type versioned T does not just avoid false conflicts, but resolves true conflicts deterministically (in a not necessarily serializable way).…”

Section: Related Work (mentioning)

confidence: 99%

Semantics of Concurrent Revisions

Burckhardt, Leijen. 2011. Programming Languages and Systems.


Abstract. Enabling applications to execute various tasks in parallel is difficult if those tasks exhibit read and write conflicts. We recently developed a programming model based on concurrent revisions that addresses this challenge in a novel way: each forked task gets a conceptual copy of all the shared state, and state changes are integrated only when tasks are joined, at which time write-write conflicts are deterministically resolved.

In this paper, we study the precise semantics of this model, in particular its guarantees for determinacy and consistency. First, we introduce a revision calculus that concisely captures the programming model. Despite allowing concurrent execution and locally nondeterministic scheduling, we prove that the calculus is confluent and guarantees determinacy. We show that the consistency guarantees of our calculus are a logical extension of snapshot isolation with support for conflict resolution and nesting. Moreover, we discuss how custom merge functions can provide stronger guarantees for particular data types that are tailored to the needs of the application.

Finally, we show we can visualize the nonlinear history of state in our computations using revision diagrams that clarify the synchronization between tasks and allow local reasoning about state updates.
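The fork/join behavior summarized in this abstract can be illustrated with a small sketch. The `Revision` class and its methods are hypothetical names, not the authors' API or calculus. The sketch also assumes one particular merge rule: on a write-write conflict, the joined task's value wins. The abstract says only that such conflicts are resolved deterministically.

```python
import threading


class Revision:
    """Hypothetical revision: a private copy of state plus a write log."""

    def __init__(self, snapshot):
        self.state = dict(snapshot)  # conceptual copy of all shared state
        self.written = set()         # keys this revision has written
        self._thread = None

    def read(self, key):
        return self.state[key]

    def write(self, key, value):
        self.state[key] = value
        self.written.add(key)

    def fork(self, task):
        # The child gets its own snapshot, taken before it starts running,
        # so parent and child never touch the same dictionary.
        child = Revision(self.state)
        child._thread = threading.Thread(target=task, args=(child,))
        child._thread.start()
        return child

    def join(self, child):
        # Changes are integrated only here. Assumed rule: for every key the
        # child wrote, the child's value replaces the parent's.
        child._thread.join()
        for key in child.written:
            self.state[key] = child.state[key]
        self.written |= child.written


def demo():
    main = Revision({"x": 0, "y": 0})

    def task(rev):
        rev.write("x", rev.read("x") + 1)  # sees the snapshot value 0
        rev.write("y", 5)

    child = main.fork(task)
    main.write("x", 10)  # conflicts with the child's write to x
    main.join(child)
    return main.state
```

Because each revision works on its own copy, the outcome of `demo()` does not depend on how the two threads are scheduled: under the assumed rule it is always `{"x": 1, "y": 5}`.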


“…The current threading library (e.g., Pthreads), a combination of compiler directives and libraries (e.g., OpenMP) and optimistic parallelization [1][2][3] were not designed to support programming for tolerating off-chip latency, or to handle efficient allocation and movement of data across hierarchy levels. It is often the case that in the underline thread execution model, a thread is enabled and activated as soon as all data and control dependencies are satisfied.…”

Section: Introduction (mentioning)

confidence: 99%

Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture

Tan, Sreedhar, Gao. 2008. Languages and Compilers for Parallel Computing.


Abstract. This paper presents a new technique to optimize locality of irregular programs by leveraging parallelism on a massive many-core architecture, IBM Cyclops64 (C64). The key idea is to achieve Just-In-Time Locality, which ensures that data are available locally for computation to use. The proposed percolation model for Just-In-Time Locality moves data proactively close to the computation and organizes the data layout such that locality is exploited effectively. The percolation model opens a door for exploiting locality through parallelism, which is an advantage of the future many-core architecture. We implemented the percolation strategy in the context of two irregular applications on C64. Our experimental results are very encouraging and we get an order of magnitude improvement in performance of irregular applications. We also drastically improve the scalability of the applications that we studied.
