Bradley C. Kuszmaul scite author profile

Cilk (pronounced "silk") is a C-based runtime system for multithreaded parallel programming. In this paper, we document the efficiency of the Cilk work-stealing scheduler, both empirically and analytically. We show that on real and synthetic applications, the "work" and "critical-path length" of a Cilk computation can be used to model performance accurately. Consequently, a Cilk programmer can focus on reducing the computation's work and critical-path length, insulated from load balancing and other runtime scheduling issues. We also prove that for the class of "fully strict" (well-structured) programs, the Cilk scheduler achieves space, time, and communication bounds all within a constant factor of optimal. The Cilk runtime system currently runs on the Connection Machine CM5 MPP, the Intel Paragon MPP, the Sun Sparcstation SMP, and the Cilk-NOW network of workstations. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and the ⋆Socrates chess program, which won second prize in the 1995 World Computer Chess Championship.
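The role of work and critical-path length in predicting running time can be illustrated with a minimal sketch. It assumes the familiar additive form T_P ≈ T_1/P + c·T_∞, which is not stated in the abstract; the function names and the constant `c_span` are hypothetical and chosen only for illustration.

```python
def predicted_time(work, span, processors, c_span=1.0):
    """Estimate running time on `processors` processors.

    Assumed model (not taken from the abstract): T_P ~ T_1/P + c_span * T_inf,
    where `work` is T_1 and `span` is the critical-path length T_inf.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return work / processors + c_span * span


def parallelism(work, span):
    """Average parallelism T_1 / T_inf: roughly how many processors can be kept busy."""
    return work / span


if __name__ == "__main__":
    # Hypothetical computation: 1000 units of work, critical path of 10 units.
    for p in (1, 10, 100):
        print(p, predicted_time(1000.0, 10.0, p))
```

Under this assumed model, adding processors helps only while P stays well below the parallelism; beyond that point the critical-path term dominates.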


Unbounded Transactional Memory

Ananian, Asanović, Kuszmaul et al.

325

232


Hardware transactional memory should support unbounded transactions: transactions of arbitrary size and duration. We describe a hardware implementation of unbounded transactional memory, called UTM, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory. We performed a cycle-accurate simulation of a simplified architecture, called LTM. LTM is based on UTM but is easier to implement, because it does not change the memory subsystem outside of the processor. LTM allows nearly unbounded transactions, whose footprint is limited only by physical memory size and whose duration by the length of a timeslice. We assess UTM and LTM through microbenchmarking and by automatically converting the SPECjvm98 Java benchmarks and the Linux 2.4.19 kernel to use transactions instead of locks. We use both cycle-accurate simulation and instrumentation to understand benchmark behavior. Our studies show that the common case is small transactions that commit, even when contention is high, but that some applications contain very large transactions. For example, although 99.9% of transactions in the Linux study touch 54 cache lines or fewer, some transactions touch over 8000 cache lines. Our studies also indicate that hardware support is required, because some applications spend over half their time in critical regions. Finally, they suggest that hardware support for transactions can make Java programs run faster than when run using locks and can increase the concurrency of the Linux kernel by as much as a factor of 4 with no additional programming work.


The pochoir stencil compiler

et al. 2011


A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++, which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
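To make the term "stencil computation" concrete, here is a minimal sketch of a one-dimensional, three-point heat stencil with periodic boundaries, written as a plain serial loop. It is not Pochoir code and does not use trapezoidal decomposition or hyperspace cuts; the function names and the coefficient `alpha` are assumptions made for illustration.

```python
def heat_step_periodic(u, alpha=0.25):
    """Apply one time step of a 3-point stencil to grid `u`.

    Each point is updated from itself and its two nearest neighbors;
    indices wrap around, which gives a periodic boundary condition.
    """
    n = len(u)
    return [
        u[i] + alpha * (u[(i - 1) % n] - 2 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


def run_stencil(u, steps, alpha=0.25):
    """Repeat the stencil update `steps` times (naive, not cache-efficient)."""
    for _ in range(steps):
        u = heat_step_periodic(u, alpha)
    return u


if __name__ == "__main__":
    print(run_stencil([0.0, 0.0, 4.0, 0.0], steps=1))
```

This naive loop sweeps the whole grid at every time step, which is the kind of baseline that cache-efficient decompositions are designed to improve on.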


Adversarial contention resolution for simple channels

et al. 2005


This paper analyzes the worst-case performance of randomized backoff on simple multiple-access channels. Most previous analysis of backoff has assumed a statistical arrival model. For batched arrivals, in which all n packets arrive at time 0, we show the following tight high-probability bounds. Randomized binary exponential backoff has makespan Θ(n lg n), and more generally, for any constant r, r-exponential backoff has makespan Θ(n log_{lg r} n). Quadratic backoff has makespan Θ((n/lg n)^{3/2}), and more generally, for r > 1, r-polynomial backoff has makespan Θ((n/lg n)^{1+1/r}). Thus, for batched inputs, both exponential and polynomial backoff are highly sensitive to backoff constants. We exhibit a monotone superpolynomial subexponential backoff algorithm, called loglog-iterated backoff, that achieves makespan Θ(n lg lg n / lg lg lg n). We provide a matching lower bound showing that this strategy is optimal among all monotone backoff algorithms. Of independent interest is that this lower bound was proved with a delay sequence argument. In the adversarial-queuing model, we present the following stability and instability results for exponential backoff and loglog-iterated backoff. Given a (λ, T)-stream, in which at most n = λT packets arrive in any interval of size T, exponential backoff is stable for arrival rates of λ = O(1/lg n) and unstable for arrival rates of λ = Ω(lg lg n / lg n); loglog-iterated backoff is stable for arrival rates of λ = O(1/(lg lg n lg n)) and unstable for arrival rates of λ = Ω(1/lg n). Our instability results show that bursty input is close to being worst-case for exponential backoff and variants and that even small bursts can create instabilities in the channel.
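The batched-arrival setting can be explored with a small simulation. The sketch below assumes a windowed model that the abstract does not spell out: all n packets start together, every remaining packet picks one uniformly random slot in each window, a slot succeeds only if exactly one packet chose it, and window sizes grow by a factor of r. The function name and parameters are hypothetical.

```python
import random
from collections import Counter


def batched_backoff_makespan(n, r=2.0, seed=0):
    """Simulate r-exponential backoff for n packets arriving at time 0.

    Assumed model: windows of size 1, r, r^2, ... (rounded down); each
    remaining packet transmits in one random slot per window; a slot with
    exactly one transmission delivers that packet. Returns the number of
    slots elapsed when the last packet is delivered.
    """
    if r <= 1:
        raise ValueError("r must be greater than 1")
    rng = random.Random(seed)
    remaining = n
    elapsed = 0
    window = 1.0
    while remaining > 0:
        w = max(1, int(window))
        counts = Counter(rng.randrange(w) for _ in range(remaining))
        successes = [slot for slot, c in counts.items() if c == 1]
        if len(successes) == remaining:
            # Everything left was delivered; stop at the last successful slot.
            return elapsed + max(successes) + 1
        remaining -= len(successes)
        elapsed += w
        window *= r
    return elapsed


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, batched_backoff_makespan(n))
```

A single seeded run like this only illustrates the setup; it says nothing about the high-probability bounds, which concern the distribution over many runs.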


The Network Architecture of the Connection Machine CM-5

Leiserson, Abuhamdeh, Douglas et al. 1996

Journal of Parallel and Distributed Computing

254

119

