With the end of Dennard scaling, today there is an urgent need for algorithms, programming language systems and tools, and hardware that deliver on the potential of parallelism. This work (from my PhD dissertation, supervised by Ken Kennedy) was one of the early papers to optimize for, and experimentally explore, the tension between data locality and parallelism on shared-memory machines. A key result was that false sharing of cache lines between processors with local caches on separate chips was disastrous to the performance and scaling of applications; a short sketch at the end of this section illustrates the effect. This retrospective includes a short personal tour through the history of parallel computing, a discussion of locality and parallelism modeling versus a polyhedral formulation of optimizing dense matrix codes, and how this problem is still relevant to compilers today. I end with a short memorial to my deceased co-author and advisor, Ken Kennedy.

Parallel computing seemed to be entering its heyday in the late 1980s and early 1990s. At Rice in 1989, Ken Kennedy was awarded an NSF Science and Technology Center, the Center for Research on Parallel Computation (CRPC), as its Principal Investigator. The CRPC started with seven sites and eventually included 400 researchers, staff, and graduate students, whose technical expertise spanned parallel algorithms, compilers, runtimes, and hardware. The CRPC vision that Ken, his collaborators, and students shared was to invent parallel algorithms for critical problems in science, coupled with programming language tools, such as compilers, runtime systems, and programming environments, that made them run fast. We were not trying to solve the dusty-deck problem of automatically converting sequential algorithms to parallel ones; we understood that parallel and sequential algorithms for the same problem require different solutions. However, tools would do the heavy lifting of mapping application parallelism to hardware parallelism, so that programmers would not have to reimplement their algorithms for each new parallel architecture. A key aspect of this problem is balancing parallelism, sharing between tasks, and memory usage, which was the topic our paper addressed.

In this same period, a number of established companies and startups, such as Sequent, had introduced parallel machines. The Sequent Symmetry was the machine on which we reported our results. It was not yet clear that the research and development challenges of parallel computing would make it too costly to win in the marketplace in the short term. By the mid-1990s, this generation of parallel computers together with some of the compa...
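The false-sharing effect mentioned above remains easy to reproduce on today's shared-memory machines. The sketch below is illustrative only, not code from the paper: it assumes a 64-byte cache line, eight threads, and an OpenMP compiler (for example, gcc -fopenmp), and all names and constants are hypothetical. Each thread repeatedly increments its own counter, first in an unpadded array whose elements share cache lines, then in a padded array where each counter occupies its own line.

    /* Minimal sketch of false sharing (not code from the paper).
     * Assumes a 64-byte cache line and an OpenMP compiler. */
    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS   8
    #define ITERS      100000000L
    #define LINE_BYTES 64                 /* assumed cache-line size */

    /* Unpadded: neighboring counters share a cache line. */
    static volatile long shared_counters[NTHREADS];

    /* Padded: each counter occupies its own cache line. */
    static volatile struct {
        long value;
        char pad[LINE_BYTES - sizeof(long)];
    } padded_counters[NTHREADS];

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                shared_counters[id]++;        /* false sharing */
        }
        double t1 = omp_get_wtime();

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++)
                padded_counters[id].value++;  /* private cache lines */
        }
        double t2 = omp_get_wtime();

        printf("unpadded: %.2fs   padded: %.2fs\n", t1 - t0, t2 - t1);
        return 0;
    }

On a typical multi-core machine the unpadded loop runs several times slower, because every write forces the shared cache line to bounce between processors' caches, the same kind of coherence traffic that undermined performance and scaling in our experiments.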