SC 2008 - International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/sc.2008.5222004

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Abstract: Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions that the mainstream and scientific computing industries have faced in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameter…
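To make the paper's subject concrete, here is a minimal sketch of the kind of nearest-neighbor kernel such work tunes: a naive 7-point sweep over a 3D grid. The function name, the coefficients c0 and c1, and the one-cell ghost layer are illustrative assumptions, not the paper's benchmark code.

```c
#include <stddef.h>

/* Naive 7-point 3D stencil sweep (illustrative sketch; coefficients and
 * layout are assumptions, not the paper's benchmark).  Each interior
 * point of `in` is combined with its six face neighbors and written to
 * `out`; a one-cell ghost layer around the domain is assumed. */
static void stencil7_naive(const double *in, double *out,
                           size_t nx, size_t ny, size_t nz,
                           double c0, double c1)
{
#define IDX(i, j, k) ((size_t)(k) * ny * nx + (size_t)(j) * nx + (size_t)(i))
    for (size_t k = 1; k + 1 < nz; ++k)
        for (size_t j = 1; j + 1 < ny; ++j)
            for (size_t i = 1; i + 1 < nx; ++i)
                out[IDX(i, j, k)] =
                    c0 * in[IDX(i, j, k)] +
                    c1 * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)] +
                          in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)] +
                          in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
#undef IDX
}
```

An auto-tuner in the spirit of the paper generates many variants of such a loop nest (e.g. blocked, unrolled, software-prefetched, NUMA-aware) and times them on each target machine to pick the fastest.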

Cited by 378 publications (408 citation statements)
References 14 publications

“…This hand-tuning rarely generalizes well to new hardware generations or different input domains, is prone to error, results in unmaintainable code, and does not even guarantee optimal performance. One of the reasons is that GPU kernels can yield staggeringly large optimization spaces [Datta et al, 2008]. The problem is further compounded by the fact that these spaces can be highly discontinuous [Ryoo et al, 2008], difficult to explore, and optimal performance is often realized at the edge of "performance cliffs" induced by hard device-specific constraints (e.g.…”
Section: Motivation (mentioning)
confidence: 99%
“…Two major auto-tuning approaches have emerged in the extensive literature covering the subject (see surveys in e.g. [Vuduc et al, 2001, Williams, 2008, Datta et al, 2008, Cavazos, 2008, Li et al, 2009, Park et al, 2011]): analytical model-driven optimization and empirical optimization [Yotov et al, 2003].…”
Section: Auto-tuning (mentioning)
confidence: 99%
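To ground the distinction in the quote, the toy harness below sketches the empirical side: run candidate configurations on the actual machine, time them, and keep the fastest. Everything here (the kernel_fn type, the candidate list, timing with a POSIX monotonic clock) is an assumed illustration, not an interface from the cited works.

```c
#include <float.h>
#include <stdio.h>
#include <time.h>

/* Toy empirical auto-tuner (assumed example, not code from the cited
 * papers): try each candidate block size for some tunable kernel,
 * time it on the actual machine, and keep the fastest setting. */
typedef void (*kernel_fn)(int block);   /* hypothetical tunable kernel */

static double time_kernel(kernel_fn run, int block)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run(block);                         /* measure, rather than model, the cost */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

static int tune_block_size(kernel_fn run, const int *candidates, int n)
{
    int best = candidates[0];
    double best_time = DBL_MAX;
    for (int i = 0; i < n; ++i) {
        double t = time_kernel(run, candidates[i]);
        printf("block=%d  %.6f s\n", candidates[i], t);
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    return best;    /* the setting an analytical model would have to predict */
}
```

A model-driven tuner would instead predict the best setting from machine parameters (cache sizes, bandwidth, latency) without running the candidates; empirical search trades modeling effort for benchmarking time.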
“…By looping over all cells, it generates the entire S^(n). This is analogous to stencil computation for solving partial differential equations, in which a stencil (equivalent to a computation pattern in our framework) defines local computation rules for each grid point and its neighbor grid points, and the stencil is applied to all grid points in a lattice to solve the problem [27,28].…”
Section: Uniform-cell Pattern MD (mentioning)
confidence: 99%
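The analogy in the quote, a stencil as a local computation pattern applied uniformly at every grid point, can be sketched as follows; the offset/weight representation and the names below are illustrative assumptions, not the cited framework's data structures.

```c
/* A stencil expressed as an explicit "computation pattern": a set of
 * neighbor offsets and weights applied with the same rule at every
 * interior grid point (illustrative sketch only). */
typedef struct {
    int    di, dj;     /* neighbor offset relative to the current point */
    double weight;     /* coefficient applied to that neighbor's value  */
} stencil_point;

static void apply_pattern(const double *in, double *out, long nx, long ny,
                          const stencil_point *pat, int npts)
{
    for (long j = 1; j < ny - 1; ++j)
        for (long i = 1; i < nx - 1; ++i) {
            double acc = 0.0;
            for (int p = 0; p < npts; ++p)       /* same rule at every cell */
                acc += pat[p].weight *
                       in[(j + pat[p].dj) * nx + (i + pat[p].di)];
            out[j * nx + i] = acc;
        }
}

/* Example pattern: the classic 2D 5-point Laplacian stencil. */
static const stencil_point five_point[5] = {
    {0, 0, -4.0}, {-1, 0, 1.0}, {1, 0, 1.0}, {0, -1, 1.0}, {0, 1, 1.0}
};
```

Calling apply_pattern(in, out, nx, ny, five_point, 5) performs one Jacobi-style sweep; swapping in a different pattern changes the computation rule without touching the traversal.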
“…A number of works have addressed optimizations of stencil computations on emerging multicore platforms [7], [16], [17], [6], [27], [26], [11], [37], [10], [4], [9], [40], [38], [41], [8], [39]. In addition, other transformations such as tiling of stencil computations for multicore architectures have been addressed in [43], [25], [21], [34].…”
Section: Related Work (mentioning)
confidence: 99%
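As a minimal illustration of the tiling mentioned in the passage above (the tile sizes and loop structure are assumptions for this sketch, not any specific cited scheme), a 2D 5-point sweep can be blocked so that each tile's working set stays cache-resident:

```c
/* Cache-blocked (tiled) 2D 5-point stencil sweep.  The outer loops walk
 * over TI x TJ tiles of the grid; the inner loops update one tile at a
 * time so its data is reused from cache.  The tile sizes are assumed
 * values that an auto-tuner would search over. */
#define TI 64
#define TJ 64

static void stencil5_tiled(const double *in, double *out, long nx, long ny)
{
    for (long jj = 1; jj < ny - 1; jj += TJ)
        for (long ii = 1; ii < nx - 1; ii += TI) {
            long jmax = (jj + TJ < ny - 1) ? jj + TJ : ny - 1;
            long imax = (ii + TI < nx - 1) ? ii + TI : nx - 1;
            for (long j = jj; j < jmax; ++j)      /* points inside this tile */
                for (long i = ii; i < imax; ++i)
                    out[j * nx + i] =
                        -4.0 * in[j * nx + i] +
                        in[j * nx + (i - 1)] + in[j * nx + (i + 1)] +
                        in[(j - 1) * nx + i] + in[(j + 1) * nx + i];
        }
}
```

Temporal blocking (time skewing) across successive sweeps, which several of the works cited above address, extends the same idea along the time dimension.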