Microarchitecture Sensitive Empirical Models for Compiler Optimizations

Vaswani, Kapil; Thazhuthaveetil, Matthew J.; Srikant, Y. N.; Joseph, P. Jiji Thomas

doi:10.1109/cgo.2007.25

Cited by 28 publications

(24 citation statements)

References 20 publications

“…Work has also been done to study the interaction among different optimizations and between optimizations and the hardware without a full search. These range from those based on analytical models [9,17] to those that use statistical models [13] and those that utilize adaptive learning and intelligent search techniques [3,4,26,27] to find an optimal configuration. Finally, work by the SPIRAL project [2] generally uses an iterative approach to find desirable code, whereas we do not.…”

Section: Related Work (mentioning)

confidence: 99%

Program optimization space pruning for a multithreaded gpu

Ryoo

Rodrigues

Stone

et al. 2008

Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization

227

137

Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly-parallel applications for these platforms, who lack the substantial experience and knowledge needed to maximize their performance. This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and one relatively simple methodology for reducing the workload involved in the optimization process. This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.
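
The pruning idea in this abstract, keeping only configurations on a Pareto-optimal curve, can be illustrated with a minimal sketch. It assumes two hypothetical static metrics per configuration where larger is better; the abstract does not name the metrics, so `metric_a`, `metric_b`, and the sample values are illustrative only and not taken from the paper.

```python
def pareto_front(configs):
    """Return the names of configurations not dominated by any other.

    configs: list of (name, metric_a, metric_b) tuples, where both
    metrics are hypothetical scores and larger is better.
    A configuration is dominated when another one is at least as good
    on both metrics and strictly better on at least one.
    """
    front = []
    for name, a, b in configs:
        dominated = any(
            (a2 >= a and b2 >= b) and (a2 > a or b2 > b)
            for _, a2, b2 in configs
        )
        if not dominated:
            front.append(name)
    return front


# Made-up metric values for five candidate configurations.
candidates = [
    ("A", 1.0, 4.0),
    ("B", 2.0, 3.0),
    ("C", 1.5, 2.0),
    ("D", 3.0, 1.0),
    ("E", 0.5, 0.5),
]
survivors = pareto_front(candidates)
```

Only the surviving configurations would then need to be compiled and timed, which is where a reduction in search effort of the kind the abstract reports would come from.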

“…Vaswani et al [VTSJ07] build regression models that relate a benchmark's performance to micro-architectural parameters, compiler optimization flags, and associated compiler optimization heuristic parameters (for instance maximum loop unrolling). They use these models to (a) predict performance at arbitrary compiler and micro-architecture settings, (b) identify micro-architectural features that interact (both beneficially and detrimentally) with compiler optimization settings, and finally (c) find optimal settings for a particular program.…”

Section: Modeling Micro-architecture Parameters (mentioning)

confidence: 99%

Measuring empirical computational complexity

Goldsmith

Aiken

Wilkerson

2007

Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the F

117

We propose a method for describing the asymptotic behavior of programs in practice by measuring their empirical computational complexity. Our method involves running a program on workloads spanning several orders of magnitude in size, measuring their performance, and fitting these observations to a model that predicts performance as a function of workload size. Comparing these models to the programmer's expectations or to theoretical asymptotic bounds can reveal performance bugs or confirm that a program's performance scales as expected. We develop our methodology for constructing these models of empirical complexity as we describe and evaluate two techniques. Our first technique, BB-TrendProf, constructs models that predict how many times each basic block runs as a linear (y = a + bx) or a power-law (y = ax^b) function of user-specified features of the program's workloads. To present output succinctly and focus attention on scalability-critical code, BB-TrendProf groups and ranks program locations based on these models. We demonstrate the power of BB-TrendProf compared to existing tools by running it on several large programs and reporting cases where its models show (1) an implementation of a complex algorithm scaling as expected, (2) two complex algorithms beating their worst-case theoretical complexity bounds when run on realistic inputs, and (3) a performance bug. Our second technique, CF-TrendProf, models performance of loops and functions both per-function-invocation and per-workload. It improves upon the precision of BB-TrendProf's models by using control flow to generate candidates from a richer family of models and a novel model selection criteria to select among these candidates. We show that CF-TrendProf's improvements to model generation and selection allow it to correctly characterize or closely approximate the empirical scalability of several well-known algorithms and data structures and to diagnose several synthetic, but realistic, scalability problems without observing an egregiously expensive workload. We also show that CF-TrendProf deals with multiple workload features better than BB-TrendProf. We qualitatively compare the output of BB-TrendProf and CF-TrendProf and discuss their relative strengths and weaknesses.
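
The two model families named in this abstract, linear and power-law, can both be fitted with ordinary least squares. The sketch below is one common way to do so, fitting the power law in log-log space; it is an assumption about the fitting procedure, not a description of how BB-TrendProf itself works, and the function names and sample data are hypothetical.

```python
import math


def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_powerlaw(xs, ys):
    """Fit y = a * x**b via a linear fit of log(y) against log(x).

    Requires strictly positive xs and ys. Returns (a, b).
    """
    log_a, b = fit_linear([math.log(x) for x in xs],
                          [math.log(y) for y in ys])
    return math.exp(log_a), b


# Made-up data: execution counts of one basic block for workloads
# spanning several orders of magnitude, generated from y = 3 * x**2.
sizes = [10, 100, 1000, 10000]
counts = [3 * s ** 2 for s in sizes]
coeff, exponent = fit_powerlaw(sizes, counts)
```

A fitted exponent near 2 on such data would indicate quadratic scaling in the workload feature, which is the kind of comparison against expectations the abstract describes.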

“…This is similar to the characterisation of the space that we conduct in section 7. The other schemes are similar in terms of accuracy [18,27]. However, none of these papers characterise the space that they are exploring by showing the correlation of microarchitectural parameters to the best configurations.…”

Section: Related Work (mentioning)

confidence: 99%

“…• Whenever a new program is considered, a new predictor must be trained and built, meaning there is a large overhead even if the designer just wants to compile with a different optimisation level [27]. Our approach learns across programs and captures the behaviour of the architecture rather than the program itself;…”

Section: Introduction (mentioning)

confidence: 99%