Performance modeling for systematic performance tuning

Hoefler, Torsten; Gropp, William; Kramer, William; Snir, Marc

doi:10.1145/2063348.2063356

Cited by 69 publications

(37 citation statements)

References 27 publications


“…Another metric analyzed was the number of floating-point operations in each invocation of the time-intensive kernels as a function of the number of grid points per process. The results in Table 3 show that the number of floating-point operations per kernel invocation is proportional to the number of grid points (rightmost column), which is again consistent with [16]. All kernels but the conjugate-gradient kernel (ks_congrad) have a constant number of invocations, whereas the number of times the conjugate-gradient kernel is invoked depends for this particular input matrix on the number of grid points (middle column).…”

supporting

confidence: 72%

“…Since there is no performance variation in these requirements measurements, the quality of the automated fit (and thus the confidence) is high, resulting in a model that matches the handcrafted counterpart exactly. Our method also found the number of messages in each kernel to be invariant regardless of the lattice size, which further matches the models in [16]. Another metric analyzed was the number of floating-point operations in each invocation of the time-intensive kernels as a function of the number of grid points per process.…”

supporting

confidence: 72%

“…via parallel simulations of the SU(3) lattice gauge theory on a four-dimensional lattice. In earlier work [16], analytical models were manually created that describe the behavior of MILC/su3_rmd, one of the MILC codes, by characterizing its most important components with respect to a number of parameters. We now show that our modeling tool chain allows similar models to be derived automatically.…”

Section: MILC

mentioning

confidence: 99%

“…Given that MILC is known to scale well, we refined the default setting for I by adding , as suggested in Section 2.3. We collected five data points for each function at the scales P3 = 2^7, 2^8, 2^9, 2^10, 2^11, 2^12, 2^13, 2^14, 2^15, 2^16 with a local lattice size of V = 9^4 per process. All model functions generated for Juqueen are shown in Table 2.…”

Section: MILC

mentioning

confidence: 99%


Using automated performance modeling to find scalability bugs in complex codes

Calotoiu

Hoefler

Poke

et al. 2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Self Cite

111


Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made, a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
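The idea of generating an empirical performance model from measurements can be illustrated with a minimal sketch. It assumes a single-term model of the form t(p) = c0 + c1 * p^i * log2(p)^j and a small, hand-picked set of candidate exponents; the function name `fit_single_term`, the candidate sets, and the synthetic data are illustrative assumptions, not the tool chain described in the abstract.

```python
import math

def fit_single_term(ps, ts, exponents=(0, 0.5, 1, 2), log_exponents=(0, 1)):
    """Fit t ~ c0 + c1 * p**i * log2(p)**j for each candidate (i, j).

    Returns (c0, c1, i, j, rss) for the candidate with the smallest
    residual sum of squares. The candidate sets are assumptions chosen
    for illustration only.
    """
    n = len(ps)
    best = None
    for i in exponents:
        for j in log_exponents:
            if i == 0 and j == 0:
                continue  # a pure constant is already covered by c0
            xs = [p ** i * math.log2(p) ** j for p in ps]
            mean_x = sum(xs) / n
            mean_t = sum(ts) / n
            sxx = sum((x - mean_x) ** 2 for x in xs)
            if sxx == 0:
                continue  # degenerate candidate, cannot fit a slope
            # Ordinary least squares for one predictor, in closed form.
            c1 = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts)) / sxx
            c0 = mean_t - c1 * mean_x
            rss = sum((c0 + c1 * x - t) ** 2 for x, t in zip(xs, ts))
            if best is None or rss < best[4]:
                best = (c0, c1, i, j, rss)
    return best

# Synthetic measurements at five process counts (made-up data).
ps = [2 ** k for k in range(7, 12)]
ts = [3.0 + 2.0 * p * math.log2(p) for p in ps]
c0, c1, i, j, rss = fit_single_term(ps, ts)
```

A term whose fitted growth exceeds what is expected for a kernel (for example, linear growth in p where a constant was expected) is the kind of signal such a fit is meant to surface.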

“…Partial execution [35] can improve those techniques. Other studies provide advice for modeling the general performance [21] and scalability [19] of parallel applications. In addition, many application-specific studies exist but cannot be generalized [7,24].…”

Section: Related Work

mentioning

confidence: 99%

Automatic complexity analysis of explicitly parallel programs

Hoefler

Kwasniewski

2014

Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures

Self Cite


The doubling of cores every two years requires programmers to expose maximum parallelism. Applications that are developed on today's machines will often be required to run on many more cores. Thus, it is necessary to understand how much parallelism codes can expose. The work and depth model provides a convenient mental framework to assess the required work and the maximum parallelism of algorithms and their parallel efficiency. We propose an automatic analysis to extract work and depth from a source-code. We do this by statically counting the number of loop iterations depending on the set of input parameters. The resulting expression can be used to assess work and depth with regards to the program inputs. Our method supports the large class of practically relevant loops with affine update functions and generates additional parameters for other expressions. We demonstrate how this method can be used to determine work and depth of several real-world applications. Our technique enables us to prove if the theoretically maximum parallelism is exposed in a practical implementation of a problem. This will be most important for future-proof software development.
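As a rough illustration of the work and depth model applied to counted loops, the sketch below computes both quantities for a small loop nest. The `Stmt` and `Loop` classes, the `parallel` flag, and the choice to ignore any scheduling overhead of parallel loops are assumptions made for this example; it evaluates concrete bounds rather than deriving symbolic expressions, so it is not the static analysis the abstract describes.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Stmt:
    cost: int = 1  # unit-cost statement (assumption)

@dataclass
class Loop:
    start: int
    stop: int
    step: int
    body: Union["Loop", Stmt]
    parallel: bool = False  # True: iterations are independent

def trip_count(start, stop, step):
    """Iterations of `for (i = start; i < stop; i += step)` with step > 0."""
    if step <= 0:
        raise ValueError("step must be positive")
    return max(0, -(-(stop - start) // step))  # ceiling division

def work(node):
    """Total number of unit operations executed."""
    if isinstance(node, Stmt):
        return node.cost
    return trip_count(node.start, node.stop, node.step) * work(node.body)

def depth(node):
    """Length of the longest dependent chain of unit operations."""
    if isinstance(node, Stmt):
        return node.cost
    n = trip_count(node.start, node.stop, node.step)
    if n == 0:
        return 0
    body_depth = depth(node.body)
    # Parallel iterations overlap; sequential iterations add up.
    return body_depth if node.parallel else n * body_depth

# Outer parallel loop over 8 rows, inner sequential loop with stride 2.
nest = Loop(0, 8, 1, Loop(0, 8, 2, Stmt()), parallel=True)
w, d = work(nest), depth(nest)
```

The ratio `w / d` gives the average available parallelism of the nest under these assumptions.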
