CPR: Composable performance regression for scalable multiprocessor models

Lee, Benjamin C.; Collins, Jamison D.; Wang, Hong; Brooks, David

doi:10.1109/micro.2008.4771797

Cited by 71 publications

(33 citation statements)

References 11 publications


“…Performance models have been increasingly used for application tuning over complex or large scale systems [3], [12], [17]. These models target performance over a cluster or a heterogeneous platform, with a focus on the modeling and optimization of communication and scheduling among nodes.…”

Section: Related Work (mentioning)

confidence: 99%

Improving GPU Performance Prediction with Data Transfer Modeling

Boyer

Meng

Kumaran

2013

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum


Accelerators such as graphics processors (GPUs) have become increasingly popular for high performance scientific computing. Often, much effort is invested in creating and optimizing GPU code without any guaranteed performance benefit. To reduce this risk, performance models can be used to project a kernel's GPU performance potential before it is ported. However, raw GPU execution time is not the only consideration. The overhead of transferring data between the CPU and the GPU is also an important factor; for some applications, this overhead may even erase the performance benefits of GPU acceleration. To address this challenge, we propose a GPU performance modeling framework that predicts both kernel execution time and data transfer time. Our extensions to an existing GPU performance model include a data usage analyzer for a sequence of GPU kernels, to determine the amount of data that needs to be transferred, and a performance model of the PCIe bus, to determine how long the data transfer will take. We have tested our framework using a set of applications running on a production machine at Argonne National Laboratory. On average, our model predicts the data transfer overhead with an error of only 8%, and the inclusion of data transfer time reduces the error in the predicted GPU speedup from 255% to 9%.
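
The abstract's point that transfer overhead can erase a GPU speedup can be illustrated with a minimal sketch. It assumes a simple linear transfer-cost model (fixed latency plus bytes divided by bandwidth); the abstract does not specify the form of the PCIe model, and the function names and all numbers below are hypothetical, not taken from the paper.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed linear bus model: fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def predicted_speedup(cpu_time_s, kernel_time_s, transfer_time_s=0.0):
    """Speedup of the GPU version over the CPU version.

    With transfer_time_s left at 0.0 this is the kernel-only estimate.
    """
    return cpu_time_s / (kernel_time_s + transfer_time_s)


# Hypothetical workload: 512 MiB moved over a bus with 10 us latency
# and 6 GB/s effective bandwidth; 1.0 s on the CPU, 0.05 s GPU kernel.
t_xfer = transfer_time(512 * 1024**2, latency_s=10e-6, bandwidth_bytes_per_s=6e9)
kernel_only = predicted_speedup(1.0, 0.05)
with_transfer = predicted_speedup(1.0, 0.05, t_xfer)

print(f"transfer time:          {t_xfer:.4f} s")
print(f"speedup (kernel only):  {kernel_only:.1f}x")
print(f"speedup (with transfer): {with_transfer:.1f}x")
```

Under these made-up inputs the transfer takes longer than the kernel itself, so the kernel-only estimate substantially overstates the achievable speedup.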

“…Models for microprocessor cores and mechanisms to account for interactions would provide a more thorough assessment of multiprocessor performance and power. Building on uniprocessor core models, a potential multiprocessor framework might use a combination of uniprocessor, contention, and penalty models [Lee et al 2008].…”

Section: Discussion (mentioning)

confidence: 99%

Applied inference

Lee

Brooks

2010

ACM Trans. Archit. Code Optim.

Self Cite

We propose and apply a new simulation paradigm for microarchitectural design evaluation and optimization. This paradigm enables more comprehensive design studies by combining spatial sampling and statistical inference. Specifically, this paradigm (1) defines a large, comprehensive design space, (2) samples points from the space for simulation, and (3) constructs regression models based on sparse simulations. This approach greatly improves the computational efficiency of microarchitectural simulation and enables new capabilities in design space exploration. We illustrate new capabilities in three case studies for a large design space of approximately 260,000 points: (1) Pareto frontier, (2) pipeline depth, and (3) multiprocessor heterogeneity analyses. In particular, regression models are exhaustively evaluated to identify Pareto optimal designs that maximize performance for given power budgets. These models enable pipeline depth studies in which all parameters vary simultaneously with depth, thereby more effectively revealing interactions with non-depth parameters. Heterogeneity analysis combines regression-based optimization with clustering heuristics to identify efficient design compromises between similar optimal architectures. These compromises are potential core designs in a heterogeneous multicore architecture. Increasing heterogeneity can improve bips^3/W efficiency by as much as 2.4x, a theoretical upper bound on heterogeneity benefits that neglects contention between shared resources as well as design complexity. Collectively these studies demonstrate regression models' ability to expose trends and identify optima in diverse design regions, motivating the application of such models in statistical inference for more effective use of modern simulator infrastructure.
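
The Pareto step mentioned in the abstract (evaluate model predictions for every design, then keep the designs that maximize performance for a given power budget) can be sketched as follows. This is only an illustration of the general idea: the design names and the (power, performance) predictions are hypothetical stand-ins for regression-model output, and the helper functions are not from the paper.

```python
def pareto_frontier(designs):
    """Keep designs that no other design beats on both power and performance.

    Each design is a (name, predicted_power, predicted_performance) tuple.
    Lower power is better; higher performance is better.
    """
    frontier = []
    best_perf = float("-inf")
    # Visit designs from lowest to highest power; among equal-power
    # designs, visit the highest-performing one first.
    for design in sorted(designs, key=lambda d: (d[1], -d[2])):
        if design[2] > best_perf:
            frontier.append(design)
            best_perf = design[2]
    return frontier


def best_under_budget(frontier, power_budget):
    """Highest-performance frontier design within the power budget, or None."""
    feasible = [d for d in frontier if d[1] <= power_budget]
    return max(feasible, key=lambda d: d[2]) if feasible else None


# Hypothetical predictions: (name, power in watts, performance in bips).
predictions = [
    ("A", 10.0, 1.0),
    ("B", 15.0, 1.4),
    ("C", 15.0, 1.2),  # same power as B, lower performance
    ("D", 20.0, 1.3),  # more power than B, lower performance
    ("E", 25.0, 2.0),
]

frontier = pareto_frontier(predictions)
print([d[0] for d in frontier])
print(best_under_budget(frontier, power_budget=20.0))
```

Because the predictions come from a cheap model in this workflow, the frontier can be computed over the entire design space instead of only the simulated samples.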

“…The SDCs serve as input to a cache contention model that estimates the additional number of conflict misses due to cache sharing in the LLC. There exist several cache contention models [Chandra et al 2005;Eklöv et al 2011;Lee et al 2008]. We use the Frequency of Access (FOA) model proposed by Chandra et al [2005] because it is a fairly simple model and we found it accurate enough for our needs.…”

Section: Iterative Multicore Performance Estimation (mentioning)

confidence: 99%

Understanding fundamental design choices in single-ISA heterogeneous multicore architectures

Craeynest

Eeckhout

2013

ACM Trans. Archit. Code Optim.

Single-ISA heterogeneous multicore processors have gained substantial interest over the past few years because of their power efficiency, as they offer the potential for high overall chip throughput within a given power budget. Prior work in heterogeneous architectures has mainly focused on how heterogeneity can improve overall system throughput. To what extent heterogeneity affects per-program performance has remained largely unanswered. In this article, we aim at understanding how heterogeneity affects both chip throughput and per-program performance; how heterogeneous architectures compare to homogeneous architectures under both performance metrics; and how fundamental design choices, such as core type, cache size, and off-chip bandwidth, affect performance. We use analytical modeling to explore a large space of single-ISA heterogeneous architectures. The analytical model has linear-time complexity in the number of core types and programs of interest, and offers a unique opportunity for exploring the large space of both homogeneous and heterogeneous multicore processors in limited time. Our analysis provides several interesting insights: While it is true that heterogeneity can improve system throughput, it fundamentally trades per-program performance for chip throughput; although some heterogeneous configurations yield better throughput and per-program performance than homogeneous designs, some homogeneous configurations are optimal for particular throughput versus per-program performance trade-offs. Two core types provide most of the benefits from heterogeneity and a larger number of core types does not contribute much; job-to-core mapping is both important and challenging for heterogeneous multicore processors to achieve optimum performance. Limited off-chip bandwidth does alter some of the fundamental design choices in heterogeneous multicore architectures, such as the need for large on-chip caches for achieving high throughput, and per-program performance degrading more relative to throughput under constrained off-chip bandwidth.
