Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model

Kim, Minjang; Kumar, Pranith; Kim, Hyesoon; Brett, Bevin

doi:10.1109/ipdps.2012.128

Cited by 19 publications

(4 citation statements)

References 21 publications


“…Kim et al and Wang et al modeled local memory bandwidth for multi-core processors [21,43]. Eklov et al characterized the performance impact of memory contention [44].…”

Section: Related Work (mentioning)

confidence: 99%

Predicting the memory bandwidth and optimal core allocations for multi-threaded applications on large-scale NUMA machines

Wang, Davidson, Soffa

2016

2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)


Modern NUMA platforms offer large numbers of cores to boost performance through parallelism and multi-threading. However, because performance scalability is limited by available memory bandwidth, the strategy of allocating all cores can result in degraded performance. Consequently, accurately predicting optimal (best performing) core allocations, and executing applications with these allocations, are crucial for achieving the best performance. Previous research focused on the prediction of optimal numbers of cores. However, in this paper, we show that, because of the asymmetric NUMA memory configuration and the asymmetric application memory behavior, optimal core allocations are not merely optimal numbers of cores. Additionally, previous studies do not adequately consider NUMA memory resources, which further limits their ability to accurately predict optimal core allocations. In this paper, we present a model, NuCore, which predicts both memory bandwidth usage and optimal core allocations. NuCore considers various memory resources and NUMA asymmetry, and employs Integer Programming to achieve high accuracy and low overhead. Experimental results from real NUMA machines show that the core allocations predicted by NuCore provide 1.27x average speedup over using all cores with only 75.6% of cores allocated. NuCore also provides 1.18x and 1.21x average speedups over two state-of-the-art techniques. Our results also show that NuCore faithfully models NUMA memory systems and predicts memory bandwidth usage with only 10% average error.


“…Kim et al propose an approach that predicts potential speedup from sequential execution [5]. Theoretical analysis of speedup of workloads on modern symmetric and asymmetric is provided in [33].…”

Section: Related Work (mentioning)

confidence: 99%

“…For example, a speedup model is usually used to quantify the benefits introduced by parallel computing in terms of execution time [5]. Higher concurrency levels, however, affect power dissipation (P) because not only additional computing units are activated but also the power dissipation of common components on a chip will be shared by more cores.…”

Section: Introduction (mentioning)

confidence: 99%

Application configuration selection for energy-efficient execution on multicore systems

Wang, Luo, Shi, et al.

2016

Journal of Parallel and Distributed Computing


We present a hybrid method to achieve an energy efficiency configuration. Our method utilizes concurrency levels, thread allocation, and DVFS settings. We propose a model to capture the relationship between C, P, and T in detail. We apply an analytical speedup model to predict an optimal/near-optimal configuration.

Abstract: Modern computer systems are designed to balance performance and energy consumption. Several run-time factors, such as concurrency levels, thread mapping strategies, and dynamic voltage and frequency scaling (DVFS) should be considered in order to achieve optimal energy efficiency for a workload. Selecting appropriate run-time factors, however, is one of the most challenging tasks because the run-time factors are architecture-specific and workload-specific. While most existing works concentrate on either static analysis of the workload or run-time prediction results, in this paper, we present a hybrid two-step method that utilizes concurrency levels and DVFS settings to achieve the energy efficiency configuration for a workload. The experimental results based on a Xeon E5620 server with NPB and PARSEC benchmark suites show that the model is able to predict the energy efficient configuration accurately. On average, an additional 10% EDP (Energy Delay Product) saving is obtained by using run-time DVFS for the entire system. An offline optimal solution is used to compare with the proposed scheme. The experimental results show that the average extra EDP saved by the optimal solution is within 5% on selective parallel benchmarks.


“…The work by [10] is an early attempt to build a performance prediction model for a given CUDA program, whereas our prediction model is based on a sequential program. The work by [9] estimates potential speed-up using an annotated serial program. However, unlike ours, this work does not consider the data marshaling cost while calculating the speed-ups and the approach does not deal with meeting a target speed-up.…”

Section: Related Work (mentioning)

confidence: 99%

Execution profile driven speedup estimation for porting sequential code to GPU

Sarkar, Mitra

2014

Proceedings of the 7th ACM India Computing Conference


Parallelization of an existing sequential application to achieve a good speed-up on a data-parallel infrastructure is a quite difficult and time-consuming effort. One of the important steps towards this is to assess whether the existing application in its current form can be parallelized to get the desired speedup. In this paper, we propose a method of analyzing an existing sequential source code that contains data-parallel loops, and give a reasonably accurate prediction of the extent of speedup possible from this algorithm. The proposed method performs static and dynamic analysis of the sequential source code to determine the time required by various portions of the code, including the data-parallel portions. Subsequently, it uses a set of novel invariants to calculate various bottlenecks that exist if the program is to be transferred to a GPGPU platform and predicts the extent of parallelization necessary by the GPU in order to achieve the desired end-to-end speedup. Our approach does not require creation of GPU code skeletons of the data parallel portions in the sequential code, thereby reducing the performance prediction effort. We observed a reasonably accurate speedup prediction when we tested our approach on multiple well-known Rodinia benchmark applications, a popular matrix multiplication program and a fast Walsh transform program.
