Scaling the Power Wall: A Path to Exascale

Villa, Oreste; Johnson, D.; O'Connor, Mike; Bolotin, Evgeny; Nellans, David; Luitjens, Justin; Sakharnykh, Nikolai; Wang, Peng; Micikevicius, Paulius; Scudiero, Anthony; Keckler, Stephen W.; Dally, William J.

doi:10.1109/sc.2014.73

Cited by 100 publications

(38 citation statements)

References 28 publications


“…A direct port of the SNAP mini-app using the CUDA framework by P. Wang et al exists and uses a similar parallelisation scheme to the original code [23]. In our previous work we showed that on a single node this approach did not improve the performance over our benchmark CPU despite the use of GPUs [9].…”

Section: Related Work (mentioning)

confidence: 92%

Many-Core Acceleration of a Discrete Ordinates Transport Mini-App at Extreme Scale

Deakin, McIntosh-Smith, Gaudin. 2016. Lecture Notes in Computer Science


Abstract. Time-dependent deterministic discrete ordinates transport codes are an important class of application which provide significant challenges for large, many-core systems. One such challenge is the large memory capacity needed by the solve step, which requires us to have a scalable solution in order to have enough node-level memory to store all the data. In our previous work, we demonstrated the first implementation which showed a significant performance benefit for single node solves using GPUs. In this paper we extend our work to large problems and demonstrate the scalability of our solution on two Petascale GPU-based supercomputers: Titan at Oak Ridge and Piz Daint at CSCS. Our results show that our improved node-level parallelism scheme scales just as well across large systems as previous approaches when using the tried and tested KBA domain decomposition technique. We validate our results against an improved performance model which predicts the runtime of the main 'sweep' routine when running on different hardware, including CPUs or GPUs.


“…We have used CACTI for estimating the power consumption of caches; however since our accelerator is synthesized for 22nm, we have scaled down the area and power values generated by CACTI from 32nm to 22nm. For scaling area, coefficient of 0.5 is used [9,10], whereas for scaling power, coefficients of 0.569 [16] (dynamic) and 0.8 [32] (leakage) are used. For DRAM power consumption of both template-based and the HLS accelerators, we have integrated the DRAMSim2 tool into our simulators and used the aforementioned DDR4 model for power and timing estimations.…”

Section: Power Performance and Area Results (mentioning)

confidence: 99%

A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators

Ayupov, Yesil, Özdal, et al. 2018. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.


Abstract. Graph applications have been gaining importance in the last decade due to emerging big data analytics problems such as web graphs, social networks, and biological networks. For these applications, traditional CPU and GPU architectures suffer in terms of performance and power consumption due to irregular communications, random memory accesses, and load balancing problems. It has been shown that specialized hardware accelerators can achieve much better power and energy efficiency compared to the general purpose CPUs and GPUs. In this work, we present a template-based methodology specifically targeted for hardware accelerator design of big-data graph applications. Important architectural features that are key for energy efficient execution are implemented in a common template. The proposed template-based methodology is used to design hardware accelerators for different graph applications with little effort. Compared to an application-specific high-level synthesis (HLS) methodology, we show that the proposed methodology can generate hardware accelerators with up to 18x better energy efficiency and requires less design effort.


“…Modern Central Processing Unit (CPU) performance and speed have begun to plateau over recent years due to "the power wall," thus prompting more research into multi-core and many-core systems [30]. General Purpose Graphics Processing Unit (GPGPU) programming is a programming paradigm which employs Graphics Processing Units (GPUs) to run code typically executed on the CPU in order to provide performance gain by way of parallelism and data throughput.…”

Section: The Problem Statement (mentioning)

confidence: 99%

GPU-accelerated feature tracking

Graves. 2016. 2016 IEEE National Aerospace and Electronics Conference (NAECON) and Ohio Innovation Summit (OIS)


Graves, Alex. M.S., Department of Computer Science and Engineering, Wright State University, 2016. GPU-Accelerated Feature Tracking. The motivation of this research is to prove that GPUs can provide significant speedup of long-executing image processing algorithms by way of parallelization and massive data throughput. This thesis accelerates the well-known KLT feature tracking algorithm using OpenCL and an NVidia GeForce GTX 780 GPU. KLT is a fast, efficient and accurate feature tracker but can easily suffer from low frame rates when tracking many features in an HD video sequence. This research explains how KLT could benefit from GPGPU programming and provides the corresponding OpenCL implementation. Additionally, various optimization techniques are emphasized to further boost GPU performance. The experiments conducted prove that when tracking over 500 features in an HD dataset, GPU-based KLT provides a 92% reduction in total runtime compared to a CPU-based implementation. Furthermore, the experiments demonstrate that these features are tracked while maintaining similar accuracy to the CPU results.
