Quantifying acceleration: Power/performance trade-offs of application kernels in hardware

Reagen, Brandon; Shao, Yakun Sophia; Wei, Gu-Yeon; Brooks, David

doi:10.1109/islped.2013.6629329

Cited by 15 publications

(6 citation statements)

References 10 publications


“…Moreover, in multicore systems where dozens of cores are involved, it is difficult to find configurations that maximize throughput with a power constraint [22]. In [23] Reagen et al explored the design space for lowpower accelerators used in SoC designs to obtain power and performance trade-off. Qadri et al [10] proposed a scheme called E-FLORE to optimize multicore architectures.…”

Section: Related Work (mentioning)

confidence: 99%

Estimation of distribution-based multiobjective design space exploration for energy and throughput-optimized MPSoCs

Murad, Hussain, Ahmad et al. 2020. Turk J Elec Eng & Comp Sci


Modern multicore architectures comprise a large set of components and parameters that must be matched to achieve the best balance between power consumption and throughput performance for a particular application domain. The exploration of the design space for the best power-throughput trade-off is a combinatorial optimization problem with a large number of combinations, and, in general, it is prohibitively expensive to explore exhaustively. Fortunately, evolutionary algorithms (EAs) have the potential to solve this problem efficiently with reasonable computational complexity. In this paper, we consider a multiobjective design space exploration (DSE) problem with two conflicting objectives. The first objective corresponds to power consumption minimization, while the second relates to throughput maximization. We approach this problem by employing the estimation of distribution algorithm (EDA), which belongs to the family of EAs. The proposed EDA-based DSE (EDA-DSE) scheme efficiently selects the design parameters (i.e. cache size, number of cores, and operating frequency) with an efficient power-throughput ratio. The proposed scheme is verified using cycle-accurate simulations over a set of benchmarks, and the simulation results show a significant reduction in energy-delay product for all benchmark applications when compared to the default baseline configuration and a genetic algorithm.



“…Specifying more memory banks and more functional units yield better performance with increased power costs [24], but take the data transmission bandwidth restraint into account, additional partitioning becomes wasteful for there is no enough bandwidth to feed data efficiently. In the proposed accelerator, we use six PEs, and four of them have the same structure to complete a four-channel parallel operation and we divide the scratchpad into 32 banks, with the total size of 2 MByte.…”

Section: Algorithm Speedup Ratio (mentioning)

confidence: 99%

Design and Application Space Exploration of a Domain-Specific Accelerator System

Feng, Wang et al. 2018. Electronics


Domain-specific accelerators are a reaction adapting to device scaling and the dark silicon era. This paper describes a radar signal processing oriented configurable accelerator and the application space exploration of the system. The system is built around accelerator engines and general-purpose processors (GPPs) that make it suitable for intensive computing kernel acceleration and complex control tasks. It is geared toward high-performance radar digital signal processing; we characterize the applications and find that each of them contains a series of serializable kernels. Taking advantage of this discovery, we design an algorithm pool that shares the same computation resource and memory resource, and each algorithm is size reconfigurable. On the other hand, shared on-chip addressable scratchpad memory eliminates unnecessary explicit data copy between accelerators. Performance of the system is evaluated from measurements performed both on an FPGA SoC test chip and on a prototype chip fabricated by CMOS 40 nm technology. The experimental results show that for different algorithms, the proposed system achieves 1.9× to 10.1× performance gain compared with a state-of-the-art TI DSP chip. In order to characterize the application of the system, a complex real-life task is adopted, and the results show that it can obtain high throughput and desirable precision.


“…Accelerators have been shown to provide orders of magnitude more energy efficiency than general-purpose computing. Despite their narrow domain, the lack of predetermined structures and the ability to consider radically different techniques leads to a large design space [15].…”

Section: Hardware Accelerators (mentioning)

confidence: 99%

A case for efficient accelerator design space exploration via Bayesian optimization

Reagen, Hernández-Lobato, Adolf et al. 2017. 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)

Self Cite


As the DNN accelerator space consists of upwards of 14 free parameters, a thorough exploration is infeasible. The second problem is that evaluating the energy and accuracy of a single design point is costly as it involves both training a DNN and simulating hardware. This expense limits the number of points that can be explored, requiring good designs to be found with minimal point evaluations. Finally, the effects of the parameters on the objectives (energy and accuracy), as well as between the parameters themselves, are intertwined, creating an unintuitive and rough optimization landscape. Consider optimizing just two parameters: the bitwidth of fixed-point datatypes (a well known energy efficiency optimization [16]), and the L2 regularization parameter, used to prevent overfitting during training. While regularization is only intended to directly affect the accuracy objective, it also indirectly affects the energy objective: regularizing weights compresses their dynamic range, impacting the number of bits required to store the weights. Likewise, using a fixed-point datatype does not directly affect the regularization parameter, but it does reduce the resolution of a model's weights and the overall model accuracy, which both end up affecting the choice of regularization strength. Nearly all the design parameters in Table I impact both objectives, and the complex interactions between them are difficult to capture using traditional modeling techniques used to reason about hardware design spaces [11], [12].

