Joint local and global hardware adaptations for energy

Sasanka, Ruchira; Hughes, Christopher J.; Adve, Sarita V.

doi:10.1145/635506.605413

Cited by 18 publications

(30 citation statements)

References 31 publications


“…This technique can be used to upsize an adaptive-issue queue. 12 A larger instruction window permits the stall time of instructions waiting in the window to be overlapped with the execution of additional ready instructions in the larger window. However, if this overlap time is not sufficiently large, upsizing the queue will provide little performance benefit.…”

Section: Feedback and Control System

mentioning

confidence: 99%

Dynamically tuning processor resources with adaptive processing

Albonesi

Balasubramonian

Dropsho

et al. 2003

Computer

131


“…1 Other work aims to interpret parallelism through fetch 2 and commit attribution, 3,4 and at least one combines attribution with some dependence information. 5 Tune and colleagues first used a dependence graph to compute the cost of individual instructions in a simulator. 6 None of these methodologies has been used to explicitly measure interactions, however, which is our focus.…”

Section: Related Work

mentioning

confidence: 99%

Interaction cost: for when event counts just don't add up

Fields

Bodík

Hill

et al. 2004

IEEE Micro


Most performance analysis tasks boil down to finding bottlenecks. In the context of this article, a bottleneck is any event (for example, branch mispredict, window stall, or arithmetic-logic unit (ALU) operation) that limits performance. Bottleneck analysis is critical to an architect's work, whether the goal is tuning processors for energy efficiency, improving the effectiveness of optimizations, or designing a more balanced processor.

Yet, despite its importance, bottleneck analysis methodology has lagged behind processor technology. Although microarchitects have successfully converted the growing supplies of transistors into increased performance, the resulting complexity has made bottleneck analysis much more challenging. Instructions are re-ordered and executed in parallel; processor events such as store-buffer stalls and branch target buffer misses occur simultaneously; and speculative computation is occasionally squashed with control redirected. This complexity and fine-grained parallelism make it difficult to identify what the bottlenecks actually are.

For example, when two cache misses occur in the same cycle, we might need to optimize both to increase performance. What if a multiply and window stall occur simultaneously? Is one of them the true bottleneck, or do we again need to optimize both? In general, dozens of events might occur simultaneously in a single cycle on a modern machine. Do we therefore need to optimize all of them to remove the cycle? Or can we get by with just a subset?

The key to answering these questions is understanding how bottlenecks interact in a parallel system. A better understanding of interactions could help us improve the performance of not only microarchitectures but also coarse-grained parallel systems, such as chip multiprocessors. Furthermore, studying interactions helps attack the power wall by making the machine more balanced. In other words, interactions help us find the least power-hungry way to achieve a target performance.

This article presents the insights for our analysis methodology, which we discuss more extensively elsewhere. 1,2 The "Related Work" sidebar notes other analysis methodologies for out-of-order processors.

A new bottleneck analysis methodology

To illustrate the power of interaction-based bottleneck analysis, we show how it can help


“…Sasanka, Hughes, and Adve concentrated on saving energy for a sequential application. 4 The targeted architecture has a single processor with reconfigurable components (such as the number and type of function units), and its supply voltage can be changed. For each manually identified scenario, the most energy-efficient architecture configuration that still meets the timing constraints is selected.…”

mentioning

confidence: 99%

Application Scenarios in Streaming-Oriented Embedded-System Design

Gheorghita

Basten

Corporaal

2008

IEEE Des. Test. Comput.


Embedded systems usually consist of processors that execute domain-specific applications. Much of their functionality is implemented in software, which runs on one or multiple processors, leaving only the high-performance functions implemented in hardware. Most typical embedded systems (such as TVs, cellular phones, and MP3 players) run multimedia or telecom applications that exhibit dynamic behavior, and their execution costs (such as the number of cycles and energy) depend on the input data. Moreover, these applications are often implemented as a main loop, called the loop of interest, that is executed over and over again, reading, processing, and writing out individual stream objects (see Figure 1). A stream object could be a bit belonging to a compressed bitstream representing a coded video clip, or it could be a macroblock, video frame, audio sample, or network package. Usually, these applications must deliver a given throughput (number of objects per second), which imposes a time constraint on each loop iteration.

The read part of the loop of interest takes a stream object from the input stream and separates it into a header and the object's data. The processing part consists of several kernels. The write part sends the processed data to output devices, such as a screen or speakers, and saves the application's internal state for further use; for example, in a video decoder, the previous decoded frame might be necessary to decode the current frame. The dynamism existing in modern applications leads to the use of different kernels for each stream object, depending on the object type. The actions executed in a particular loop iteration form the application's internal operation mode.

In this article, we describe a method that provides a systematic way of detecting and exploiting, at design time and runtime, the different internal operation modes. The fact that applications have different internal operation modes has not been fully exploited in embedded-system design thus far. Our approach combines a static analysis and profiling of the system at design time with information collected at runtime about the system's environment. By knowing a system's possible operation modes and information about their resource consumption at design time, it is possible to make specific and aggressive design decisions for each operation mode at different design steps.

To avoid complexity problems, we cluster the operation modes that are closely related to one another from a resource consumption perspective in application scenarios, distinguishing the truly different operation modes via different scenarios. It is then possible to derive a faster or lower-energy implementation (for example, by using different source-code optimizations per scenario) or a better estimation of required resources (such as the number of computation cycles or bandwidth). This leads to a smaller, less expensive, and more energy-efficient system that can deliver the required performance.

A design method for handling increasingly dynamic real-time embeddeds...

