Data, addresses, and instructions are compressed by maintaining only significant bytes with two or three extension bits appended to indicate the signijicant byte positions. This significance compression method is integrated into a 5-stage pipeline, with the extension bitsflowing down the pipeline to enable pipeline operations only for the signijicant bytes. Consequently registel; logic, and cache activity (and dynamic power) are substantially reduced.An initial trace-driven stud.y shows reduction in activity of approximately 30-40% for each pipeline stage. Several pipeline organizations are studied. A byte serial pipeline is the simplest implementation, but suffers a CPI (cycles per instruction) increase of 79% compared with a conventional 32-bit pipeline. Widening certain pipeline stages in order to balance processing bandwidth leads to an implementation with a CPI 24% higher than the baseline 32-bit design. Finally, full-width pipeline stages with operand gating achieve a CPI within 2-6% of the baseline 32-bit pipeline.
One of the main concerns in today's processor design is the issue logic. Instruction-level parallelism is usually favored by an out-of-order issue mechanism where instructions can issue independently of the program order. The out-of-order scheme yields the best performance but at the same time introduces important hardware costs such as an associative look-up, which might be prohibitive for wide issue processors with large instruction windows. This associative search may slow-down the clock-rate and it has an important impact on power consumption. In this work, two new issue schemes that reduce the hardware complexity of the issue logic with minimal ikapact on the average number of instructions executed per cycle are presented.
In IntroductionIt is well known that current superscalar organisations are approaching a point of dimishing returns. It is not trivial to change from a 4-way issue to an 8-way issue architecture due to its hardware complexity and its implications in the cycle time. Nevertheless, the instruction level parallelism (ILP) that an 8-way issue processor can exploit is much beyond that of a 4-way issue one. One of the solutions to this challenge is clustering. Clustering offers the advantages of the partitioned schemes where one can achieve high rates of ILP and sustain a high clock rate. A partitioned architecture tends to make the hardware simpler and its control and data paths faster. For instance, it has fewer register file ports, fewer data bus sources/destinations, and fewer alternatives for many control decisions.Current processors are partitioned into two subsystems (the integer and the floating point one). As it has been shown [14,16], the FP subsystem can be easily extended to execute simple integer and logical operations. For instance, both register files nowadays hold 64-bit values, and simple integer units (no multiplication and division) can be embedded within a small hardware cost due to today's transistor budgets. Furthermore, the hardware modifications required of the existing architectures are minimal. This work focuses on this type of clustered architecture with two clusters, one for integer calculations and another one for integer and floating-point calculations. The advantage of this architecture is that now its floating-point registers, data path and mainly, its issue logic are used 100% of the time in any application.There are two main issues concerning clustered architectures. The first one is the communication overhead between clusters. Since inter-cluster communications can easily take one or more cycles, the higher the number of communications the lower the performance will be due to the delay introduced between dependent instructions. The second issue is the workload balance. If the workload is not optimally balanced, one of the clusters might have more work than it can manage and the other might be less productive than it can be. Thus, in order to achieve the highest performance we have to balance the workload optimally and, at the same time, minimise the number of communications. The workload balance and the communication overhead depend on the technique used to distribute the program instructions between both clusters. Programs can be partitioned either at compile-time (statically) or at run-time (dynamically). The latter approach relies on a steering logic that decides in which cluster each decoded instruction will be executed. The steering logic is responsible for maximising the trade-off between communication and workload balance and therefore, it is a key issue in the design. In this work, a new steering scheme is proposed and its performance is evaluated. We show that the proposed scheme outperforms a previously proposed static approach for the same architecture [16]. Moreov...
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.