Abstract-Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design. The recent emergence of extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of the program that are to be implemented as custom instructions. In this work, we describe an automatic methodology for selecting custom instructions to augment an extensible processor, in order to maximize its efficiency for a given application program. We demonstrate that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow. Our methodology features cost functions to guide the custom instruction selection process, as well as static and dynamic pruning techniques to eliminate inferior parts of the design space from consideration. Furthermore, we employ a two-stage process, wherein a limited number of promising instruction candidates are first short-listed using efficient selection criteria, and then evaluated in more detail through cycle-accurate instruction set simulation and synthesis of the corresponding hardware, to identify the custom instruction combinations that result in the highest program speedup, or that maximize speedup under a given area constraint. We have evaluated the proposed techniques using a state-of-the-art extensible processor platform, in the context of a commercial design flow. Experiments with several benchmark programs indicate that custom processors synthesized using automatic custom instruction selection can yield large improvements in performance (up to 5.4×, 3.4× on average), energy (up to 4.5×, 3.2× on average), and energy-delay product (up to 24.2×, 12.6× on average), while speeding up the design process significantly.
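The two-stage flow described in the abstract lends itself to a simple greedy illustration. The sketch below is a minimal, hypothetical rendering in Python: candidate instruction patterns are ranked by an assumed cost function (estimated cycle savings weighted against area), statically infeasible candidates are pruned, and a short-list of the top candidates is produced for detailed evaluation. The cost function, thresholds, and data structures are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of stage 1 of a two-stage custom-instruction selection
# flow. All names and the cost function are illustrative assumptions;
# the paper's actual cost functions and pruning rules are not shown.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cycles_saved: int   # estimated cycles saved per invocation
    exec_count: int     # profiled execution frequency
    area_cost: float    # estimated area of the custom functional unit

def cost(c: Candidate) -> float:
    # Assumed cost function: total estimated cycle savings per unit area.
    return (c.cycles_saved * c.exec_count) / max(c.area_cost, 1e-9)

def shortlist(candidates, k, area_budget):
    """Stage 1: statically prune and rank candidates, keep the top k."""
    # Static pruning: drop candidates that cannot fit within the budget.
    feasible = [c for c in candidates if c.area_cost <= area_budget]
    return sorted(feasible, key=cost, reverse=True)[:k]

# Stage 2 (not shown): each short-listed combination would then be
# evaluated via cycle-accurate ISS and hardware synthesis.
candidates = [
    Candidate("mac4", cycles_saved=6, exec_count=120_000, area_cost=1.8),
    Candidate("sad8", cycles_saved=11, exec_count=40_000, area_cost=3.2),
    Candidate("bitrev", cycles_saved=3, exec_count=5_000, area_cost=0.4),
]
for c in shortlist(candidates, k=2, area_budget=4.0):
    print(c.name, round(cost(c), 1))
```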
This work concerns the design of on-chip error correction systems for multilevel code-storage NOR flash and data-storage NAND flash memories. The concept of trellis coded modulation (TCM) is used to design an on-chip error correction system for NOR flash. This is motivated by the non-trivial modulation process in multilevel memory storage and the effectiveness of TCM in integrating coding with modulation to provide better performance at relatively short block lengths. The effectiveness of TCM-based systems, in terms of error-correcting performance, coding redundancy, silicon cost, and operational latency, is successfully demonstrated. Meanwhile, the potential of using strong Bose-Chaudhuri-Hocquenghem (BCH) codes to improve multilevel data-storage NAND flash memory capacity is investigated. Current multilevel flash memories store 2 bits in each cell. Further storage capacity may be achieved by increasing the number of storage levels per cell, which nevertheless correspondingly degrades raw storage reliability. It is demonstrated that strong BCH codes can effectively enable the use of a larger number of storage levels per cell and hence improve the effective NAND flash memory storage capacity by up to 59.1% without degrading cell programming time. Furthermore, a scheme that leverages strong BCH codes to improve memory defect tolerance, at the cost of increased NAND flash cell programming time, is proposed.
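The effective-capacity argument can be made concrete with a back-of-the-envelope calculation. For a t-error-correcting binary BCH code with block length n = 2^m - 1, the parity overhead is at most m·t bits, so the code rate is roughly (n - m·t)/n, and effective capacity per cell is raw bits per cell times that rate. The sketch below works through this arithmetic; the chosen (m, t, levels) values are illustrative assumptions, not the configurations or the 59.1% result reported in the paper.

```python
# Back-of-the-envelope effective-capacity estimate for BCH-protected
# multilevel flash. The (m, t, levels) values are illustrative
# assumptions, not the configurations used in the paper.
import math

def bch_rate(m: int, t: int) -> float:
    """Approximate rate of a t-error-correcting binary BCH code
    with block length n = 2**m - 1 (parity <= m*t bits)."""
    n = 2**m - 1
    return (n - m * t) / n

def effective_bits_per_cell(levels: int, m: int, t: int) -> float:
    # A cell with L levels stores log2(L) raw bits; coding redundancy
    # scales that down by the code rate.
    return math.log2(levels) * bch_rate(m, t)

# Hypothetical comparison: 4 levels/cell with a weak code vs.
# 8 levels/cell protected by a stronger (higher-t) BCH code.
base = effective_bits_per_cell(levels=4, m=13, t=4)
strong = effective_bits_per_cell(levels=8, m=13, t=30)
print(f"baseline: {base:.3f} bits/cell, strong BCH: {strong:.3f} bits/cell")
print(f"capacity gain: {100 * (strong / base - 1):.1f}%")
```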
Although the conventional adaptive Viterbi algorithm offers reduced computational complexity and great power-saving potential, its implementations contain a global best survivor path metric search operation that prevents them from being directly realized in a high-throughput state-parallel decoder. This limitation also incurs power and silicon area overhead. This paper presents a modified adaptive Viterbi algorithm, referred to as the relaxed adaptive Viterbi algorithm, that completely eliminates the global best survivor path metric search operation. A state-parallel decoder VLSI architecture has been developed to implement the relaxed adaptive Viterbi algorithm. Using convolutional code decoding as a test vehicle, we demonstrate that state-parallel relaxed adaptive Viterbi decoders, compared with their Viterbi counterparts, can achieve significant power savings and a modest silicon area reduction, while maintaining almost the same decoding performance and very high throughput.

Index Terms-Adaptive Viterbi algorithm, low power, T-algorithm, very large-scale integration (VLSI) architecture.
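For context, the operation that the relaxed algorithm removes can be sketched as follows. In conventional adaptive Viterbi (T-algorithm) decoding, after each add-compare-select step the decoder finds the global best survivor metric and disables every state whose metric exceeds it by more than a threshold T; that global minimum over all states is the serial bottleneck in a state-parallel design. The Python sketch below illustrates only this conventional pruning step; how the relaxed algorithm replaces the global search with a cheaper, locally available reference is the paper's contribution and is deliberately not reproduced here.

```python
# Sketch of conventional adaptive-Viterbi (T-algorithm) pruning for one
# trellis step. Illustrative only: the global min() below is exactly the
# operation the relaxed adaptive Viterbi algorithm eliminates.

INF = float("inf")

def acs_step(metrics, branch, T):
    """metrics: survivor path metric per state (INF = pruned state).
    branch[s]: list of (prev_state, branch_metric) pairs into state s.
    Returns updated, pruned metrics."""
    # Add-compare-select: fully parallel across states in hardware.
    new = [min((metrics[p] + bm for p, bm in branch[s]), default=INF)
           for s in range(len(branch))]
    # Global best-metric search: this reduction over ALL states is the
    # serial bottleneck in a state-parallel decoder.
    best = min(new)
    # Prune every state whose metric exceeds best + T.
    return [m if m <= best + T else INF for m in new]

# Toy 4-state example with made-up branch metrics.
branch = {0: [(0, 1.0), (1, 0.5)], 1: [(0, 2.0), (1, 1.5)],
          2: [(2, 0.2), (3, 2.5)], 3: [(2, 1.0), (3, 0.1)]}
print(acs_step([0.0, 0.3, 0.7, INF], branch, T=0.5))
```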