NAND flash storage has proven to be a competitive alternative to traditional disk for its high random-access speeds, low power consumption, and presumed efficacy for random reads. Ironically, we demonstrate that when packaged in SSD format, many barriers prevent reads from reaching full parallelism, with the result that random writes outperform random reads. Motivated by this, we propose Physically Addressed Queuing (PAQ), a request scheduler that avoids contention for shared SSD resources. PAQ makes the following major contributions: First, it exposes the physical addresses of requests to the scheduler. Second, I/O clumping is used to select groups of operations that can be executed simultaneously without major resource conflict. Third, inter-request NAND transaction packing enables multi-plane-mode operations. We implement PAQ in a cycle-accurate simulator and demonstrate bandwidth and IOPS improvements greater than 62% and latency decreases of as much as 41.6% for random reads, without degrading the performance of other access types.
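The abstract only names the mechanisms, so as a rough illustration, the Python sketch below shows what "I/O clumping" over exposed physical addresses could look like: decode each request's physical page address into (channel, die, plane) coordinates, let requests on distinct dies proceed together in one batch, pack same-die, different-plane reads into a single multi-plane operation, and defer genuine conflicts. The geometry constants and function names here are assumptions for illustration, not PAQ's actual interface.

```python
from collections import defaultdict

# Hypothetical flash geometry; a real FTL would supply these values.
PAGES_PER_PLANE = 4096
PLANES_PER_DIE = 2
DIES_PER_CHANNEL = 4

def decode(ppa):
    """Split a physical page address into (channel, die, plane)."""
    ppa //= PAGES_PER_PLANE
    plane = ppa % PLANES_PER_DIE
    ppa //= PLANES_PER_DIE
    die = ppa % DIES_PER_CHANNEL
    channel = ppa // DIES_PER_CHANNEL
    return channel, die, plane

def clump(requests):
    """Greedily build one conflict-free batch: requests targeting distinct
    dies proceed in parallel, and reads to different planes of the same
    die are packed into a single multi-plane operation. Conflicting
    requests are deferred to a later batch."""
    batch = defaultdict(dict)          # (channel, die) -> {plane: request}
    deferred = []
    for req in requests:
        channel, die, plane = decode(req["ppa"])
        planes = batch[(channel, die)]
        if plane in planes:
            deferred.append(req)       # same-plane conflict on this die
        else:
            planes[plane] = req        # parallel die, or multi-plane pack
    return batch, deferred
```

In this sketch, any batch entry holding more than one request corresponds to a multi-plane operation; everything deferred waits for the next scheduling round rather than serializing the whole queue behind a single busy die.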
Abstract-With the increasing scaling of manufacturing technology, process variation has become more prevalent. As a result, in the context of Chip Multiprocessors (CMPs) for example, identically-designed processor cores on the same chip can have non-identical peak frequencies and power consumptions. One way to cope with such a design is to run every processor at the frequency of the slowest one, wasting computational capability. This paper considers an alternative approach and proposes an algorithm that intelligently maps (and remaps) computations onto the available processors so that each processor runs at its peak frequency. In other words, by dynamically changing the thread-to-processor mapping at runtime, our approach allows each processor to maximize its performance, rather than clocking the whole chip at the lowest frequency and highest cache latency among all cores. Experimental evidence shows that, compared to a process-variation-agnostic thread mapping strategy, our scheme achieves as much as a 29% improvement in overall execution latency, with an average improvement of 13% over the benchmarks tested. We also demonstrate that our savings are consistent across different processor counts, latency maps, and latency distributions.
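To make the idea concrete, here is a minimal Python sketch of a variation-aware mapping, assuming a simple greedy heuristic (heaviest thread to fastest core) rather than the paper's actual algorithm, and omitting the dynamic remapping step. All names and the frequency values are illustrative.

```python
def variation_aware_map(thread_work, core_freq):
    """Greedy, LPT-style heuristic (illustrative only): assign the heaviest
    thread to the fastest core so every core runs at its own peak
    frequency. thread_work[i] is thread i's work; core_freq[j] is core
    j's peak frequency."""
    threads = sorted(range(len(thread_work)), key=lambda t: -thread_work[t])
    cores = sorted(range(len(core_freq)), key=lambda c: -core_freq[c])
    return dict(zip(threads, cores))   # thread id -> core id

def makespan(mapping, thread_work, core_freq):
    """Finish time when each core runs at its peak frequency."""
    return max(thread_work[t] / core_freq[c] for t, c in mapping.items())

def agnostic_makespan(thread_work, core_freq):
    """Baseline: every core clocked at the slowest core's frequency."""
    f_min = min(core_freq)
    return max(w / f_min for w in thread_work)

work = [9.0, 7.0, 4.0, 2.0]
freq = [2.0, 1.8, 1.5, 1.2]            # GHz; varies across "identical" cores
m = variation_aware_map(work, freq)
print(makespan(m, work, freq))         # 4.5: each core at its own peak
print(agnostic_makespan(work, freq))   # 7.5: all cores at the slowest clock
```

Even this toy example shows where the savings come from: letting each core keep its own peak frequency shortens the critical path relative to the chip-wide-slowest baseline.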
Abstract-Compilers designed for current embedded systems must be capable of addressing multiple constraints such as low power, high performance, small memory footprint and form factor, and high reliability at the same time. In particular, optimizing for one constraint should be performed carefully, considering its impact on other constraints. Recent trends indicate that transient errors are becoming increasingly important in embedded systems. Focusing on an embedded chip multiprocessor and array-intensive applications, this paper demonstrates how reliability against transient errors can be improved without impacting execution time by utilizing idle processors for duplicating some of the computations of the active processors. It also shows how a balance between power savings and reliability improvement can be struck using a metric called the energy-delay-fallibility product. Our experimental results indicate that the "percentage of duplicated computations" is a useful high-level metric for studying the tradeoffs among performance, power, and reliability.
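The abstract does not define the metric formally, so the following Python sketch shows one plausible reading of the energy-delay-fallibility product and of the "percentage of duplicated computations" knob. The cost model is a toy assumption for illustration, not the paper's measured data.

```python
def edf_product(energy, delay, fallibility):
    """Energy-delay-fallibility product: the classic energy-delay product
    extended with a fallibility term (here, the fraction of computations
    left unprotected against transient errors). Lower is better."""
    return energy * delay * fallibility

def evaluate(dup_fraction, base_energy=1.0, base_delay=1.0):
    """Toy model (assumed, not from the paper): duplicating a fraction d
    of computations on otherwise-idle processors adds roughly d to
    energy, leaves delay unchanged (the copies run on idle cores), and
    leaves a fraction (1 - d) of the work unprotected."""
    energy = base_energy * (1.0 + dup_fraction)
    delay = base_delay
    fallibility = 1.0 - dup_fraction
    return edf_product(energy, delay, fallibility)

# Sweep the duplication percentage to expose the trade-off.
for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{d:.0%} duplicated -> EDF {evaluate(d):.3f}")
```

Under these assumptions the product falls as duplication rises, because reliability improves faster than energy grows; with a less favorable energy model the sweep would instead expose an interior optimum, which is exactly the balance the metric is meant to find.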