As transistors continue to shrink, they become exponentially more susceptible to permanent wearout faults. Without mitigation, these faults will render systems useless within unacceptably short time periods. Our work presents the design of a runtime task mapping subsystem that mitigates these faults using a wear-based heuristic. We compare our wear-based heuristic to power- and temperature-based heuristics used within the same system framework. Using a wide range of synthetic and real-world benchmarks, we show that our wear-based heuristic improves total system lifetime by an average of 7.1% over temperature-based heuristics. Additionally, we show that our wear-based heuristic can be used to drastically improve the time to the first component failure (TTFF) of a system. TTFF is of interest to designers who wish to avoid the design and verification difficulties of systems that are expected to recover after a component failure. Our wear-based heuristic improves TTFF by an average of 14.6% over temperature-based heuristics across all of our benchmarks. Our observations lead us to conclude that runtime, wear-based task mapping must be incorporated into systems for which lifetime is a primary design goal.
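The abstract does not spell out the heuristic itself, so the following is only a minimal sketch of what a wear-based mapping rule could look like: the most demanding task goes to the currently least-worn core. The `Core` class, the `map_tasks_by_wear` function, and the rule that wear grows linearly with task cost are all hypothetical assumptions made for illustration, not the authors' design.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    wear: float = 0.0  # accumulated wear estimate, arbitrary units (assumed)


def map_tasks_by_wear(task_costs, cores):
    """Greedy sketch: assign the heaviest remaining task to the least-worn core."""
    mapping = {}
    # Visit tasks from most to least demanding.
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        target = min(cores, key=lambda c: c.wear)
        mapping[task] = target.name
        # Hypothetical wear model: wear increases by the task's cost.
        target.wear += cost
    return mapping


cores = [Core("c0", wear=5.0), Core("c1", wear=1.0), Core("c2", wear=3.0)]
mapping = map_tasks_by_wear({"decode": 4.0, "filter": 2.0, "log": 0.5}, cores)
print(mapping)
```

A temperature-based variant of the same sketch would rank cores by a temperature reading instead of the `wear` field, which is the kind of comparison the abstract reports.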
Temperature-aware design is emerging as a popular approach to addressing a variety of challenges, including system lifetime. In the case of task mapping, temperature-aware approaches do improve lifetime because lifetime depends strongly on temperature. However, temperature-aware design neglects several important factors that also influence lifetime: (a) physical parameters such as supply voltage and current density, and (b) application and architecture characteristics that affect which failures are survivable. Only lifetime-aware task mapping can expose the relationship between physical parameters, component failure, and system lifetime, and therefore find lifetime-optimal mappings. To address this need, we have developed a new lifetime-aware task mapping technique based on ant colony optimization (ACO). Our technique produces task mappings resulting in lifetimes within 17.9% of the observed optimal results on average, outperforming a lifetime-agnostic task mapping approach by an average of 32.3%. We also observed that the lifetimes resulting from task mappings within 1% of the best maximum system temperature vary by an average of 20.1%, while the lifetimes resulting from task mappings within 1% of the best average system temperature vary by an average of 32.6%. Our observations lead us to conclude that one cannot depend on temperature-aware task mapping when system lifetime is a design constraint, but one may depend on lifetime-aware task mapping when lifetime, temperature, or both are design constraints.
As semiconductor manufacturing processes scale to ever smaller feature sizes, manufacturing faults and permanent component failures are challenging how systems are traditionally designed. Historically, a combination of careful process tuning and design rule specification has been sufficient to cost-effectively ensure that deterministic design practices eventually result in acceptable system yield and lifetime. However, as transistors and wires shrink, they are becoming more prone both to complete or parametric failure at manufacturing time and to degradation and total breakdown in the field, resulting in systems that are increasingly expensive to produce and less likely to function correctly for as long as intended. To address these growing challenges in system resilience, all systems, not only those intended for high-availability or mission-critical applications, must be designed with yield and lifetime in mind. This research is focused on the design-time, system-level architectural optimization of cost, lifetime, and yield in embedded network-on-chip-based multi-processor systems-on-chip (NoC-based MPSoCs). At the system level, the precise nature and timing of a fault is irrelevant because the fault results in the (possibly temporary) loss of an entire processor, memory, or interconnect module regardless. One advantage of managing failure at the system level is therefore that once the location of a failure has been identified, the cause can be abstracted away. Failures of different types may then be treated the same and addressed using the same techniques.
Based on this observation, we employ system-level slack (excess capacity in processor and memory nodes available to accommodate additional tasks in the event that other processors or memories are lost) as a general technique for mitigating MPSoC failure in the presence of either component manufacturing defects or permanent component failures. Given an application and a fixed NoC-based communication architecture, our goal is to cost-effectively perform slack allocation, distributing execution and storage slack such that, with high probability, sufficient resources remain for the system to continue to operate when manufacturing defects or permanent component failures occur. The design space for slack allocation is large and complex: it consists of every possible slack allocation (up to n^m for a system with n components and m possible alternatives in the component library). Furthermore, evaluating the lifetime of any single design is computationally expensive, requiring performance, power, and temperature evaluation for every possible combination of component failures. In one example we considered, an MPEG-4 decoder with 21 processors, 5 memories, and 10 switches, there are 1.6 billion possible slack allocations alone (given a fixed communication architecture), and each system lifetime evaluation took from 46.4 to 144.5 seconds. To address the complexity of slack allocation, we have developed Critical Quantity Slack Alloca...
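To make the notion of slack concrete, the sketch below checks whether a set of tasks can still be placed after any single processor is lost. It is an illustration under stated assumptions, not the slack allocation method the text goes on to name: the function names, the integer load and capacity units, and the first-fit-decreasing placement rule are all hypothetical. Because first-fit decreasing is a heuristic, a `False` result may be a false negative; an exact check would need an exhaustive or ILP-based placement.

```python
def can_host(task_loads, capacities):
    """First-fit-decreasing check: do all tasks fit in the given capacities?"""
    free = list(capacities)
    for load in sorted(task_loads, reverse=True):
        for i, room in enumerate(free):
            if load <= room:
                free[i] = room - load
                break
        else:
            return False  # no remaining processor has room for this task
    return True


def survives_any_single_failure(task_loads, capacities):
    """True if the tasks can still be placed after losing any one processor."""
    return all(
        can_host(task_loads, capacities[:i] + capacities[i + 1:])
        for i in range(len(capacities))
    )


tasks = [3, 3, 2]        # task loads, arbitrary units (assumed)
no_slack = [5, 3]        # capacity exactly matches the total load
with_slack = [5, 3, 8]   # one extra processor provides slack

print(can_host(tasks, no_slack))                      # fits while nothing fails
print(survives_any_single_failure(tasks, no_slack))   # but any loss is fatal
print(survives_any_single_failure(tasks, with_slack))
```

In this toy setting the extra processor is what lets the system ride out a failure; the cost of that extra capacity is the other side of the trade-off that slack allocation has to balance.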