As semiconductor manufacturing processes scale to ever smaller feature sizes, manufacturing faults and permanent component failures are challenging how systems are traditionally designed. Historically, a combination of careful process tuning and design rule specification has been sufficient to cost-effectively ensure that deterministic design practices result in acceptable system yield and lifetime. However, as transistors and wires shrink, they are becoming more prone both to complete or parametric failure at manufacturing time and to degradation and total breakdown in the field. The resulting systems are increasingly expensive to produce and less likely to function correctly for as long as intended. To address these growing challenges in system resilience, all systems, not only those intended for high-availability or mission-critical applications, must be designed with yield and lifetime in mind.

This research focuses on the design-time, system-level architectural optimization of cost, lifetime, and yield in embedded network-on-chip-based multiprocessor systems-on-chip (NoC-based MPSoCs). At the system level, the precise nature and timing of a fault are irrelevant, because any fault results in the (possibly temporary) loss of an entire processor, memory, or interconnect module. One advantage of managing failure at the system level is therefore that once the location of a failure has been identified, its cause can be abstracted away: failures of different types may be treated alike and addressed using the same techniques.
Based on this observation, we employ system-level slack (excess capacity in processor and memory nodes that can accommodate additional tasks if other processors or memories are lost) as a general technique for mitigating MPSoC failure in the presence of either component manufacturing defects or permanent component failures.

Given an application and a fixed NoC-based communication architecture, our goal is to perform slack allocation cost-effectively, distributing execution and storage slack such that, with high probability, sufficient resources remain for the system to continue to operate when manufacturing defects or permanent component failures occur. The design space for slack allocation is large and complex: it consists of every possible slack allocation (up to m^n for a system with n components and m possible alternatives in the component library, since each component can independently take any of the m alternatives). Furthermore, evaluating the lifetime of any single design is computationally expensive, requiring performance, power, and temperature evaluation for every possible combination of component failures. In one example we considered, an MPEG-4 decoder with 21 processors, 5 memories, and 10 switches, there are 1.6 billion possible slack allocations (given a fixed communication architecture), and each system lifetime evaluation took from 46.4 to 144.5 seconds.

To address the complexity of slack allocation, we have developed Critical Quantity Slack Alloca...
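
To put the size of this design space in perspective, the following back-of-the-envelope sketch estimates how long an exhaustive, purely serial search would take for the MPEG-4 example, using only the figures quoted above (1.6 billion allocations, 46.4 to 144.5 seconds per evaluation). The function names are illustrative, and the upper bound on the design space assumes that each of the n components independently takes one of m library alternatives; neither is taken from the original work.

```python
# Rough cost of exhaustively evaluating every slack allocation.
# All names here are hypothetical; only the numeric figures come from the text.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def design_space_upper_bound(num_components: int, num_alternatives: int) -> int:
    """Upper bound on allocations, assuming each component independently
    picks one of the available library alternatives."""
    return num_alternatives ** num_components


def exhaustive_search_years(num_designs: float, seconds_per_eval: float) -> float:
    """Years of serial compute needed to evaluate every design once."""
    return num_designs * seconds_per_eval / SECONDS_PER_YEAR


# Figures quoted for the MPEG-4 decoder example.
MPEG4_ALLOCATIONS = 1.6e9
EVAL_SECONDS_MIN = 46.4
EVAL_SECONDS_MAX = 144.5

years_low = exhaustive_search_years(MPEG4_ALLOCATIONS, EVAL_SECONDS_MIN)
years_high = exhaustive_search_years(MPEG4_ALLOCATIONS, EVAL_SECONDS_MAX)

if __name__ == "__main__":
    print(f"Serial exhaustive search: about {years_low:,.0f} "
          f"to {years_high:,.0f} years")
```

Even at the fastest reported evaluation time, a serial exhaustive search would take thousands of years, which is why slack allocation calls for a more selective exploration strategy.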