As a consequence of technology scaling, today's complex multi-processor systems have become increasingly susceptible to errors. To satisfy reliability requirements, such systems need methods to detect and tolerate errors. This entails two major challenges: (a) providing a comprehensive approach that ensures fault-tolerant execution of parallel applications across different types of resources, and (b) optimizing resource usage in the face of dynamic fault probabilities or varying fault tolerance needs of different applications. In this paper, we present a holistic and adaptive approach, based on invasive computing, that provides fault tolerance on a Multi-Processor System-on-a-Chip (MPSoC) on demand, according to application or environmental needs. We show how invasive computing may provide adaptive fault tolerance on a heterogeneous MPSoC, including hardware accelerators and communication infrastructure such as a Network-on-Chip (NoC). In addition, we present (a) compile-time transformations that automatically apply well-known redundancy schemes such as Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR) for fault-tolerant loop execution on a class of massively parallel processor arrays called Tightly Coupled Processor Arrays (TCPAs). Based on timing characteristics derived from our compilation flow, we further develop (b) a reliability analysis that guides the selection of a suitable degree of fault tolerance. Finally, we present (c) a methodology to detect and adaptively mitigate faults in invasive NoCs.
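For readers unfamiliar with the redundancy schemes named above, the following is a minimal generic sketch of DMR and TMR, not the compile-time transformation or the reliability analysis of the paper. The function names `dmr_check`, `tmr_vote`, and `tmr_reliability` are hypothetical, and the reliability formula is the standard textbook expression, which assumes independent replica failures and a perfect voter.

```python
from collections import Counter


def dmr_check(a, b):
    """DMR: compare two replica results.

    A mismatch detects an error, but DMR alone cannot tell which
    replica is wrong, so it cannot correct the result.
    """
    return a == b


def tmr_vote(a, b, c):
    """TMR: majority vote over three replica results.

    Masks a single faulty replica; raises if all three disagree.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica disagrees")
    return value


def tmr_reliability(r):
    """Textbook TMR reliability for replica reliability r.

    Assumes independent failures and a perfect voter: the system
    works if at least two of the three replicas are correct.
    """
    return 3 * r**2 - 2 * r**3
```

Under these assumptions, TMR improves on a single replica only when the replica reliability exceeds 0.5, which illustrates why the degree of redundancy is worth selecting rather than fixing in advance.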
The sustained advance in technology will enable the integration of hundreds of processing cores on a single die in the near future. However, it can already be foreseen that the management of the resources of such large systems will not scale in the same way as the hardware if today's entirely software-based and centralized management approaches are used. The invasive paradigm addresses this problem and proposes concepts to enable resource awareness and scalability in future multicore systems, focusing especially on the resource management perspective. These concepts are based on distributed and software-hardware partitioned resource management strategies. High-level management decisions made by software thereby trigger lower-level management strategies that are carried out autonomously in hardware. Sufficiently accurate modeling of the overall invasive system is required to study and optimize such a decentralized, software-hardware partitioned control loop, in which decisions depend significantly on dynamic runtime effects. Software-based simulation cannot deliver the required speed or accuracy, making FPGA-based prototyping of invasive systems necessary. This paper describes our prototyping concepts and discusses possible implementation alternatives for invasive multicore architectures.