Abstract-Recent studies indicate that the number of system failures caused by main memory errors is much higher than expected. In contrast to the commonly used hardware-based countermeasures, for example using ECC memory, softwarebased fault-tolerance measures are much more flexible and can exploit application knowledge, such as the criticality of specific data structures. This paper presents a software-based memory error protection approach, which we used to harden the eCos operating system in a case study. The main benefits of our approach are the flexibility to choose from an extensible toolbox of easily pluggable error detection and correction schemes as well as its very low runtime overhead, which totals in a range of 0.09-1.7 %. The implementation is based on aspect-oriented programming and exploits the object-oriented program structure of eCos to identify well-suited code locations for the insertion of generative fault-tolerance measures.
Recent studies indicate that transient memory errors (soft errors) have become a relevant source of system failures. This paper presents a generic software-based fault-tolerance mechanism that transparently recovers from memory errors in object-oriented program data structures. The main benefits are the flexibility to choose from an extensible toolbox of easily pluggable error detection and correction schemes, such as Hamming and CRC codes. This is achieved by a combination of aspect-oriented and generative programming techniques. Furthermore, we present a wait-free synchronization algorithm for error detection in data structures that are used concurrently by multiple threads of control. We give a formal correctness proof and show the excellent scalability of our approach in a multiprocessor environment. In a case study, we present our experiences with selectively hardening the eCos operating system and its benchmark suite. We explore the trade-off between resiliency and performance by choosing only the most vulnerable data structures for error recovery. Thereby, the total number of system failures, manifesting as silent data corruptions and crashes, is reduced by 69.14 % at a negligible runtime overhead of 0.36 %.
Abstract-Developers of embedded (real-time) systems can choose from a variety of operating systems. While some embedded operating systems provide very flexible APIs, e.g., a POSIXcompliant interface for run-time management, others have a completely static structure, which is generated at compile time by utilizing detailed application knowledge. A prominent example for the latter class from the domain of automotive operating systems is OSEK/OS and its successor AUTOSAR/OS. As we have shown in previous work, the design of the operating system has a strong impact on its vulnerability for system failure caused by hardware faults. This observation is gaining importance, because there is an ongoing trend towards low-power and low-cost, yet less reliable, hardware. This work quantifies the difference in vulnerability for soft errors in main memory of a flexible (dynamic) operating systems (eCos) and a static system (CiAO), which has an OSEK-compliant structure. We also analyze the additional degree of robustness that is achieved by hardening an operating system with software-based and hardware-based faulttolerance measures and the corresponding costs. Covering this design space gives developers a better chance for good design decisions with respect to the trade-off between fault tolerance, resource consumption, and interface convenience. Our results indicate that with a combination of hardware-and softwarebased fault-tolerance measures, silent data corruptions in both operating systems can be reduced to below one percent (compared to eCos). However, the analyzed fault-tolerance mechanisms are expensive for the dynamic system, whereas the statically designed operating system can be hardened at much lower price.
This paper presents a multi-layer software reliability approach that leverages multiple software layers (e. g., programming language, compiler, and operating system) to improve the overall system reliability considering unreliable or partly-reliable hardware. We present a comprehensive design flow that integrates multiple software layers while accounting for the knowledge from lower hardware layers. We show how multiple software layers synergistically operate to achieve a high degree of reliability.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.