This paper presents a dependability study of high-speed, switched Local Area Networks (LANs) using Myrinet as an example testbed (with theoretical speeds of 2.56 Gbps). The study uses results of two fault injection methods, simulated fault injection and software-implemented fault injection (SWIFI), to analyze the application-level impact of transient faults injected into the network interface hardware. The observed errors include dropped or corrupt messages, host interface or host resets, and local or remote host interface hangs. The paper presents the study in two parts: First, the results from the SWIFI method in the real system are used as a basis to validate the simulation and identify the major factors leading to differences between the methods. A comparison between the two injection methods shows that they agree for 83% of the fault injections. The results, however, vary greatly depending on the fault type considered. The study presents an analysis of the effects of varying workload intensity, host platform, and interface function targeted by the injection. One result of this analysis is that the function targeted has a significant impact on the fault activation rate. Finally, the study identifies two mechanisms by which faults may propagate from the interface to other parts of the network; in one example, the propagation caused the interface's host computer to reboot, while in another it caused a remote interface in the network to hang.
This paper presents an injection-based approach to analyzing the dependability of high-speed networks, using Myrinet as an example testbed. Instead of injecting faults related to network protocols, we injected faults into the host interface component, which performs the actual send and receive operations. The fault model used was a temporary single bit flip in an instruction executing on the host interface's custom processor, corresponding to a transient fault in the processor itself. Results show that more than 25% of the injected faults resulted in interface failures. Furthermore, we observed fault propagation from an interface to its host computer or to another interface to which it sent a message. These findings suggest that two important issues for high-speed networking in critical applications are protecting the host computer from errant or malicious interface components and implementing thorough message acceptance test mechanisms to prevent errant messages from propagating faults between interfaces.
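The single-bit-flip fault model described above can be illustrated with a minimal Python sketch. It is not the authors' injector: the 32-bit instruction width, the list standing in for the interface's instruction memory, and the names `flip_bit` and `transient_fault` are all assumptions made for illustration.

```python
from contextlib import contextmanager

WORD_BITS = 32  # assumed instruction width; not stated in the abstract
WORD_MASK = (1 << WORD_BITS) - 1


def flip_bit(word: int, bit: int) -> int:
    """Return `word` with exactly one bit inverted."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError(f"bit index {bit} outside 0..{WORD_BITS - 1}")
    return (word ^ (1 << bit)) & WORD_MASK


@contextmanager
def transient_fault(memory: list, address: int, bit: int):
    """Corrupt one instruction word temporarily, then restore it.

    `memory` is a hypothetical stand-in for the interface's instruction
    memory. The original word is written back on exit, which models a
    transient (non-permanent) fault.
    """
    original = memory[address]
    memory[address] = flip_bit(original, bit)
    try:
        yield memory[address]
    finally:
        memory[address] = original


if __name__ == "__main__":
    program = [0x10, 0x20, 0x30]
    with transient_fault(program, address=1, bit=0) as faulty:
        print(f"faulty word while injected: {faulty:#x}")
    print(f"word after restore: {program[1]:#x}")
```

Restoring the word on exit is the design choice that separates a transient fault from a permanent one: the corrupted instruction is visible only for the duration of the `with` block.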
This paper presents a hierarchical simulation methodology that enables accurate system evaluation under realistic faults and conditions. In this methodology, the effects of low-level (i.e., transistor- or circuit-level) faults are propagated to higher levels (i.e., the system level) using fault dictionaries. The primary fault model is obtained via simulation of the transistor-level effect of a radiation particle penetrating the device. The resulting current burst is used as a fault model in the circuit-level simulation and is injected into the nodes of a circuit or subcircuit. The latched outputs are collected in a fault dictionary and applied in conducting fault injection at the chip level under a selected workload. Faults injected at the chip level result in memory corruption, which is used as a fault model in the system-level simulation. When an application terminates, either normally or abnormally, the overall fault impact on the software behavior is quantified and analyzed. The simulation method is demonstrated and validated in a case study of Myrinet, a commercial, high-speed network. The study shows that the proposed approach offers high confidence in the evaluation results, as the system is analyzed in the presence of realistic fault conditions. It also demonstrates that the conducted analysis can be used to improve system dependability by identifying recovery mechanisms for failures observed during the experiments.
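The role of the fault dictionary as the link between simulation levels can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the paper's tool chain: the tuple format of the circuit-level results, the use of an XOR mask to represent latched output errors, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical dictionary key: (circuit node, injected charge level).
# Hypothetical value: XOR mask of the output bits latched incorrectly.
FaultDictionary = Dict[Tuple[str, str], int]


def build_fault_dictionary(
    circuit_results: List[Tuple[str, str, int]]
) -> FaultDictionary:
    """Collect latched-output errors from circuit-level runs.

    Runs whose current burst did not latch an error (mask == 0) are
    dropped, since they have no effect at the next level up.
    """
    return {
        (node, level): mask
        for node, level, mask in circuit_results
        if mask != 0
    }


def inject_chip_level(memory: List[int], address: int, mask: int) -> List[int]:
    """Apply one dictionary entry at the chip level.

    Returns a corrupted copy of `memory`; that copy plays the part of
    the memory-corruption fault model used by the system-level run.
    """
    corrupted = list(memory)
    corrupted[address] ^= mask
    return corrupted


def corrupted_words(before: List[int], after: List[int]) -> List[int]:
    """List the addresses whose contents differ after injection."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


if __name__ == "__main__":
    results = [("n1", "low", 0b000), ("n2", "high", 0b101)]
    dictionary = build_fault_dictionary(results)
    clean = [0, 0, 0]
    faulty = inject_chip_level(clean, address=1, mask=dictionary[("n2", "high")])
    print(corrupted_words(clean, faulty))
```

Each level consumes only the summary produced by the level below, so the expensive transistor- and circuit-level simulations do not have to be repeated for every chip-level or system-level experiment.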