This three-part paper analyzes existing approaches to and methods of organizing failure- and fault-tolerant computing in distributed multicomputer systems (DMCS), and identifies and justifies a list of issues to be solved. We review the application areas of failure- and fault-tolerant control systems for complex networked and distributed objects. The second part further investigates the issues of organizing failure and fault tolerance in the DMCS. Systemic, functional, and test diagnostics are viewed as the basis for building unattended failure- and fault-tolerant systems. We introduce the concept of self-managed degradation (in which the DMCS eventually proceeds to a safe shutdown at a critical level of degradation) as a means of extending the active life of the DMCS.
The paper deals with organizing the recovery of target work after admissible failures and faults in an automatic failure- and fault-tolerant multitask distributed multi-machine system with a network structure, which performs a set of target functions specified by external users. The system is characterized by parallel execution of a set of interacting target tasks on separate computer subsystems, each of which is an organized set of digital computers. The specified level of failure and fault tolerance of a task is provided by replication, i.e. parallel execution of copies of the task on several computers of the system, with exchange of results and selection of the correct one.
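The replication scheme described here (several copies of a task, exchange of results, choice of the correct one) can be illustrated with a simple majority vote. This is a minimal sketch, not the paper's method: the function name `vote` and the strict-majority rule are assumptions made for illustration, and the abstract does not say how the correct result is actually chosen.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None.

    Hypothetical helper: each element of `results` is the output of one
    copy of the same task, run on a different computer.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority means more than half of all reported results agree.
    return value if count > len(results) // 2 else None
```

With three replicas, `vote([42, 42, 17])` selects `42`, while `vote([1, 2, 3])` returns `None` because no value has a majority. A strict majority only masks faults as long as fewer than half of the replicas report a wrong value.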
The study introduces the characteristics, construction principles, and features of the considered systems, and their "philosophical" essence from the point of view of failure and fault tolerance. Within the research, we determined the factors that complicate the design of failure- and fault-tolerant systems of this class. The most general failure model is adopted, in which the behavior of a faulty computer can be arbitrary, can differ toward the different computers interacting with it, and can even appear malicious. We focus on one part of the problem of organizing dynamic redundancy in the developed system. This part arises after an admissible set of faults has been detected in some complex (or in some set of F complexes) by each of the fault-free digital computers of each such complex, and each fault has been synchronously and consistently identified, by place of origin and by type, as a software failure of a certain digital computer of that complex. It is solved by restoring all the necessary information in the digital computer identified as being in a state of software malfunction; the information is transmitted to this computer from the fault-free digital computers of the same complex. The list of instructions required for such a recovery, as well as the actions of the complex during the recovery process, is determined.
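The recovery step, in which a computer with a software failure receives its state from the other computers of its complex, might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the state is modeled as a flat dictionary, `recover_state` and `max_faulty` are hypothetical names, and the acceptance rule (a value is trusted once at least `max_faulty + 1` peers report it, so that at least one report comes from a fault-free peer) is a common technique rather than the procedure defined in the paper.

```python
from collections import Counter
from typing import Dict, Hashable, List

State = Dict[str, Hashable]


def recover_state(snapshots: List[State], max_faulty: int) -> State:
    """Rebuild a state from snapshots sent by the peers of one complex.

    A field is restored only if at least max_faulty + 1 peers report the
    same value. Assumes fault-free peers hold identical copies of the state.
    """
    keys = set().union(*snapshots)
    restored: State = {}
    for key in sorted(keys):
        reports = Counter(s[key] for s in snapshots if key in s)
        value, count = reports.most_common(1)[0]
        if count >= max_faulty + 1:
            restored[key] = value
    return restored
```

For example, given snapshots `{"pc": 10, "acc": 7}`, `{"pc": 10, "acc": 7}` and `{"pc": 99, "acc": 7}` with `max_faulty=1`, both fields are restored from the two agreeing peers. Fields with too few matching reports are left out rather than guessed.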
An algorithm is proposed to extract, where possible, from a network of arbitrary structure an environment of mutual informational agreement for the system, comprising the problem-solving complexes and the environment of intercomplex exchange. Such agreement is required in a computer network to organize the parallel solution of interacting problems with specified fault-tolerance characteristics based on dynamic redundancy. The proposed approach, methods, and algorithms are applicable to distributed systems, as well as to grid and "cloud" computing, for organizing computations with the desired confidence.
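The abstract does not give the algorithm itself. As a rough illustration of what extracting an agreement environment from an arbitrary network can involve, the sketch below searches a graph for one group of computers that are all directly linked to each other and large enough to tolerate a given number of arbitrary faults. The group size 3f + 1 is the classical lower bound for agreement among f arbitrarily faulty participants without message authentication. Requiring full pairwise connectivity is a simplifying assumption that is stronger than necessary, and `find_complex` and the graph representation are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Set, Tuple

Graph = Dict[str, Set[str]]  # node -> directly linked nodes (undirected)


def find_complex(graph: Graph, max_faulty: int) -> Optional[Tuple[str, ...]]:
    """Return one fully connected group of 3 * max_faulty + 1 nodes, or None.

    Brute-force search, suitable only for small illustrative networks.
    """
    size = 3 * max_faulty + 1
    # Only nodes with enough direct links can belong to such a group.
    candidates = sorted(n for n, nbrs in graph.items() if len(nbrs) >= size - 1)
    for group in combinations(candidates, size):
        if all(b in graph[a] and a in graph[b] for a, b in combinations(group, 2)):
            return group
    return None
```

A `None` result corresponds to the "if possible" caveat: not every network contains a group that satisfies the chosen conditions.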
The paper centers on the problems of developing failure- and fault-tolerant systems for controlling Earth remote sensing satellite constellations. The study defines the concept of a complex that performs a target task in a fail-safe manner, in this case the task of detecting a target event and monitoring its behavior and development, and gives a hierarchical structure of the satellite constellation. The findings show that dynamic redundancy is necessary: it can significantly lengthen the trajectory of self-controlled degradation and, accordingly, the active life of the satellite constellation. The complexity of the problem lies in ensuring the reliability of the results when a large number of target events, both natural and man-made, occur. The study introduces an approach that reduces hardware redundancy, i.e. monitors a larger number of events using a less powerful satellite constellation, and proves that the approach can be used without loss of system reliability.