SUMMARY We propose a high-availability server platform for communication software. When a server suffers a hardware failure, the system must be replaced to recover, but when it suffers a software failure, service downtime can be greatly reduced by recovering while initializing only the processes involved. The effectiveness of the platform was verified by applying it to three types of real-life systems. In concrete terms, compared with the systems before the platform was applied, service downtime was halved when all of the processes were initialized. In addition, by dividing the servers into domains and making the processes redundant, downtime was reduced by a further 40% (a total reduction of over 70%). We also verified that if an individual process failed, it could be recovered by restarting only that process while the other processes continued service. This recovery can be carried out by describing a Restart definition in a simple format, without modifying the application programs. When a failed process is detected, the platform analyzes the Restart definition, selects the required restart phase, and executes it. By categorizing the features of the processes that constitute communication software, we defined four types of restart phase: Individual restart, Group restart, All AP restart, and All restart. We also implemented state transitions that shift to a higher-phase restart when failure recovery is unsuccessful.
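The phase selection and escalation described in the summary can be illustrated with a minimal sketch. It is not the platform's implementation: the names `RestartPhase`, `RESTART_DEFINITION`, `recover`, and `try_restart`, the process names, the dictionary form of the Restart definition, the fallback phase for undefined processes, and the strictly linear escalation order are all assumptions made for illustration.

```python
from enum import IntEnum
from typing import Callable, Dict, List


class RestartPhase(IntEnum):
    """The four restart phases named in the summary, ordered low to high."""
    INDIVIDUAL = 1
    GROUP = 2
    ALL_AP = 3
    ALL = 4


# Hypothetical Restart definition: process name -> phase to try first.
# The real definition format is not given in the summary.
RESTART_DEFINITION: Dict[str, RestartPhase] = {
    "call_control": RestartPhase.INDIVIDUAL,
    "session_db": RestartPhase.GROUP,
}


def recover(
    process: str,
    definition: Dict[str, RestartPhase],
    try_restart: Callable[[str, RestartPhase], bool],
    default: RestartPhase = RestartPhase.ALL_AP,  # assumed fallback
) -> List[RestartPhase]:
    """Try the defined phase, then escalate one phase at a time on failure.

    Returns the list of phases attempted, ending with the one that succeeded.
    Raises RuntimeError if even the highest phase fails.
    """
    phase = definition.get(process, default)
    attempted: List[RestartPhase] = []
    while True:
        attempted.append(phase)
        if try_restart(process, phase):
            return attempted
        if phase is RestartPhase.ALL:
            raise RuntimeError(f"recovery of {process!r} failed at every phase")
        # Recovery was unsuccessful: shift to the next higher phase.
        phase = RestartPhase(phase + 1)
```

In this sketch, `try_restart` stands in for whatever mechanism actually restarts processes and reports success. A caller whose Individual and Group restarts both fail for `"call_control"` would see `recover` return the three phases Individual, Group, and All AP, in that order.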