A Concurrent Partial Snapshot Algorithm for Large-Scale and Dynamic Distributed Systems

Kim, Yonghwan; Araragi, Tadashi; Nakamura, Junya; Masuzawa, Toru

doi:10.1587/transinf.e97.d.65

Cited by 4 publications

(12 citation statements)

References 17 publications


“…In this section, we evaluate the performance of the proposed algorithm with the CSS algorithm [10,20]. The CSS algorithm is a representative of partial snapshot algorithms, as described in Section 2, and the two algorithms have the same properties: (1) the algorithms do not suspend an application execution on a distributed system while taking a snapshot, (2) the algorithms take partial snapshots (not snapshots of the entire system), (3) the algorithms can take multiple snapshots concurrently, and (4) the algorithms can handle dynamic network topology changes. In addition, both algorithms are based on the SSS algorithm.…”

Section: Discussion (mentioning)

confidence: 99%

“…In contrast, the SSS algorithm allows execution of any applications while a snapshot is taken, with some elaborate operations based on the communication-relation. Kim et al. proposed a new partial snapshot algorithm, named the Concurrent Sub-Snapshot (CSS) algorithm [11,21], based on the SSS algorithm. They called the problematic situation caused by the overlap of the subsystems a collision and presented an algorithm that can resolve collisions by combining colliding SSS algorithm instances.…”

Section: Related Work (mentioning)

confidence: 99%

“…Before showing the simulation results, we briefly explain the CSS algorithm. For details, please refer to the original paper [21].…”

Section: CSS Algorithm Summary (mentioning)

confidence: 99%


A cooperative partial snapshot algorithm for checkpoint‐rollback recovery of large‐scale and dynamic distributed systems and experimental evaluations

Nakamura, Kim, Katayama, et al. (2020). Concurrency and Computation. Self Cite

A distributed system consisting of a huge number of computational entities is prone to faults, because faults in a few nodes cause the entire system to fail. Consequently, fault tolerance of distributed systems is a critical issue. Checkpoint-rollback recovery is a universal and representative technique for fault tolerance; it periodically records the entire system state (configuration) to non-volatile storage, and the system restores itself using the recorded configuration when the system fails. To record a configuration of a distributed system, a specific algorithm known as a snapshot algorithm is required. However, many snapshot algorithms require coordination among all nodes in the system; thus, frequent executions of snapshot algorithms require unacceptable communication cost, especially if the systems are large. As a sophisticated snapshot algorithm, a partial snapshot algorithm has been introduced that takes a partial snapshot (instead of a global snapshot). However, if two or more partial snapshot algorithms are concurrently executed, and their snapshot domains overlap, they should coordinate, so that the partial snapshots (taken by the algorithms) are consistent. In this paper, we propose a new efficient partial snapshot algorithm with the aim of reducing communication for the coordination. In a simulation, we show that the proposed algorithm drastically outperforms the existing partial snapshot algorithm, in terms of message and time complexity.
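The overlap condition mentioned in the abstract above can be illustrated with a minimal sketch. This is not the algorithm of either paper; it only shows how one might detect that two concurrently initiated partial snapshots have overlapping snapshot domains and therefore need to coordinate. The function name `find_collisions`, the node labels, and the representation of a domain as a set of node IDs are all assumptions made for illustration.

```python
from itertools import combinations


def find_collisions(domains):
    """Return the pairs of initiators whose snapshot domains overlap.

    `domains` maps a (hypothetical) initiator ID to the set of node IDs
    in its partial snapshot domain. The result maps each overlapping
    pair of initiators to the set of nodes they share.
    """
    collisions = {}
    # Compare every unordered pair of initiators exactly once.
    for (a, nodes_a), (b, nodes_b) in combinations(sorted(domains.items()), 2):
        shared = nodes_a & nodes_b
        if shared:
            collisions[(a, b)] = shared
    return collisions


# Three concurrent partial snapshots; only the first two share a node.
domains = {
    "p1": {"p1", "p2", "p3"},
    "p5": {"p3", "p4", "p5"},
    "p8": {"p7", "p8"},
}
collisions = find_collisions(domains)
print(collisions)
```

In a real distributed setting no node has this global view of all domains; the overlap would have to be discovered through message exchange, which is where the coordination cost discussed in the abstract arises.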


“…For the practical implementation of the snapshot protocol, the system model must consider failures and asynchronous communication [27]. In [28], a partial snapshot algorithm for a subsystem, where multiple nodes concurrently initiate the snapshot algorithm, is proposed. In Snapify [29], a snapshot algorithm for offload applications on Xeon Phi manycore processors is proposed.…”

Section: Snapshot Protocols (mentioning)

confidence: 99%

“…Algorithm 1 shows the pseudocode of the proposed distributed snapshot algorithm for the active thread. Before starting a round, node i checks whether a consistent global state is collected for failedRound (lines 16-28). If the stateNodes data structure satisfies the conditions of the GS, node i saves the stateNodes data structure to latestSnapshot and builds the stateChannel data structure (lines 17-21).…”

Section: Details of the Algorithms (mentioning)

confidence: 99%

A Distributed Snapshot Protocol for Efficient Artificial Intelligence Computation in Cloud Computing Environments

Lim, Gil (2018). Symmetry

Many artificial intelligence applications often require a huge amount of computing resources. As a result, cloud computing adoption rates are increasing in the artificial intelligence field. To support the demand for artificial intelligence applications and guarantee the service level agreement, cloud computing should provide not only computing resources but also fundamental mechanisms for efficient computing. In this regard, a snapshot protocol has been used to create a consistent snapshot of the global state in cloud computing environments. However, the existing snapshot protocols are not optimized in the context of artificial intelligence applications, where large-scale iterative computation is the norm. In this paper, we present a distributed snapshot protocol for efficient artificial intelligence computation in cloud computing environments. The proposed snapshot protocol is based on a distributed algorithm to run interconnected multiple nodes in a scalable fashion. Our snapshot protocol is able to deal with artificial intelligence applications, in which a large number of computing nodes are running. We reveal that our distributed snapshot protocol guarantees the correctness, safety, and liveness conditions.


Region-based Sub-Snapshot (RegSnap): Enhanced Fault Tolerance in Distributed Stream Processing with Partial Snapshot

Takdir, Kitagawa, Amagasa (2022). 2022 IEEE International Conference on Big Data (Big Data)
