Most existing algorithms for parallel or distributed reduction operations are not able to handle temporary or permanent link and node failures. Only recently, methods were proposed which are in principle capable of tolerating link and node failures as well as soft errors like bit flips or message loss. A particularly interesting example is the push-flow algorithm. However, on closer inspection, it turns out that in this method failure recovery often implies severe performance drawbacks. Existing mechanisms for failure handling may lead to a fall-back to an early stage of the computation and consequently slow down convergence, or even prevent convergence if failures occur too frequently. Moreover, state-of-the-art fault-tolerant distributed reduction algorithms may experience accuracy problems even in failure-free systems. We present the push-cancel-flow (PCF) algorithm, a novel algorithmic enhancement of the push-flow algorithm. We show that the new push-cancel-flow algorithm exhibits superior accuracy, performance and fault tolerance compared to all other existing distributed reduction methods. Moreover, we employ the novel PCF algorithm in the context of a fully distributed QR factorization process and illustrate that the improvements achieved at the reduction level directly translate to higher-level matrix operations, such as the considered QR factorization.
The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to node failures compared to existing aggregation methods. On a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.
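The abstract does not spell out the update rule, so the following is only a minimal sketch of flow-based gossip averaging on a hypercube, under stated assumptions: each node keeps one flow variable per neighbor, sends half of its current estimate to a random neighbor by adding it to that flow, and the receiver overwrites its mirrored flow with the negated value. The function names, the sequential schedule, and the failure-free setting are all hypothetical choices for illustration, not the authors' implementation.

```python
import random


def hypercube_neighbors(i, dim):
    """Neighbors of node i in a dim-dimensional hypercube (flip one bit)."""
    return [i ^ (1 << b) for b in range(dim)]


def push_flow_average(values, dim, rounds, seed=0):
    """Illustrative flow-based gossip averaging (assumed update rule).

    Returns a list of (value_estimate, weight_estimate) pairs, one per node.
    The local average estimate of a node is value_estimate / weight_estimate.
    """
    n = len(values)
    assert n == 1 << dim, "number of nodes must be 2**dim"
    rng = random.Random(seed)

    # flows[i][j] = [value_flow, weight_flow] that node i has pushed to j.
    flows = [
        {j: [0.0, 0.0] for j in hypercube_neighbors(i, dim)}
        for i in range(n)
    ]

    def estimate(i):
        # Local estimate = initial data minus everything pushed away.
        v = values[i] - sum(f[0] for f in flows[i].values())
        w = 1.0 - sum(f[1] for f in flows[i].values())
        return v, w

    for _ in range(rounds):
        for i in range(n):
            k = rng.choice(hypercube_neighbors(i, dim))
            v, w = estimate(i)
            # Push half of the current estimate along the edge (i, k).
            flows[i][k][0] += v / 2
            flows[i][k][1] += w / 2
            # Delivery: the receiver mirrors the flow with opposite sign,
            # so flows[k][i] == -flows[i][k] and total mass is conserved.
            flows[k][i] = [-flows[i][k][0], -flows[i][k][1]]

    return [estimate(i) for i in range(n)]
```

Because the receiver overwrites its flow instead of accumulating a delta, a lost or corrupted message in this sketch would be repaired by the next successful message on the same edge. That is the intuition behind the resilience claim, although this sketch does not simulate failures.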
In recent years, several gossip-based aggregation algorithms have been developed which focus on providing resilience in failure-prone distributed systems. The main objective of such algorithms is the efficient in-network computation of aggregates even when system failures occur during runtime. In this paper, we evaluate the performance and limitations in practical computations of those gossip-based aggregation algorithms with the most promising theoretical fault tolerance properties. Theoretical analyses of these algorithms usually address only the ability in principle to handle or overcome a certain kind of system failure. Most of the time, there are no formal results on the concrete impact of failure handling on the performance of the algorithms, e.g., in terms of convergence speed. This leaves a wide gap between theory and practice, as we illustrate in this paper. In order to bridge this gap, we first categorize common system failures of interest. Then, we experimentally investigate how well these common failure types are handled in practice by the considered algorithms and to what extent these state-of-the-art methods provide a reasonable degree of fault tolerance in practice. Our experimental studies reveal (i) that certain failure handling approaches which work in theory exhibit unacceptable performance in practice and (ii) that in some cases the failure handling mechanisms used introduce new problems, e.g., numerical inaccuracy. Our investigations illustrate that for some failure types (such as permanent node failures) further algorithmic advances are required to achieve resilience with a reasonably small overhead and acceptable performance.
Highlights
► We present the push-flow algorithm (PF), a new distributed data aggregation algorithm (DDAA).
► PF has better resilience properties than previously existing DDAAs.
► PF has very good asymptotic scaling behavior on hypercube topologies.
► Based on DDAAs, we design the new rdmGS algorithm for orthogonalizing a set of vectors in a decentralized distributed fashion.
► rdmGS is capable of producing fully accurate results even if several nodes fail permanently.