The probability of failures in software distributed shared memory (SDSM) increases as the system size grows. This paper introduces a new, efficient message logging technique, called the coherence-centric logging (CCL) and recovery protocol, for home-based SDSM. Our CCL minimizes failure-free overhead by logging only the data necessary for correct recovery, and it tolerates high disk access latency by overlapping disk accesses with the coherence-induced communication that already exists in home-based SDSM. Our recovery protocol reduces the recovery time by prefetching data according to future shared memory access patterns, thus eliminating the memory miss idle penalty during the recovery process. To the best of our knowledge, this is the first work that considers crash recovery in home-based SDSM. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our CCL against traditional message logging (ML) by modifying TreadMarks, a state-of-the-art SDSM system, to support the home-based protocol and then implementing both our CCL and the ML protocols in it. The experimental results show that our CCL protocol consistently outperforms the ML protocol: our protocol increases the execution time by only 1% to 6% during failure-free execution, while the ML protocol incurs an execution time overhead of 9% to 24% due to its large log size and high disk access latency. Our recovery protocol improves the crash recovery speed by 55% to 84% when compared to re-execution, and it outperforms ML-recovery by a noticeable margin, ranging from 5% to 18% for the parallel applications examined.
Fault-tolerant techniques that can cope with system failures in software distributed shared memory (SDSM) are essential for creating productive and highly available parallel computing environments on clusters of workstations. In this paper, we propose a new, efficient coordinated checkpointing technique, called coherence-based coordinated checkpointing (CCC), for SDSM. Our CCC minimizes both the checkpointing overhead during failure-free execution and the cost of recovery from failures by leveraging existing coherence information maintained by SDSM. In the presence of system failures, it allows SDSM to recover from the most recent checkpoint, saving re-computation time. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our CCC technique against both simple coordinated checkpointing (SCC) and incremental coordinated checkpointing (ICC) techniques by implementing all three techniques in TreadMarks, a state-of-the-art SDSM system. The experimental results demonstrate that our CCC technique consistently outperforms both the SCC and ICC techniques. In particular, our technique increases the execution time by only 0.5% to 4% for a 2-minute checkpointing interval during failure-free execution, while the SCC and ICC techniques incur execution time overheads of 4% to 100% and 3% to 64%, respectively, for the same checkpointing interval.
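To make the difference between the two baselines concrete, the following is a minimal, single-process sketch of full versus incremental checkpointing over a paged memory. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `PagedMemory`, `full_checkpoint`, `incremental_checkpoint`, and `restore` are hypothetical, cross-node coordination is omitted, and the coherence-based selection used by CCC is not modeled because the abstract does not describe it.

```python
class PagedMemory:
    """Toy paged memory that remembers which pages were written (hypothetical)."""

    def __init__(self, num_pages):
        self.pages = [b"\x00"] * num_pages
        self.dirty = set()  # pages modified since the last checkpoint

    def write(self, page, data):
        self.pages[page] = data
        self.dirty.add(page)


def full_checkpoint(mem):
    """SCC-style: save every page, whether or not it changed."""
    mem.dirty.clear()
    return dict(enumerate(mem.pages))


def incremental_checkpoint(mem):
    """ICC-style: save only the pages written since the last checkpoint."""
    delta = {page: mem.pages[page] for page in sorted(mem.dirty)}
    mem.dirty.clear()
    return delta


def restore(num_pages, base, deltas):
    """Rebuild memory from a full checkpoint plus later deltas, oldest first."""
    pages = dict(base)
    for delta in deltas:
        pages.update(delta)
    return [pages[i] for i in range(num_pages)]


if __name__ == "__main__":
    mem = PagedMemory(4)
    base = full_checkpoint(mem)
    mem.write(1, b"a")
    first = incremental_checkpoint(mem)
    mem.write(3, b"b")
    second = incremental_checkpoint(mem)
    print(len(base), len(first), len(second))
    print(restore(4, base, [first, second]) == mem.pages)
```

The incremental variant writes fewer pages per checkpoint but needs the base checkpoint and every later delta at recovery time, which is the usual trade-off between checkpoint size and recovery cost.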
This paper introduces an efficient barrier synchronization algorithm based on the binomial spanning tree (BST) and proposes a data transfer reduction technique for distributed shared memory systems under release consistency. The introduced BST-based barrier algorithm parallelizes and distributes the workload among participating processors, alleviating network contention and yielding fewer retransmissions. As a result, performance improves, and the degree of improvement increases quickly as the number of participants grows. Our barrier algorithm and data transfer reduction technique are incorporated in TreadMarks for evaluation using various benchmarks on a network of workstations and the IBM SP machine. Experimental results are gathered and demonstrated.

Introduction

Distributed memory systems usually exhibit poor programmability and portability, because all data partitioning and explicit communication must be done by the programmer. Distributed shared memory (DSM) has emerged to overcome this difficulty by providing a global, single address space on top of physically distributed memory systems. DSM combines the ease of the shared memory programming paradigm with the scalability and constructability of distributed memory systems, such as the network of workstations (NOW) and distributed memory multiprocessors. Due to its potential advantages, DSM has been an active research area, with many prototype systems implemented and demonstrated [11, 13].

The address space of a DSM system is distributed across memories at interconnected processors. To reduce traffic over the network, a replica of certain data stored at a remote processor is usually created in the local cache of a processor.
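The shape of the binomial spanning tree named in the abstract above can be sketched with the standard bit-manipulation numbering, in which a node's parent is obtained by clearing its lowest set bit. This is a sketch under assumptions: the text does not give the paper's exact numbering or message protocol, and the function names `bst_parent`, `bst_children`, and `barrier_profile` are hypothetical. In a tree barrier of this kind, each node would wait for arrival messages from its children, notify its parent, and later forward the release downward.

```python
def bst_parent(rank):
    """Parent in a binomial tree rooted at rank 0, or None for the root."""
    if rank == 0:
        return None
    return rank & (rank - 1)  # clear the lowest set bit


def bst_children(rank, n):
    """Children of `rank` among ranks 0..n-1, in increasing order."""
    children = []
    bit = 1
    # The root may set any bit; other nodes may only set bits below
    # their own lowest set bit (rank & -rank isolates that bit).
    while rank + bit < n and (rank == 0 or bit < (rank & -rank)):
        children.append(rank + bit)
        bit <<= 1
    return children


def barrier_profile(n):
    """Return (root_fan_in, tree_depth, total_messages) for one barrier."""
    root_fan_in = len(bst_children(0, n))
    # Each step toward the root clears one bit, so depth equals popcount.
    tree_depth = max(bin(r).count("1") for r in range(n))
    total_messages = 2 * (n - 1)  # one arrival and one release per non-root
    return root_fan_in, tree_depth, total_messages


if __name__ == "__main__":
    n = 8
    for r in range(n):
        print(r, "parent:", bst_parent(r), "children:", bst_children(r, n))
    print(barrier_profile(n))
```

With eight participants, the root in this numbering has three children, whereas a flat scheme in which every participant reports to one manager would deliver seven arrival messages to that single node. The total message count is the same in both cases; what changes is how the load is spread across nodes.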
Multiple copies of data, while improving read performance, create the need for coherence enforcement, which can be achieved through hardware, software, or a combination of the two [13]. The software implementation is the most attractive, since it involves no added hardware, which tends to be machine-dependent and expensive. The data transfer due to coherence enforcement under a software implementation, however, has to be minimized. To this end, the release consistency (RC) model [5] is a preferred choice for a software DSM implementation due to its low coherence traffic. RC does not guarantee that shared memory is consistent all of the time; rather, it ensures consistency only after synchronization operations. This naturally results in lower communication traffic due to coherence than a more restrictive...

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or fee. ICS 97 Vienna Austria. Copyright 1997 ACM 0-89791-902-5/97..$3.50