A Distributed Shared Memory (DSM) system provides a distributed application with a shared virtual address space. This article proposes a design for implementing the DSM communication layer on top of the Virtual Interface Architecture (VIA), an industry standard for user-level networking protocols on high-speed clusters. User-level communication protocols operate in user mode, which removes the operating system kernel from the critical communication path and significantly reduces communication overhead as a result. We analyze VIA's facilities and limitations in order to ascertain which implementation trade-offs are best applied to our development of an efficient communication substrate optimized for DSM requirements. We then implement a multithreaded version of the Home-based Lazy Release Consistency (HLRC) protocol on top of this substrate. In addition, we compare the performance of this HLRC protocol with that of a Sequential Consistency (SC) protocol that uses the MULTIVIEW (MV) memory mapping technique. This technique enables fine-grained access to shared memory while still relying on the virtual memory hardware to track memory accesses. We perform an 'apples-to-apples' comparison on the same testbed environment and benchmark suite, and investigate the effectiveness and scalability of both protocols.
35:755-786
synchronization points are reached; that is, between these synchronization points, the shared memory may appear inconsistent to different processors. These alternate models guarantee, for properly-labeled [4] programs, results equivalent to those of a sequentially consistent system. Informally, a program is properly labeled if it contains enough synchronization to avoid data races. Synchronization operations are divided into ACQUIRE and RELEASE operations, used respectively to obtain and yield exclusive access to shared data.
These operations can be thought of as standard lock operations.
Lazy Release Consistency (LRC) [5,6] is a refinement of the Release Consistency (RC) model [4]. The RC model requires that shared memory accesses be performed globally only upon a RELEASE operation. The idea of LRC is to make those accesses visible only to the processor that next acquires the lock, rather than performing all operations globally. A home-based implementation of LRC (HLRC) was proposed by Iftode [10]. In this implementation, each shared page has an assigned home node that always hosts the most up-to-date contents of the page. A non-home node that needs an up-to-date version fetches these contents from the home node.
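The interplay between ACQUIRE, RELEASE, and the home node can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the protocol implemented in this work: the class and method names (`HomeNode`, `Worker`, `demo`) are hypothetical, whole pages are sent to the home on release, every clean cached copy is dropped on acquire, and no real lock or network is involved.

```python
class HomeNode:
    """Hypothetical home node: holds the up-to-date copy of each page."""

    def __init__(self):
        self.pages = {}  # page_id -> bytes

    def fetch(self, page_id):
        # A non-home node asks for the current contents of a page.
        return self.pages.get(page_id, b"")

    def update(self, page_id, data):
        # Assumption: whole-page update, to keep the sketch short.
        self.pages[page_id] = data


class Worker:
    """Hypothetical non-home node with a local page cache."""

    def __init__(self, home):
        self.home = home
        self.cache = {}     # page_id -> locally cached contents
        self.dirty = set()  # pages written since the last release

    def acquire(self):
        # Assumption: invalidate every clean cached page, so the next
        # access to any of them must go back to the home node.
        self.cache = {p: d for p, d in self.cache.items() if p in self.dirty}

    def release(self):
        # Push local modifications to the home node.
        for page_id in sorted(self.dirty):
            self.home.update(page_id, self.cache[page_id])
        self.dirty.clear()

    def read(self, page_id):
        if page_id not in self.cache:  # simulated page fault
            self.cache[page_id] = self.home.fetch(page_id)
        return self.cache[page_id]

    def write(self, page_id, data):
        self.cache[page_id] = data
        self.dirty.add(page_id)


def demo():
    home = HomeNode()
    a, b = Worker(home), Worker(home)

    first = b.read(0)            # b caches the initial (empty) page
    a.acquire()
    a.write(0, b"x=1")
    a.release()                  # the home node now holds the new contents
    before_acquire = b.read(0)   # b still sees its stale cached copy
    b.acquire()
    after_acquire = b.read(0)    # b refetches from the home node
    return first, before_acquire, after_acquire


if __name__ == "__main__":
    print(demo())  # (b'', b'', b'x=1')
```

The middle read shows the lazy aspect: the write performed by `a` becomes visible to `b` only after `b` itself performs an acquire, not at the moment `a` releases.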
Contribution
This work compares the runtime performance of two multithreaded memory coherence protocols: a multithreaded implementation of the HLRC model and an efficient multithreaded implementation of the SC model that uses the MULTIVIEW [11,12] memory mapping technique. We also examine and compare the scalability of these protocols under a multithreaded mode of execution. Previous studies proposed non-preemptive multithreading [13] or creati...