Leveraging hardware message passing for efficient thread synchronization

Petrović, Darko; Ropars, Thomas; Schiper, André

doi:10.1145/2692916.2555251

Cited by 3 publications

(6 citation statements)

References 28 publications

“…To remove the contention on locks and scale on massively parallel systems, message‐based counter technique allocates an extra thread running on a unique processor as an agent. The thread receives update requests from worker threads, and access counters on behalf of them.…”

Section: Related Work (mentioning)

confidence: 99%
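The agent-thread idea in the statement above can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the names `CounterAgent`, `add`, `read`, and `run_workers` are hypothetical, and a software `queue.Queue` stands in for whatever message channel the real system uses.

```python
import queue
import threading


class CounterAgent:
    """Dedicated thread that owns every counter; workers only send messages."""

    _STOP = object()  # sentinel telling the agent thread to exit

    def __init__(self):
        self._requests = queue.Queue()  # assumed stand-in for the message channel
        self._counters = {}             # touched only by the agent thread
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        while True:
            msg = self._requests.get()
            if msg is self._STOP:
                break
            kind, name, payload = msg
            if kind == "add":
                self._counters[name] = self._counters.get(name, 0) + payload
            elif kind == "read":
                # For reads, payload is a reply queue owned by the caller.
                payload.put(self._counters.get(name, 0))

    def add(self, name, delta=1):
        """Fire-and-forget update request; the worker does not wait."""
        self._requests.put(("add", name, delta))

    def read(self, name):
        """Blocking read; served after all requests enqueued before it."""
        reply = queue.Queue(maxsize=1)
        self._requests.put(("read", name, reply))
        return reply.get()

    def stop(self):
        self._requests.put(self._STOP)
        self._thread.join()


def run_workers(agent, n_workers=4, updates_per_worker=1000):
    """Hypothetical driver: several workers update one counter via the agent."""
    def work():
        for _ in range(updates_per_worker):
            agent.add("packets")

    threads = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return agent.read("packets")
```

Because the request queue is first-in, first-out, a read issued after the workers have finished is served only after all of their updates, so the counter itself never needs a lock.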

“…Counting algorithm can be formulated as a reader‐writer problem, which can be solved by properly protecting the shared resource, counter, to make the read and write operations atomic to avoid intermediate states. Over the past decades, a variety of counting algorithms have been invented to meet requirements in algorithm efficiency, memory usage, and consistency of counting results. In the past decade, driven by Moore's Law, computer hardware and its parallelism evolved sharply.…”

Section: Introduction (mentioning)

confidence: 99%

Accurate counting algorithm for high‐speed parallel applications

Wang, Xiong (2018)

Concurrency and Computation

Summary: Statistical counter offers the appeal of an efficient and scalable counting mechanism on multi‐core architectures where parallelism has been increasing sharply. Statistical counter has been widely used in practice (eg, in high‐end network devices to count the number of packets received) despite the truth that it can only provide weak consistency guarantee on the counting results it returns, that is, statistical counter could miscount and the returned results may be inaccurate. As hardware and its parallelism advances, the miscount issue has raised concerns in both industry and academy. This paper is motivated by this real‐world miscount issue that we were facing when building a high‐speed intrusion detection system on a commercial multi‐core server with 40Gbps NICs. To tackle the problem, we first systematically analyze the miscount issue and quantify the miscounts in counting results. Then, we present a novel counting algorithm that (1) is competitive to statistical counter in performance on multi‐core architectures and (2) provides strong consistency guarantee on counting results returned. Experiments show that it takes the new counting algorithm 10ns and 1,500ns to perform an update and a read operation, respectively. Moreover, the counting results returned are accurate.
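For readers unfamiliar with the term, a statistical counter is commonly built from one private slot per thread, summed on read. The sketch below assumes that common design; it is not the algorithm proposed in the paper, and `StatisticalCounter` and `count_in_parallel` are hypothetical names.

```python
import threading


class StatisticalCounter:
    """One private slot per worker; reads sum the slots without any lock."""

    def __init__(self, n_workers):
        self._slots = [0] * n_workers

    def update(self, worker_id, delta=1):
        # Each slot has exactly one writer, so the update needs no lock.
        self._slots[worker_id] += delta

    def read(self):
        # On real multi-core hardware this scan races with concurrent
        # updates, so the sum may not match any single instant in time.
        return sum(self._slots)


def count_in_parallel(n_workers=4, updates_per_worker=10_000):
    """Hypothetical driver: each worker increments only its own slot."""
    counter = StatisticalCounter(n_workers)

    def work(worker_id):
        for _ in range(updates_per_worker):
            counter.update(worker_id)

    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers have finished, so this particular read is exact.
    return counter.read()
```

Updates are cheap because they never contend, while a read taken during ongoing updates offers only the weak consistency the summary describes.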

“…However, for emerging many-core processors, conventional coherent cache architecture has become more and more complex and it is very hard to achieve high performance [32]. A novel architectural feature, Explicit inter-core Message Passing (EMP), has gained popularity in research and even been used in some product many-core processors, such as TILE-Gx8036 [30] and SW26010 [13]. The Sunway TaihuLight [1] supercomputer is powered by SW26010 that uses EMP instead of coherent cache to share data among cores.…”

Section: Introduction (mentioning)

confidence: 99%

“…Lock-free | Atomic instructions | Very Low | High
Delegation (shm version) [25] | Lock-free | Atomic instructions | Low | High
Transactional Memory [29] | Lock-free | Transactional memory instructions | Medium | Conflict rate dependent
Spinlock [4] | Lock-based | Atomic instructions | High | Low
POSIX mutex lock | Lock-based | OS dependent | High | Medium
Queue-lock [27] | Lock-based | Atomic instructions | High | Medium
Delegation (EMP version) [30] | Lock

EMP has been used to accelerate the request sending routine in RCL [30], and the performance is improved by 4.3×. On Sunway Taihulight [13], concurrent data classification obtains a speedup of 19.15× after changing synchronization method from shared-memory based locking to EMP based delegation [24].…”

mentioning

confidence: 99%

pLock

Tang, Zhai, Qian, et al. (2019)

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

Synchronization is a significant issue for multi-threaded programs. Mutex lock, as a classic solution, is widely used in legacy programs and is still popular for its intuition. The SW26010 architecture, deployed on the supercomputer Sunway Taihulight, introduces hardware-supported inter-core message passing mechanism and exposes explicit interfaces for developers to use its fast on-chip network. This emerging architectural feature brings both opportunities and challenges for mutex lock implementation. However, there is still no general lock mechanism optimized for architectures with this new feature. In this paper, we propose pLock, a fast lock designed for architectures that support Explicit inter-core Message Passing (EMP). pLock uses partial cores as lock servers and leverages the fast on-chip network to implement high-performance mutual exclusive locks. We propose two new techniques, chaining lock and hierarchical lock, to reduce message count and mitigate network congestion. We implement and evaluate pLock on an SW26010 processor. The experimental results show that our proposed techniques improve the performance of EMP-lock by up to 19.4× over a basic design.

CCS Concepts: Computer systems organization → Multicore architectures; Processors and memory architectures. Software and its engineering → Multithreading; Mutual exclusion; Concurrency control.

Fast and Portable Locking for Multicore Architectures

Lozi, David, Thomas, et al. (2016)

ACM Trans. Comput. Syst.

The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun Ultrasparc T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.
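The core idea of the abstract, replacing a lock acquisition with a remote procedure call to a dedicated server thread, can be sketched roughly as below. This is an illustration under stated assumptions rather than RCL itself: `CriticalSectionServer` and `execute` are hypothetical names, and a Python thread with a software queue stands in for the dedicated hardware thread and its optimized request path.

```python
import queue
import threading
from concurrent.futures import Future


class CriticalSectionServer:
    """Runs every submitted critical section on one dedicated thread."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            item = self._inbox.get()
            if item is None:  # shutdown request
                return
            func, args, fut = item
            try:
                # Sections run one at a time, so they are mutually exclusive
                # and the data they touch stays with the server thread.
                fut.set_result(func(*args))
            except Exception as exc:
                fut.set_exception(exc)

    def execute(self, func, *args):
        """Client side: ship the critical section and wait for its result."""
        fut = Future()
        self._inbox.put((func, args, fut))
        return fut.result()

    def shutdown(self):
        self._inbox.put(None)
        self._thread.join()
```

A client that would otherwise lock, update shared state, and unlock instead calls `server.execute(update_fn, arg)`; the update function then needs no lock of its own as long as all access to that state goes through the server.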
