HPC runtime support for fast and power efficient locking and synchronization

Akkan, Hakan; Lang, Michael; Ionkov, Latchesar

doi:10.1109/cluster.2013.6702659

Cited by 11 publications

(9 citation statements)

References 6 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In [15], authors proposed and implemented locking primitives using MONITOR/MWAIT instructions and showed in the evaluation using OpenMP, that the method using MONITOR/MWAIT instructions achieves higher scalability and higher energy efficiency.…”

Section: Related Workmentioning

confidence: 99%

Progression of MPI Non-blocking Collective Operations Using Hyper-Threading

Miwa

Nakashima

2015

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

View full text Add to dashboard Cite

MPI non-blocking collective operations offer a highlevel interface to MPI library users, and potentially allow communication to be overlapped with calculation. Progression, which controls communications running in the background of the calculation, is the key factor to achieve an efficient overlap. The most commonly used progression method is manual progression, in which a progression function is called in the main calculation. In manual progression, MPI library users have to estimate the communication timing to maximize the overlap effect and thus to manage the complex communication optimization. An alternative approach for progression is the use of separate communication threads. By using communication threads, communicationcalculation overlap can be achieved simply. However, context switches between the calculation thread and the communication thread cause lower performance in the frequent case where all cores are used for calculation.In this paper, we propose a novel threaded progression method using Hyper-Threading to maximize the overlap effect of non-blocking collective operations. We apply MONITOR/MWAIT instructions to the communication thread on Hyper-Threading so as not to degrade the calculation thread due to shared coreresource conflict. Evaluation on 8-node InfiniBand connected IA server clustered systems confirmed that the latency is suppressed to a small level and that our approach has an advantage over manual progression in terms of communication-calculation overlap. Using a real application of CG benchmark, our method achieved 32% reduction in execution time compared to using blocking collective operation, and that is nearly perfect overlap. Although manual progression also achieved perfect overlap, our method has the advantage that no communication timing tuning is required for each application.

show abstract

Section: Related Workmentioning

confidence: 99%

Progression of MPI Non-blocking Collective Operations Using Hyper-Threading

Miwa

Nakashima

2015

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

View full text Add to dashboard Cite

show abstract

“…There have been a few studies that focused on energy consumption due to spinning waste. One spinner lock [35] intended to resolve scalability issues and energy wastes in OpenMP has been proposed. The MONITOR/MAIT combination was applied to the user-level spinlock mechanisms used in the OpenMP framework to improve the energy efficiency of the OpenMP applications.…”

Section: Related Workmentioning

confidence: 99%

Catnap: A Backoff Scheme for Kernel Spinlocks in Many-Core Systems

Woo

Kim

et al. 2020

IEEE Access

View full text Add to dashboard Cite

As the number of cores equipped in a system grows, the impact of the spinlock waiting inside the operating system (OS) kernel on the performance and energy efficiency of a system worsens. In particular, it deteriorates the effectiveness of simultaneous multithreading (SMT). Because spinlocks are indispensable in OS kernels, it is necessary to suppress the spin wait overhead in the many-core systems. To address this issue, we propose the catnap spinlock that exploits the ACPI-C state, which is named as the catnap state and is induced by the MONITOR/MWAIT instruction pair. The catnap state releases the processor resources while deceiving the kernel that the thread is iterating a busy-waiting loop. Because entering and exiting from the C-state require considerable time, we applied the catnap loop only to the non-head waiters not to delay the lock handover operation. Furthermore, we selectively applied the catnap spinlock to the lock instances for sufficiently long critical sections based on the observation made in profiling runs. The proposed scheme was implemented in the Linux kernel and evaluated in a many-core processor system with a few workloads from the PARSEC and Re-aim benchmark suites. Our evaluation showed that the proposed scheme improved the performance by up to 33.59% and reduced energy consumption by 39.11%.

show abstract

“…MWAIT "returns" promptly after a modification of a monitored location. While waiting, the thread still occupies a CPU, but MWAIT [2,24] may allow the CPU to reach deeper sleep states. It also frees up pipeline resources more effectively than WRPAUSE.…”

Section: Unbounded Spinningmentioning

confidence: 99%

“…Our approach is palliative, but is often effective at avoiding or reducing scalability collapse, and in the worst case does no harm. Specifically, throughput is either unaffected or improved, and unfairness is bounded, relative to common test-and-set locks which allow unbounded bypass and starvation 2 . By reducing competition for shared resources, such as pipelines, processors and caches, concurrency restriction Robert Malthus [73] argued for population control, cautioning that societies would collapse as increasing populations competed for resources.…”

mentioning

confidence: 99%

Malthusian Locks

Dice

2017

Proceedings of the Twelfth European Conference on Computer Systems

View full text Add to dashboard Cite

Applications running in modern multithreaded environments are sometimes overthreaded. The excess threads do not improve performance, and in fact may act to degrade performance via scalability collapse 1 , which can manifest even when there are fewer ready threads than available cores. Often, such software also has highly contended locks. We leveropportunistically Possible counter-argument or counter-example -example of applications that might suffer from unfair CR admission. Imagine a "ragged barrier" which does not satisfy rendezvous conditions until all participating threads have completed 10 loop steps. Each step acquires and releases a contended CR-based lock within the loop. The time to reach rendezvous may be longer with a CR-based locks than with strict FIFO locks. Drafty DraftWe draw an analogy-metaphor between threads and members of the populace. We anthropomorphize threads.arrived thread * Require saturation and contention and waiting threads for a lock to be able to decide which threads will be admitted. Require surplus. * sideline; passivate; arrest; detain; sequester; deactivate; suspend; capture; * Lock lore; folk myth; received wisdom; practicum; praxis; * CR = MCSCRA8U; LIFOE3; FOXD family; FIFO = TKT; CLH; MCS; * LOITER = FOXD family MCSCR = MCSCRA8U LIFOCR = LIFOE3 * CR : mostly-LIFO admission order * STP = Spin-then-park waiting policy Candidates for the name of lock algorithm and the paper title. * Venturi Effect : restrict flow implies reduce pressure and faster flow velocity; * Performance diode -only improves; never degrades; * http://www.newyorker.com/magazine/2011/02/07/crush-point * MPL = Multiprogramming level; * Tragedy of the commons = rational maximizing behaviors by individuals results in unsustainable overexploitation of a resource. * Performance is subadditive * Slower-is-Faster phenomenon * Concurrency control overheads are ultimately proportional to contention instead of actual throughput. * MTTR = Median-time-to-reacquire * TTR is always >= waiting time, measured in acquisitions. * Myriad ways exist to measure short-and long-term fairness. * Under an ideally fair FIFO lock, admission order corresponds perfectly with arrival order. * Define unfairness as : how admission order deviates from FIFO or how admission order deviates from arrival order. * LWSS Window size W should be larger than maximum number of concurrent participating threads.CR acts to reduce the number of distinct threads circulating through the lock over short intervals and thus tends to reduce the LWSS, while still providing long-term fairness. The CR admission policy must also be work conserving and never under-provision the lock. It should never be the case that the critical section remains intentionally unoccupied if there are waiting or arriving threads that might enter -if such threads exist, then one will promptly be enabled to do so.If arriving or waiting threads might enter the critical section, then one will be promptly enabled to do so. * The admission policy for an optimal CR implementation attempts to...

show abstract

HPC runtime support for fast and power efficient locking and synchronization

Cited by 11 publications

References 6 publications

Progression of MPI Non-blocking Collective Operations Using Hyper-Threading

Progression of MPI Non-blocking Collective Operations Using Hyper-Threading

Catnap: A Backoff Scheme for Kernel Spinlocks in Many-Core Systems

Malthusian Locks

Contact Info

Product

Resources

About