Fence-free work stealing on bounded TSO processors

Morrison, Adam; Afek, Yehuda

doi:10.1145/2541940.2541987

Cited by 15 publications

(21 citation statements)

References 30 publications


“…Targeting scheduling systems for task-based programs, a large amount of prior work aims to improve energy-efficiency [38,41], to improve data locality [9,10], or to reduce scheduling overhead [17,29]. However, with the increasing bandwidth requirements of computing tasks, many papers have also conducted related research for efficient bandwidth usage.…”

Section: Related Work (mentioning)

confidence: 99%

Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory

Zhao, Chen, Qiu, et al. 2018

ACM Trans. Archit. Code Optim.


Parallel computers now start to adopt Bandwidth-Asymmetric Memory architecture that consists of traditional DRAM memory and new High Bandwidth Memory (HBM) for high memory bandwidth. However, existing task schedulers suffer from low bandwidth usage and poor data locality problems in bandwidth-asymmetric memory architectures. To solve the two problems, we propose a Bandwidth and Locality Aware Task-stealing (BATS) system, which consists of an HBM-aware data allocator, a bandwidth-aware traffic balancer, and a hierarchical task-stealing scheduler. Leveraging compile-time code transformation and run-time data distribution, the data allocator enables HBM usage automatically without user interference. According to data access hotness, the traffic balancer migrates data to balance memory traffic across memory nodes proportional to their bandwidth. The hierarchical scheduler improves data locality at runtime without a priori program knowledge. Experiments on an Intel Knights Landing server that adopts bandwidth-asymmetric memory show that BATS reduces the execution time of memory-bound programs up to 83.5% compared with traditional task-stealing schedulers.
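As a rough illustration of what "balance memory traffic across memory nodes proportional to their bandwidth" could mean arithmetically, the sketch below splits a total traffic volume across nodes in proportion to each node's bandwidth. The function name `target_traffic`, the node labels, and the bandwidth figures are hypothetical and do not come from the BATS paper; this is not the authors' balancer.

```python
def target_traffic(total_traffic: float, bandwidth: dict[str, float]) -> dict[str, float]:
    """Split total_traffic across nodes in proportion to each node's bandwidth.

    Hypothetical helper for illustration only; not the BATS algorithm.
    """
    total_bw = sum(bandwidth.values())
    if total_bw <= 0:
        raise ValueError("total bandwidth must be positive")
    return {node: total_traffic * bw / total_bw for node, bw in bandwidth.items()}


# Made-up bandwidth figures (arbitrary units), chosen only to show the ratio.
example_bandwidth = {"dram": 100.0, "hbm": 400.0}
example_targets = target_traffic(1000.0, example_bandwidth)
```

With these made-up figures, the node with four times the bandwidth is assigned four times the traffic.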


“…We further adapt the echo method [29] to make non-owner lock acquisition speed comparable to standard locks, assuming the owner acquires the lock frequently. Because our lock does not rely on blocking safe points, it (1) can be used in C/C++ programs, which do not naturally define safe points, and (2) enables non-owner acquisition even if the owner is scheduled out or delayed.…”

Section: Safe Memory Reclamation (§ 4) (mentioning)

confidence: 99%

“…This signals to T1 that T0 is waiting to acquire L, so T1 can stop the ∆ delay and enter the critical section. To implement this notification we use echoing [29]: We expand the flags to 64 bits, 63 of which are used as version numbers that uniquely identify each write; whenever T1 writes to flag1, it increases flag1's version. T0 uses this version to notify T1 that it is spinning while trying to acquire L, by writing (or echoing) what it reads from flag1 into flag0 (Lines 59-63).…”

Section: FFBL Algorithm (mentioning)

confidence: 99%
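The echoing idea quoted above can be sketched as a small single-threaded simulation. It shows only the flag encoding (a 63-bit version plus one value bit in a 64-bit word) and the echo step; it does not model store buffers, TSO, or real concurrency. The helper names (`pack`, `owner_write`, `non_owner_echo`, `owner_sees_echo`) and the bit layout are assumptions made for illustration, not the citing paper's code.

```python
MASK64 = (1 << 64) - 1


def pack(version: int, value: int) -> int:
    """Pack a 63-bit version and a 1-bit flag value into one 64-bit word (assumed layout)."""
    return ((version << 1) | (value & 1)) & MASK64


def version_of(word: int) -> int:
    return word >> 1


class Flags:
    """flag1 is written by T1 (the owner); flag0 is written by T0 (the non-owner)."""

    def __init__(self) -> None:
        self.flag0 = 0
        self.flag1 = 0


def owner_write(flags: Flags, value: int) -> int:
    """T1 writes flag1 and bumps its version, so every write is uniquely identified."""
    flags.flag1 = pack(version_of(flags.flag1) + 1, value)
    return flags.flag1


def non_owner_echo(flags: Flags) -> int:
    """T0 echoes whatever it currently reads from flag1 into flag0."""
    seen = flags.flag1
    flags.flag0 = seen
    return seen


def owner_sees_echo(flags: Flags, last_write: int) -> bool:
    """Assumed check: T1 finds its own latest write echoed back in flag0."""
    return flags.flag0 == last_write
```

Because each owner write carries a fresh version, an echo of an older write cannot be mistaken for an echo of the latest one, which is what lets the owner treat a matching echo as a notification that the non-owner is waiting.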

“…Thus, in hazard pointers for example, we only add work to the reclamation path and not to every object retirement. (In contrast, the bounded TSO[S] model [29], which does not have a global clock, would require reading A when retiring an object.)…”

Section: Adapting TBTSO Algorithms to x86 with OS Help (mentioning)

confidence: 99%

“…While all biased locks collapse compared to pthreads in this case, we see the benefit of bounded delay: the FFBL outperform the safe point lock by 50× (with echoes) and by 7× (without echoes). [29]. In contrast, TBTSO's temporal reordering bound facilitates nonblocking synchronization without relaxing semantics, making it more broadly applicable.…”

Section: Biased Locks (mentioning)

confidence: 99%


Temporally Bounding TSO for Fence-Free Asymmetric Synchronization

Morrison, Afek 2015

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems

Self Cite


This paper introduces a temporally bounded total store ordering (TBTSO) memory model, and shows that it enables nonblocking fence-free solutions to asymmetric synchronization problems, such as those arising in memory reclamation and biased locking. TBTSO strengthens the TSO memory model by bounding the time it takes a store to drain from the store buffer into memory. This bound enables devising fence-free algorithms for asymmetric problems, which require a performance-critical fast path to synchronize with an infrequently executed slow path. We demonstrate this by constructing (1) a fence-free version of the hazard pointers memory reclamation scheme, and (2) a fence-free biased lock algorithm which is compatible with unmanaged environments as it does not rely on safe points or similar mechanisms. We further argue that TBTSO can be implemented in hardware with modest modifications to existing TSO architectures. However, our design makes assumptions about proprietary implementation details of commercial hardware; it thus best serves as a starting point for a discussion on the feasibility of hardware TBTSO implementation. We also show how minimal OS support enables the adaptation of TBTSO algorithms to x86 systems.
