Using memory mapping to support cactus stacks in work-stealing runtime systems

Lee, I-Ting Angelina; Boyd-Wickizer, Silas; Huang, Zhiyi; Leiserson, Charles E.

doi:10.1145/1854273.1854324

Cited by 34 publications

(30 citation statements)

References 29 publications


“…Nonetheless, the internal data structure size grows moderately as the thread count increases by about 18 KB per thread for CilkMR and about 1 KB per thread for Phoenix++. This difference arises because CilkMR requires a varying number of stacks depending on how work stealing progresses [21].…”

Section: Memory Consumption (mentioning)

confidence: 99%

A scalable and composable map-reduce system

Arif

Vandierendonck

Nikolopoulos

et al. 2016

2016 IEEE International Conference on Big Data (Big Data)



“…This section describes thread-local memory mapping (TLMM) [22,23], which Cilk-M uses to cause the virtual-memory hardware to map reducers to local views. TLMM provides an efficient and flexible way for a thread to map certain regions of virtual memory independently from other threads while still sharing most of its virtual-memory address space.…”

Section: Thread-local Memory Mapping (mentioning)

confidence: 99%

“…Cilk-M's TLMM mechanism [22,23] was originally developed to enable a work-stealing runtime system to maintain a "cactus-stack" abstraction, thereby allowing arbitrary calling between parallel and serial code. To implement TLMM, we modified the Linux kernel, producing a kernel version we call TLMM-Linux.…”

Section: Thread-local Memory Mapping (mentioning)

confidence: 99%

“…The operating-system support employs thread-local memory mapping (TLMM) [23]. TLMM enables the virtual-memory … [Figure caption: The relative overhead for ordinary L1-cache memory accesses, memory-mapped reducers, hypermap reducers, and locking.]…”

Section: Introduction (mentioning)

confidence: 99%

“…This support for efficient view transferal allows workers to perform reductions without extra memory mapping. We implemented our memory-mapping strategy by modifying Cilk-M [23], a Cilk runtime system that employs TLMM to manage the "cactus stack," to make it a plug-in replacement for Intel's Cilk Plus runtime system. That is, we modified the Cilk-M runtime system to replace the native Cilk runtime system shipped with Intel's C++ compiler by making Cilk-M conform to the Intel Cilk Plus Application Binary Interface (ABI) [17].…”

Section: Introduction (mentioning)

confidence: 99%


Memory-mapping support for reducer hyperobjects

Lee

Shafi

Leiserson

2012

Proceedings of the Twenty-Fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures

Self Cite


Reducer hyperobjects (reducers) provide a linguistic abstraction for dynamic multithreading that allows different branches of a parallel program to maintain coordinated local views of the same nonlocal variable. In this paper, we investigate how thread-local memory mapping (TLMM) can be used to improve the performance of reducers. Existing concurrency platforms that support reducer hyperobjects, such as Intel Cilk Plus and Cilk++, take a hypermap approach in which a hash table is used to map reducer objects to their local views. The overhead of the hash table is costly: roughly 12× overhead compared to a normal L1-cache memory access on an AMD Opteron 8354. We replaced the Intel Cilk Plus runtime system with our own Cilk-M runtime system, which uses TLMM to implement a reducer mechanism that supports a reducer lookup using only two memory accesses and a predictable branch, which is roughly a 3× overhead compared to an ordinary L1-cache memory access. An empirical evaluation shows that the Cilk-M memory-mapping approach is close to 4× faster than the Cilk Plus hypermap approach. Furthermore, the memory-mapping approach admits better locality than the hypermap approach during parallel execution, which allows an application using reducers to scale better.
