Memory management for many-core processors with software configurable locality policies

Zhou, Junlan; Demsky, Brian

doi:10.1145/2258996.2259000

Cited by 18 publications

(6 citation statements)

References 28 publications

“…. , , current node n, cores per node C) (2) Populate [1 : ] with bytes in T.depend list; (3) if ( ) > ( )/ and V ( ) > 0 then (4) find with least NUMA distance-weighted cost to ; (5) enqueue( , T); (6) else (7) enqueue( , T); (9) find with least home cache latency cost to ; (10) enqueue( , T); (11) else (12) enqueue( , T); (13) end (14) end (15) else (16) enqueue( , T); (17) end (18) end Algorithm 3: Work-dealing algorithm for TILEPro64. spent waiting for memory by counting dispatch stall cycles which includes load/store unit stall cycles [13].…”

Section: Potential For Performance Improvements (mentioning)

confidence: 99%

“…Tousimojarad and Vanderbauwhede [33] cleverly reduce access latencies to uniformly distributed data by using copies whose home cache is local to the access thread on the TILEPro64 processor. Zhou and Demsky [2] build a NUMA-aware adaptive garbage collector that migrate objects to improve locality on manycore processors. We target standard OpenMP programs written in C which makes it difficult to migrate objects.…”

Section: Related Work (mentioning)

confidence: 99%

“…The latency of accessing far-off remote cache banks approaches off-chip memory access latencies. Another performance consideration is that cache coherence of manycore processors is software configurable [2]. Scheduling should adapt to remote cache bank access latencies that can change based on the configuration.…”

Section: Introduction (mentioning)

confidence: 99%

Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

Muddukrishna

Jönsson

Brorsson

2015

Scientific Programming

Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load-balancing at the expense of locality and ignore NUMA node/manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking of NUMA system/manycore processor architecture details by delegating data distribution to the runtime system and uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor and identify that data distribution and locality-aware task scheduling improve performance up to 69% for scientific benchmarks compared to default policies and yet provide an architecture-oblivious approach for programmers.
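The abstract above, together with the work-dealing excerpt quoted earlier ("find with least NUMA distance-weighted cost"), suggests a placement rule that weighs where a task's data lives against node distances. A minimal sketch of one such rule follows. It is an illustration under stated assumptions, not the cited authors' algorithm: the names `distance_weighted_cost` and `pick_node`, the per-node byte map, and the distance matrix values are all hypothetical.

```python
# Hypothetical sketch: choose the NUMA node that minimises the
# distance-weighted cost of reaching a task's dependence data.
# `bytes_per_node` maps node id -> bytes of the task's data homed there.
# `distance[a][b]` is an assumed relative access cost from node a to node b.

def distance_weighted_cost(candidate, bytes_per_node, distance):
    """Sum of bytes on each node, weighted by distance from `candidate`."""
    return sum(nbytes * distance[candidate][node]
               for node, nbytes in bytes_per_node.items())

def pick_node(bytes_per_node, distance):
    """Return the node id with the lowest distance-weighted cost."""
    return min(range(len(distance)),
               key=lambda n: distance_weighted_cost(n, bytes_per_node, distance))

# Illustrative 3-node distance matrix (values are made up).
DIST = [
    [10, 20, 30],
    [20, 10, 20],
    [30, 20, 10],
]

# A task with 100 bytes on node 0 and 400 bytes on node 2.
task_bytes = {0: 100, 2: 400}
chosen = pick_node(task_bytes, DIST)
```

With these made-up inputs the costs are 13000, 10000, and 7000 for nodes 0, 1, and 2, so the task would be enqueued on node 2, where most of its data resides.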

“…Classified as shared, however, the mapping reverts back to being statically mapped. Recent proposals have also explored replication [50,51], coherence protocol based optimization [52,53,54] and software configurable policies [55,56], trading implementation complexity for performance. With regard to cache line placement, this paper explores runtime modification of the home node and how to support it at the software-hardware interface.…”

Section: Related Work (mentioning)

confidence: 99%

Coherence domain restriction on large scale systems

Nguyen

Wentzlaff

2015

Proceedings of the 48th International Symposium on Microarchitecture

Designing massive scale cache coherence systems has been an elusive goal. Whether it be on large-scale GPUs, future thousand-core chips, or across million-core warehouse scale computers, having shared memory, even to a limited extent, improves programmability. This work sidesteps the traditional challenges of creating massively scalable cache coherence by restricting coherence to flexible subsets (domains) of a system's total cores and home nodes. This paper proposes Coherence Domain Restriction (CDR), a novel coherence framework that enables the creation of thousand to million core systems that use shared memory while maintaining low storage and energy overhead. Inspired by the observation that the majority of cache lines are only shared by a subset of cores either due to limited application parallelism or limited page sharing, CDR restricts the coherence domain from global cache coherence to VM-level, application-level, or page-level. We explore two types of restriction, one which limits the total number of sharers that can access a coherence domain and one which limits the number and location of home nodes that partake in a coherence domain. Each independent coherence domain only tracks the cores in its domain instead of the whole system, thereby removing the need for a coherence scheme built on top of CDR to scale. Sharer Restriction achieves constant storage overhead as core count increases while Home Restriction provides localized communication enabling higher performance. Unlike previous systems, CDR is flexible and does not restrict the location of the home nodes or sharers within a domain. We evaluate CDR in the context of a 1024-core chip and in the novel application of shared memory to a 1,000,000-core warehouse scale computer. Sharer Restriction results in significant area savings, while Home Restriction in the 1024-core chip and 1,000,000-core system increases performance

“…However, we still use interleaved spaces for the old and permanent generations, as these generations use a compacting algorithm. Zhou and Demsky [32] propose a NUMA-aware compaction algorithm, but this is out of our scope. Furthermore, our results show that using a fragmented space for the other generations is not required to make the garbage collector scale.…”

Section: Fragmented and Segregated Spaces (mentioning)

confidence: 99%

A study of the scalability of stop-the-world garbage collectors on multicores

Gidra, Lokesh

Thomas, Gaël

Sopena, Julien

et al.

2013

SIGARCH Comput. Archit. News

Large-scale multicore architectures create new challenges for garbage collectors (GCs). In particular, throughput-oriented stop-the-world algorithms demonstrate good performance with a small number of cores, but have been shown to degrade badly beyond approximately 8 cores on a 48-core with OpenJDK 7. This negative result raises the question whether the stop-the-world design has intrinsic limitations that would require a radically different approach. Our study suggests that the answer is no, and that there is no compelling scalability reason to discard the existing highly-optimised throughput-oriented GC code on contemporary hardware. This paper studies the default throughput-oriented garbage collector of OpenJDK 7, called Parallel Scavenge. We identify its bottlenecks, and show how to eliminate them using well-established parallel programming techniques. On the SPECjbb2005, SPECjvm2008 and DaCapo 9.12 benchmarks, the improved GC matches the performance of Parallel Scavenge at low core count, but scales well, up to 48 cores.
