Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems 1996
DOI: 10.1145/237090.237151
Thread scheduling for cache locality

Abstract: This paper describes a method to improve the cache locality of sequential programs by scheduling fine-grained threads. The algorithm relies upon hints provided at the time of thread creation to determine a thread execution order likely to reduce cache misses. This technique may be particularly valuable when compiler-directed tiling is not feasible. Experiments with several application programs, on two systems with different cache structures, show that our thread scheduling method can improve program performance…
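The abstract only outlines the idea, so the following is a minimal sketch of what hint-driven ordering can look like; it is not the paper's implementation. Fine-grained units of work are queued together with a hint (the address of the data each will touch), and a sequential scheduler sorts the queue by hint so work on the same data runs back to back. The names thread_t, spawn_with_hint, and run_all, and the sequential execution model, are assumptions for illustration.

/* A minimal sketch, not the paper's implementation: threads carry an
 * address hint at creation time, and execution order is derived from
 * the hints rather than from creation order. */
#include <stdio.h>
#include <stdlib.h>

#define NELEMS  (1 << 16)          /* elements in the shared array      */
#define BLOCK   256                /* elements touched by one thread    */
#define NBLOCKS (NELEMS / BLOCK)

typedef struct {
    void (*fn)(void *);            /* thread body                       */
    void *arg;                     /* its argument                      */
    void *hint;                    /* address the body expects to read  */
} thread_t;

static thread_t queue[4 * NBLOCKS];
static int nqueued;

static void spawn_with_hint(void (*fn)(void *), void *arg, void *hint) {
    queue[nqueued++] = (thread_t){ fn, arg, hint };
}

/* Order pending threads by the address they hinted at. */
static int by_hint(const void *a, const void *b) {
    const char *x = ((const thread_t *)a)->hint;
    const char *y = ((const thread_t *)b)->hint;
    return (x > y) - (x < y);
}

static void run_all(void) {
    /* Execution order comes from the hints, not from creation order. */
    qsort(queue, nqueued, sizeof(thread_t), by_hint);
    for (int i = 0; i < nqueued; i++)
        queue[i].fn(queue[i].arg);
    nqueued = 0;
}

static double data[NELEMS], total;

static void sum_block(void *arg) {      /* one fine-grained thread body */
    double *p = arg;
    for (int i = 0; i < BLOCK; i++)
        total += p[i];
}

int main(void) {
    /* Two sweeps over the array create the threads block by block in
     * creation order; after sorting by hint, the two threads that read
     * the same block run consecutively, so the second finds it cached. */
    for (int pass = 0; pass < 2; pass++)
        for (int b = 0; b < NBLOCKS; b++)
            spawn_with_hint(sum_block, &data[b * BLOCK], &data[b * BLOCK]);
    run_all();
    printf("total = %f\n", total);
    return 0;
}

In this toy version the hints only reorder a sequential run so that both reads of a block occur while the block is likely still cached; the paper's runtime applies the same ordering idea to real fine-grained threads when compiler-directed tiling is not feasible.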

Cited by 69 publications (1997–2014, 37 citation statements); references 36 publications.

Citation statements:
“…In [15], Philbin et al reordered loops in sequential applications to improve locality, using information about data accesses. In the realm of task-parallelism, Chen et al proposed scheduling concurrent execution in order to promote cache sharing on CMPs [2].…”
Section: Related Work (mentioning)
confidence: 99%
“…While their approach is not particularly well-suited to non-real-time systems, their micro-benchmark results do indicate that intelligent co-scheduling of cooperative threads can reduce the number of L2 misses substantially. Philbin et al [30] studied the possibility of reducing cache misses for sequential programs through intelligent scheduling of fine-grained threads. Their approach relies on memory access hints in the program to identify threads that should execute in close temporal proximity in order to promote cache re-use.…”
Section: Related Work (mentioning)
confidence: 99%
“…For instance, Philbin et al [11] formalise the problem of locality-aware thread scheduling for a single-core processor. In other work by Tam et al [14], threads are grouped based on data-locality for multi-threaded multi-core processors, introducing a metric of thread similarity.…”
Section: Related Work (mentioning)
confidence: 99%