Cache-conscious scheduling of streaming applications

Agrawal, Kunal; Fineman, Jeremy T.; Krage, Jordan; Leiserson, Charles E.; Toledo, Sivan

doi:10.1145/2312005.2312049

Cited by 10 publications

(5 citation statements)

References 29 publications


“…[3] present theoretical cache miss limits when scheduling streaming applications represented as directed graphs on uniprocessors. Their work shows that scheduling the graph by selecting partitions comes within a constant factor of the optimal scheduler when heuristics such as working set and data usage rates are known in advance.…”

Section: Improving Cache Efficiency (mentioning)

confidence: 99%

Cache-Conscious Wavefront Scheduling

Rogers

O'Connor

Aamodt

2012

2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

363

315


This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average 25% fewer L1 data cache misses which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.


“…Closer to our objective of optimizing the parallel execution time, another formulation of the DAG partitioning problem arises in exposing parallelism in automatic differentiation [4,Ch.9], and in general, in the computation of the Newton step for solving nonlinear systems [5]. Other important applications of the DAG partitioning problem include (i) fusing loops for improving temporal locality, and enabling streaming and array contractions in runtime systems [6], such as Bohrium [7]; (ii) analysis of cache efficient execution of streaming applications on uniprocessors [8].…”

Section: Introduction (mentioning)

confidence: 99%

Acyclic Partitioning of Large Directed Acyclic Graphs

Herrmann

Kho

Uçar

et al. 2017

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)


Finding a good partition of a computational directed acyclic graph associated with an algorithm can help find an execution pattern improving data locality, conduct an analysis of data movement, and expose parallel steps. The partition is required to be acyclic, i.e., the inter-part edges between the vertices from different parts should preserve an acyclic dependency structure among the parts. In this work, we adopt the multilevel approach with coarsening, initial partitioning, and refinement phases for acyclic partitioning of directed acyclic graphs and develop a direct k-way partitioning scheme. To the best of our knowledge, no such scheme exists in the literature. To ensure the acyclicity of the partition at all times, we propose novel and efficient coarsening and refinement heuristics. The quality of the computed acyclic partitions is assessed by computing the edge cut, the total volume of communication between the parts, and the critical path latencies. We use the solution returned by well-known undirected graph partitioners as a baseline to evaluate our acyclic partitioner, knowing that the space of solution is more restricted in our problem. The experiments are run on large graphs arising from linear algebra applications.
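The acyclicity requirement and the edge-cut metric described in this abstract can be illustrated with a small sketch. This is not the multilevel partitioner from the cited paper; it only checks a given partition. The function names (`quotient_edges`, `is_acyclic_partition`, `edge_cut`) and the toy graph are hypothetical, chosen for illustration, and the check uses a standard topological sort (Kahn's algorithm) on the graph obtained by collapsing each part to a single node.

```python
from collections import defaultdict, deque


def quotient_edges(edges, part):
    """Collapse each part to one node; keep only edges that cross parts."""
    return {(part[u], part[v]) for u, v in edges if part[u] != part[v]}


def is_acyclic_partition(edges, part):
    """Return True if the inter-part edges form no cycle among the parts."""
    parts = set(part.values())
    indegree = {p: 0 for p in parts}
    successors = defaultdict(list)
    for a, b in quotient_edges(edges, part):
        successors[a].append(b)
        indegree[b] += 1

    # Kahn's algorithm: repeatedly remove parts with no incoming edges.
    ready = deque(p for p in parts if indegree[p] == 0)
    visited = 0
    while ready:
        p = ready.popleft()
        visited += 1
        for s in successors[p]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    # Every part was removed exactly when the quotient graph has no cycle.
    return visited == len(parts)


def edge_cut(edges, part):
    """Count edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


# Hypothetical toy DAG: a -> b -> d and a -> c -> d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

# Parts {a, b} and {c, d}: all crossing edges go from part 0 to part 1.
good = {"a": 0, "b": 0, "c": 1, "d": 1}

# Parts {a, c, d} and {b}: a -> b leaves part 0, b -> d re-enters it.
bad = {"a": 0, "b": 1, "c": 0, "d": 0}

print(is_acyclic_partition(edges, good), edge_cut(edges, good))
print(is_acyclic_partition(edges, bad), edge_cut(edges, bad))
```

Both toy partitions cut the same number of edges, yet only the first preserves an acyclic dependency structure among the parts, which is why an undirected partitioner that minimizes edge cut alone searches a larger solution space than an acyclic one.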


“…First, we plan to experiment our monitoring approach over platforms for which memory latency vary more and for which the placement decisions will have a greater impact. This will be a strong complement to existing compilation strategies that already take the underlying memory hierarchy into account, eg [25], [22], [11], [10], [1].…”

Section: Discussion (mentioning)

confidence: 99%

A Monitoring System for Runtime Adaptations of Streaming Applications

Selva

Morel

Marquet

et al. 2015

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing


Streaming languages are adequate for expressing many applications quite naturally and have been proven to be a good approach for taking advantage of the intrinsic parallelism of modern CPU architectures. While numerous works focus on improving the throughput of streaming programs, we rather focus on satisfying quality-of-service requirements of streaming applications executed alongside non-streaming processes. We monitor synchronous dataflow (SDF) programs at runtime both at the application and system levels in order to identify violations of quality-of-service requirements. Our monitoring requires the programmer to provide the expected throughput of the application (e.g., 25 frames per second for a video decoder), then takes full benefit from the compilation of the SDF graph to detect bottlenecks in this graph and identify causes among processor or memory overloading. It can then be used to perform dynamic adaptations of the applications in order to optimize the use of computing and memory resources.
