Trading cache hit rate for memory performance

Ding, Wei; Kandemir, Mahmut; Guttman, Diana; Jog, Adwait; Das, Chita R.; Yedlapalli, Praveen

doi:10.1145/2628071.2628082

Cited by 2 publications

(2 citation statements)

References 38 publications


“…Expensive locality reorganization methods have not been able to amortize costs within a single iteration, and they are limited to applications that repeatedly process the same references, e.g., rearranging the index array [17,32] and remapping all arrays in a loop, or graph partitioning [22,23] or cheaper reorderings with lower benefits (e.g., space filling curves). Recent inspector/executor work [18] traded lower cache hit rate for improvement of DRAM row buffer hits for 14% net gains. Milk achieves up to 4× gains on static reference loops, and pays off in one iteration to also allow dynamic references.…”

Section: Related Work (mentioning)

confidence: 99%

“…In Delivery, each partition's deferred updates are read from DRAM and processed, along with dependent statements. These three logical phases are similar to inspector-executor style optimizations [18,32,40]; in Milk, however, to eliminate materialization of partitions and conserve DRAM bandwidth, the phases are fused and run as coroutines. Prior research either focused on expensive preprocessing that resulted in net performance gain only when amortized over many loop executions, or explored simple inspection for correspondingly modest gains.…”

Section: Introduction (mentioning)

confidence: 99%


Optimizing Indirect Memory References with milk

Kiriansky, Zhang, Amarasinghe (2016)

Proceedings of the 2016 International Conference on Parallel Architectures and Compilation


Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap. In this paper, we introduce milk, a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the Milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3×.
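The idea of turning random indirect references into batched accesses can be illustrated with a minimal sketch. This is not Milk's syntax or its implementation: the function names, the `partition_size` parameter, and the use of Python are all assumptions made here purely to show the logic, and the sketch materializes its partitions, whereas the citing text says Milk fuses the phases and runs them as coroutines. It shows only that deferring updates into per-partition buckets and delivering them one partition at a time gives the same result as applying them in the original order.

```python
from collections import defaultdict


def scatter_add_direct(values, indices, table_size):
    """Baseline: apply each update in input order (indirect, scattered access)."""
    table = [0] * table_size
    for idx, val in zip(indices, values):
        table[idx] += val
    return table


def scatter_add_batched(values, indices, table_size, partition_size):
    """Hypothetical batched variant: defer updates, then deliver per partition."""
    # Phase 1: bucket each deferred update by the partition its target falls in.
    buckets = defaultdict(list)
    for idx, val in zip(indices, values):
        buckets[idx // partition_size].append((idx, val))

    # Phase 2: deliver updates one partition at a time, in address order,
    # so each batch touches only a contiguous slice of the table.
    table = [0] * table_size
    for part in sorted(buckets):
        for idx, val in buckets[part]:
            table[idx] += val
    return table
```

Because addition is commutative, reordering the updates this way does not change the final table; for operations that are not commutative, the order within and across partitions would need separate care.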
