While there have been decades of work on developing automatic, locality-enhancing transformations for regular programs that operate over dense matrices and arrays, there has been little investigation of such transformations for irregular programs, which operate over pointer-based data structures such as graphs, trees, and lists. In this paper, we argue that, for a class of irregular applications we call traversal codes, there exists substantial data reuse and hence opportunity for locality exploitation. We develop a novel optimization called point blocking, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in traversal codes. We then present a transformation and optimization framework called TreeTiler that automatically detects opportunities for applying point blocking and applies the transformation. TreeTiler uses autotuning techniques to determine appropriate parameters for the transformation. For a series of traversal algorithms drawn from real-world applications, we show that TreeTiler delivers performance improvements of up to 245% over an optimized (but non-transformed) parallel baseline and, in several cases, significantly better scalability.
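To make the point blocking idea concrete, below is a minimal C++ sketch. The `Point`, `Node`, `can_truncate`, and `interact` names are hypothetical placeholders standing in for the domain-specific pieces of a real traversal code such as Barnes-Hut; the sketch illustrates the general shape of the transformation, not TreeTiler's actual output.

```cpp
#include <vector>

// Hypothetical point and tree-node types; a real benchmark such as Barnes-Hut
// carries more state (mass, center of mass, bounding box, ...) per node.
struct Point { float x, y, z; float result = 0.0f; };
struct Node  { Node *left, *right; float summary; };

// Assumed domain-specific helpers (trivial stand-ins for illustration only).
bool can_truncate(const Node *n, const Point &p) { return n->summary < p.x; }
void interact(const Node *n, Point &p)           { p.result += n->summary; }

// Baseline: one full traversal per point. Each traversal can touch a large
// part of the tree, so the tree is evicted from cache between points.
void traverse(Node *n, Point &p) {
    if (n == nullptr || can_truncate(n, p)) return;
    interact(n, p);
    traverse(n->left, p);
    traverse(n->right, p);
}

// Point blocking (sketch): carry a block of points down the tree together, so
// each node brought into cache is reused by every point still "alive" at it.
// The block size is the kind of tunable parameter autotuning would choose.
void traverse_blocked(Node *n, const std::vector<Point*> &block) {
    if (n == nullptr || block.empty()) return;
    std::vector<Point*> alive;
    for (Point *p : block) {
        if (can_truncate(n, *p)) continue;  // this point's traversal stops here
        interact(n, *p);
        alive.push_back(p);
    }
    traverse_blocked(n->left,  alive);
    traverse_blocked(n->right, alive);
}
```

The benefit comes from reuse: each node brought into cache is visited by every point in the block that still needs it, rather than once per point with the rest of the tree streamed through cache in between.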
Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely tree traversal algorithms such as Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However, point blocking is highly dependent on point sorting, a technique that reorders points so that consecutive points have similar traversals. Performing this a priori sort requires an understanding of the algorithm's semantics and hence relies on highly application-specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes that is less sensitive to point order and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing delivers single-thread speedups of up to 9.147 (geometric mean: 3.095) over baseline implementations and up to 4.752 (geometric mean: 2.079) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields better performance than carefully hand-optimized implementations.
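The following is a deliberately crude, one-level approximation of the reordering idea behind traversal splicing, reusing the hypothetical `Node`, `Point`, `can_truncate`, `interact`, and `traverse` definitions from the sketch above. The published technique suspends and resumes traversals at many dynamically chosen splice nodes; this fragment only conveys the basic intuition.

```cpp
// One-level splice at the root (illustration only): instead of letting each
// point run its entire traversal before the next point starts, suspend every
// point's traversal after the root and resume the suspended traversals one
// subtree at a time. Points that touch the same subtree then touch it
// back-to-back, independent of the input order of the points.
void traverse_spliced(Node *root, const std::vector<Point*> &points) {
    if (root == nullptr) return;
    std::vector<Point*> suspended;
    for (Point *p : points) {
        if (can_truncate(root, *p)) continue;
        interact(root, *p);
        suspended.push_back(p);   // remaining traversal deferred, not run now
    }
    for (Point *p : suspended) traverse(root->left,  *p);  // resume: left subtree
    for (Point *p : suspended) traverse(root->right, *p);  // resume: right subtree
}
```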
With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy is well-understood for regular applications that operate on dense data structures such as arrays and matrices. However, there has been significantly less work in the area of irregular algorithms and even less so when pointer-based dynamic data structures are involved. Recently, irregular algorithms such as Barnes-Hut and kd-tree traversals have been implemented on GPUs, yielding significant performance gains over CPU implementations. However, the implementations often rely on exploiting application-specific semantics to get acceptable performance. We argue that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38× over 32-thread CPU versions.
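As a purely illustrative aside on what mapping such traversals to a GPU typically involves, the sketch below (again reusing the hypothetical types from the first sketch) rewrites the recursive per-point traversal with an explicit stack, the kind of restructuring usually required before assigning one point per GPU thread. It is not the specific technique of the work described above.

```cpp
// Iterative, stack-based form of the per-point traversal. Deep per-thread
// recursion is a poor fit for GPUs, so traversals are commonly flattened into
// an iterative form like this before each thread walks the tree for its point.
void traverse_iterative(Node *root, Point &p) {
    std::vector<Node*> stack;
    stack.push_back(root);
    while (!stack.empty()) {
        Node *n = stack.back();
        stack.pop_back();
        if (n == nullptr || can_truncate(n, p)) continue;
        interact(n, p);
        stack.push_back(n->right);  // pushed second so the left child is visited first
        stack.push_back(n->left);
    }
}
```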
The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units, as well as accelerators using Intel's AVX-512 units.
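A small sketch of the flavor of such a transformation, on a toy recursive tree-sum rather than any of the paper's benchmarks: the recursion is re-expressed over an explicit frontier of pending tasks, so the work at each step becomes a flat, data-parallel loop amenable to vectorization. The frontier-based scheme here is an assumption made for illustration, not the paper's actual transformation or scheduling policy.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical recursive, task-parallel computation: summing a binary tree.
struct TNode { TNode *left, *right; int64_t val; };

// Natural recursive form: one task per node, no obvious data parallelism.
int64_t tree_sum(const TNode *n) {
    if (n == nullptr) return 0;
    return n->val + tree_sum(n->left) + tree_sum(n->right);
}

// Re-expressed over an explicit frontier of pending tasks. Each iteration of
// the inner loop is independent of the others, so it can be executed with
// SIMD instructions rather than one scalar recursive call at a time.
int64_t tree_sum_frontier(const TNode *root) {
    int64_t total = 0;
    std::vector<const TNode*> frontier{root};
    while (!frontier.empty()) {
        std::vector<const TNode*> next;
        for (const TNode *n : frontier) {   // data-parallel across pending tasks
            if (n == nullptr) continue;
            total += n->val;
            next.push_back(n->left);
            next.push_back(n->right);
        }
        frontier.swap(next);
    }
    return total;
}
```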