Exploiting recent SIMD architectural advances for irregular applications

Chen, Linchuan; Jiang, Peng; Agrawal, Gagan

doi:10.1145/2854038.2854046

Cited by 36 publications

(14 citation statements)

References 31 publications


“…For stripe splitting mode, we need to run a lock-based code for tiles in the same stripe as we mentioned before. In SIMD-level, as opposed to previous efforts based on a heavy data reorganization preprocessing [11], we rely on the built-in conflicts detection intrinsics provided by Xeon Phi to dynamically address the possible update conflicts.…”

Section: Update Conflict (mentioning)

confidence: 99%

“…GraphPhi focuses on a different throughput-oriented architecture and explores many unique features that are not shown on GPUs. There are also some optimization techniques for Xeon Phi [6,11,22]. Although they comprehensively explore advanced SIMD execution, none of them offer a general graph processing framework by effectively exploiting both MIMD and SIMD execution, or emerging HBM techniques.…”

Section: Related Work (mentioning)

confidence: 99%

“…There exist several graph processing frameworks and libraries based on popular many-core processors such as GPUs [23,44] and early versions of Xeon Phis [11,22,29]. In addition, many other graph processing frameworks [33,40] designed for shared-memory multi-core CPUs are also capable of running on Xeon Phis owing to their x86-compatibility.…”

Section: Introduction (mentioning)

confidence: 99%


GraphPhi

Peng, Powell, et al. 2018

Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques


Modern parallel architecture design has increasingly turned to throughput-oriented devices to address concerns about energy efficiency and power consumption. However, graph applications cannot tap into the full potential of such architectures because of highly unstructured computations and irregular memory accesses. In this paper, we present GraphPhi, a new approach to graph processing on emerging Intel Xeon Phi-like architectures, by addressing the restrictions of migrating existing graph processing frameworks on shared-memory multi-core CPUs to this new architecture. Specifically, GraphPhi consists of 1) an optimized hierarchically blocked graph representation to enhance the data locality for both edges and vertices within and among threads, 2) a hybrid vertex-centric and edge-centric execution to efficiently find and process active edges, and 3) a uniform MIMD-SIMD scheduler integrated with a lock-free update support to achieve both good thread-level load balance and SIMD-level utilization. Besides, our efficient MIMD-SIMD execution is capable of hiding memory latency by increasing the number of concurrent memory access requests, thus benefiting more from the latest High-Bandwidth Memory technique. We evaluate our GraphPhi on six graph processing applications. Compared to two state-of-the-art shared-memory graph processing frameworks, it results in speedups up to 4X and 35X, respectively.


“…Green‐Marl is a domain specific language for graph processing. Chen et al proposed compiler optimization methodology for graph and other irregular applications on Intel Xeon Phi coprocessors. Ahn et al developed a customized processing‐in‐memory (PIM) accelerator for large‐scale graph processing.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient and high‐quality sparse graph coloring on GPUs

Chen, Fang, et al. 2016

Concurrency and Computation


Summary: Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large‐scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work‐efficient parallel graph coloring implementation on GPUs with good coloring quality. Our approach uses the speculative greedy scheme, which inherently yields better quality than the method of finding maximal independent set. To achieve high performance on GPUs, we refine the algorithm to leverage efficient operators and alleviate conflicts. We also incorporate common optimization techniques to further improve performance. Our method is evaluated with both synthetic and real‐world sparse graphs on the NVIDIA GPU. Experimental results show that our proposed implementation achieves an average 4.1× (up to 8.9×) speedup over the serial implementation. It also outperforms the existing GPU implementation from the NVIDIA CUSPARSE library (2.2× average speedup), while yielding much better coloring quality than CUSPARSE.


“…al. propose a general optimization technique for data-parallel problems with indirect memory accesses [1], by viewing the problem as a sparse matrix computation. Yavors'kii and Weigel identified that tiled computations, such as the ones found in spin systems, are greatly improved by re-organizing the memory in blocks [12].…”

Section: Related Work (mentioning)

confidence: 99%

Potential benefits of a block-space GPU approach for discrete tetrahedral domains

Navarro, Bustos, Hitschfeld 2016

2016 XLII Latin American Computing Conference (CLEI)


The study of data-parallel domain re-organization and thread-mapping techniques is a relevant topic, as these techniques can increase the efficiency of GPU computations when working on spatial discrete domains with non-box-shaped geometry. In this work we study the potential benefits of applying a succinct data re-organization of a tetrahedral data-parallel domain of size $O(n^3)$ combined with an efficient block-space GPU map of the form $g(\lambda) : \mathbb{N} \to \mathbb{N}^3$. Results from the analysis suggest that, in theory, the combination of these two optimizations produces a significant performance improvement, as block-based data re-organization allows a coalesced one-to-one correspondence at local thread-space while $g(\lambda)$ produces an efficient block-space spatial correspondence between groups of data and groups of threads, reducing the number of unnecessary threads from $O(n^3)$ to $O(n^2 \rho^3)$, where $\rho$ is the linear block-size and typically $\rho^3 \ll n$. From the analysis, we obtained that a block-based succinct data re-organization can provide up to 2× improved performance over a linear data organization, while the map can be up to 6× more efficient than a bounding box approach. The results from this work can serve as a useful guide for a more efficient GPU computation on tetrahedral domains found in spin lattice, finite element and special n-body problems, among others.
