Using MIC to Accelerate a Typical Data-Intensive Application: The Breadth-first Search

Gao, Tao; Lu, Yutong; Suo, Guang

doi:10.1109/ipdpsw.2013.197

Cited by 20 publications

(17 citation statements)

References 16 publications

“…The IWPP resembles graph scan algorithms with multiple sources, which have been the target of a number of recent research projects that implemented, for instance, Breadth-First Search (BFS). 22,23 Hong et al 22 presented approaches to minimize the load imbalance that occurs when processing graphs in which the number of edges may vary from vertices. Tao et al 23 Roman, 25 while other works used devices such as Field-Programmable Gate Arrays (FPGAs) and GPUs to implement this operation.…”

Section: Related Work (mentioning)

confidence: 99%

“…22,23 Hong et al 22 presented approaches to minimize the load imbalance that occurs when processing graphs in which the number of edges may vary from vertices. Tao et al 23 Roman, 25 while other works used devices such as Field-Programmable Gate Arrays (FPGAs) and GPUs to implement this operation. [26][27][28] A common limitation with these solutions is that they were not built on top of the must efficient sequential algorithm that uses queues.…”

Section: Related Work (mentioning)

confidence: 99%

Cooperative and out‐of‐core execution of the irregular wavefront propagation pattern on hybrid machines with Intel® Xeon Phi™

Gomes, Melo, Kong, et al. 2018

Concurrency and Computation

Summary: The Irregular Wavefront Propagation Pattern (IWPP) is a core computing structure in several image analysis operations. Efficient implementation of IWPP on the Intel Xeon Phi is difficult because of the irregular data access and computation characteristics. The traditional IWPP algorithm relies on atomic instructions, which are not available in the SIMD set of the Intel Phi. To overcome this limitation, we have proposed a new IWPP algorithm that can take advantage of non-atomic SIMD instructions supported on the Intel Xeon Phi. We have also developed and evaluated methods to use CPU and Intel Phi cooperatively for parallel execution of the IWPP algorithms. Our new cooperative IWPP version is also able to handle large out-of-core images that would not fit into the memory of the accelerator. The new IWPP algorithm is used to implement the Morphological Reconstruction and Fill Holes operations, which are operations commonly found in image analysis applications. The vectorization implemented with the new IWPP has attained improvements of up to about 5× on top of the original IWPP and significant gains as compared to the state-of-the-art CPU and GPU versions. The new version running on an Intel Phi is 6.21× and 3.14× faster than running on a 16-core CPU and on a GPU, respectively. Finally, the cooperative execution using two Intel Phi devices and a multi-core CPU has reached performance gains of 2.14× as compared to the execution using a single Intel Xeon Phi.

“…As such, recent efforts on efficient implementations of Breadth-First Search (BFS) [16] [17] are interesting for the sake of comparison with IWPP execution schemes and optimizations. The work of Hong et al [16], for instance, provides techniques and optimizations to deal with load imbalance from irregular number of edges in vertices from real-world graphs for their GPU-based BFS algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

“…Although these techniques have shown to be effective to their work, it would have no impact in IWPP that has a regular and constant number of edges per vertex, represented by the fixed neighborhood. Tao et al [17] is a closer related work that describes approaches to accelerate BFS using the Intel Phi. It develops reading and expansion operations using SIMD instructions, but it still uses atomic (non-vectorized) instructions to perform expansion of vertices.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Irregular Wavefront Propagation Algorithms on Intel(R) Xeon Phi(TM)

Gomes, Teodoro, Melo, et al. 2015

2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations.

“…For instance, Hong et al in [6] presented a hybrid method which dynamically decides the best execution method for each BFS-level iteration, shifting between sequential execution, multi-core CPU-only execution, and GPUs. Tao, Yutong and Guang [7] developed two different approaches to improve the performance of BFS algorithm on an Intel Xeon Phi coprocessor.…”

Section: Related Work (mentioning)

confidence: 99%

Characterizing the Communication Demands of the Graph500 Benchmark on a Commodity Cluster

Fuentes, Bosque, Beivide, et al. 2014

2014 IEEE/ACM International Symposium on Big Data Computing

Abstract: Big Data applications have gained importance over the last few years. Such applications focus on the analysis of huge amounts of unstructured information and present a series of differences with traditional High Performance Computing (HPC) applications. For illustrating such dissimilarities, this paper analyzes the behavior of the most scalable version of the Graph500 benchmark when run on a state-of-the-art commodity cluster facility. Our work shows that this new computation paradigm stresses the interconnection subsystem. In this work, we provide both analytical and empirical characterizations of the Graph500 benchmark, showing that its communication needs bound the achieved performance on a cluster facility. Up to our knowledge, our evaluation is the first to consider the impact of message aggregation on the communication overhead and explore a tradeoff that diminishes benchmark execution time, increasing system performance.
