Xin Huo scite author profile

The Intel Xeon Phi offers a promising solution to coprocessing, since it is based on the popular x86 instruction set. However, to fully utilize its potential, applications must be vectorized to leverage the wide SIMD lanes, in addition to effective large-scale shared memory parallelism. Compared to the SIMT execution model on GPGPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefitted applications involving branches, irregular accesses, or even reductions in the past. In this paper, we consider the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using available SIMD parallelism. We offer an API for both shared memory and SIMD parallelization, and demonstrate its implementation. We use implementations of overloaded functions as a mechanism for providing SIMD code, which is assisted by runtime data reordering and our methods to effectively manage control flow. Our extensive evaluation with six popular applications shows large gains over the SIMD parallelization achieved by the production (ICC) compiler, and we even outperform OpenMP for MIMD parallelism.
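
The idea of supplying SIMD code through overloaded functions, with control flow handled by masking, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Vec`, `select`, and `kernel` are hypothetical and are not the API described in the abstract, and a Python list stands in for a hardware SIMD register.

```python
class Vec:
    """Hypothetical stand-in for a SIMD register: one value per lane."""

    def __init__(self, lanes):
        self.lanes = list(lanes)

    def _lift(self, other):
        # Broadcast a scalar across all lanes; pass a Vec through unchanged.
        if isinstance(other, Vec):
            return other.lanes
        return [other] * len(self.lanes)

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self.lanes, self._lift(other)))

    def __mul__(self, other):
        return Vec(a * b for a, b in zip(self.lanes, self._lift(other)))

    def __gt__(self, other):
        # A comparison yields a per-lane mask instead of a single bool.
        return Vec(a > b for a, b in zip(self.lanes, self._lift(other)))


def select(mask, if_true, if_false):
    """Branch replacement: both sides are computed, the mask picks per lane."""
    if not isinstance(mask, Vec):
        return if_true if mask else if_false
    t = mask._lift(if_true)
    f = mask._lift(if_false)
    return Vec(a if m else b for m, a, b in zip(mask.lanes, t, f))


def kernel(x):
    # Written once; works for a scalar or a Vec because the operators
    # and select() are overloaded on the argument type.
    return select(x > 0, x * 2, x + 10)


scalar_result = kernel(3)
vector_result = kernel(Vec([3, -1, 0, 5])).lanes
```

The same `kernel` body runs on a scalar or on all lanes at once, which is the property overloading provides; on real hardware the `Vec` operations would map to vector instructions rather than Python loops.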

An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Huo, Ravi, et al. 2011

GPUs have rapidly emerged as a very significant player in high performance computing. However, despite the popularity of CUDA, there are significant challenges in porting different classes of HPC applications to modern GPUs. This paper focuses on the challenges of implementing irregular applications arising from unstructured grids on modern NVIDIA GPUs. Considering the importance of irregular reductions in scientific and engineering codes, substantial effort was made in developing compiler and runtime support for parallelization or optimization of these codes in the previous two decades, with different efforts targeting distributed memory machines, distributed shared memory machines, shared memory machines, or cache performance improvement on uniprocessor machines. However, there have not been any systematic studies on parallelizing these applications on modern GPUs. There are at least two significant challenges associated with porting this class of applications to modern GPUs. The first is related to correct and efficient parallelization while using a large number of threads. The second challenge is effective use of shared memory. Since data accesses cannot be determined statically, runtime partitioning methods are needed for effectively using the shared memory. This paper describes an execution methodology that can address the above two challenges. We have also developed optimized runtime modules to support our execution methodology. Our approach and runtime methods have been extensively evaluated using two indirection-array-based applications.
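
To make the terms concrete, here is a minimal sketch of an irregular reduction driven by an indirection array, together with one possible runtime partitioning. The choice to partition on the output (reduction) space, the function names, and the use of a small Python list as a stand-in for GPU shared memory are all assumptions for illustration; the abstract does not specify the partitioning scheme actually used.

```python
def irregular_reduction(num_nodes, edges, edge_values):
    """Sequential reference: each edge updates both of its end nodes.

    `edges` is the indirection array; the nodes it touches are only
    known at runtime, which is what makes the reduction irregular.
    """
    result = [0.0] * num_nodes
    for (u, v), w in zip(edges, edge_values):
        result[u] += w
        result[v] -= w
    return result


def partitioned_reduction(num_nodes, edges, edge_values, part_size):
    """Split the output array into blocks of `part_size` nodes.

    Each block accumulates into a small private buffer (standing in for
    shared memory). An edge whose end nodes fall in two different blocks
    is placed in both buckets and processed once by each.
    """
    result = [0.0] * num_nodes
    num_parts = (num_nodes + part_size - 1) // part_size

    # Runtime partitioning: bucket edges by the block(s) owning their nodes.
    buckets = [[] for _ in range(num_parts)]
    for idx, (u, v) in enumerate(edges):
        pu, pv = u // part_size, v // part_size
        buckets[pu].append(idx)
        if pv != pu:
            buckets[pv].append(idx)

    for p, bucket in enumerate(buckets):
        lo = p * part_size
        hi = min(lo + part_size, num_nodes)
        local = [0.0] * (hi - lo)  # private buffer for this block
        for idx in bucket:
            u, v = edges[idx]
            w = edge_values[idx]
            if lo <= u < hi:
                local[u - lo] += w
            if lo <= v < hi:
                local[v - lo] -= w
        result[lo:hi] = local  # blocks own disjoint slices, so no conflicts
    return result


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = irregular_reduction(4, edges, values)
partitioned = partitioned_reduction(4, edges, values, part_size=2)
```

Because each block writes only its own slice of the output, blocks could run in parallel without atomic updates, at the cost of processing boundary-crossing edges twice.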

Efficient scheduling of recursive control flow on GPUs

Huo, Krishnamoorthy, Agrawal 2013

Graphics processing units (GPUs) have rapidly emerged as a very significant player in high performance computing. Single instruction multiple thread (SIMT) pipelines are typically used in GPUs to exploit parallelism and maximize performance. Although support for unstructured control flow has been included in GPUs, efficiently managing thread divergence for arbitrary parallel programs remains a critical challenge. In this paper, we focus on the problem of supporting recursion in modern GPUs. We design and comparatively evaluate various algorithms to manage thread divergence encountered in recursive programs. The results improve upon traditional post-dominator based reconvergence mechanisms designed to handle thread divergence due to control flow within a procedure.



Xin Huo

Accelerating MapReduce on a coupled CPU-GPU architecture

Porting irregular reductions on heterogeneous CPU-GPU configurations

A programming system for Xeon Phis with runtime SIMD parallelization

An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Efficient scheduling of recursive control flow on GPUs
