Mohamed Wahib scite author profile

Stencil computation is one of the most widely-used compute patterns in high performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code, and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.CCS Concepts • Software and its engineering → Source code generation.

show abstract

Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework

Munawar

Wahib

Munetomo

et al. 2009

Genet Program Evolvable Mach

View full text Add to dashboard Cite

General Purpose computing over Graphical Processing Units (GPGPUs) is a huge shift of paradigm in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a hybrid of Genetic Algorithm (GA) and Local Search (LS) to solve MAXimum SATisfiability (MAX-SAT) problem on a state-of-the-art nVidia Tesla GPU using nVidia Compute Unified Device Architecture (CUDA). MAX-SAT is a problem of practical importance and is often solved by employing metaheuristics based search methods like GAs and hybrid of GA with LS. Almost all the parallel GAs (pGAs) designed in the last two decades were designed for either clusters or MPPs. Unfortunately, very little research is done on the implementation of such algorithms over commodity graphics hardware. GAs in their simple form are not suitable for implementation over the Single Instruction Multiple Thread (SIMT) architecture of a GPU, same is the case with conventional LS algorithms. In this paper we explore different genetic operators that can be used for an efficient implementation of GAs over nVidia GPUs. We also design and introduce new techniques/operators for an efficient implementation of GAs and LS over such architectures. We use nVidia Tesla C1060 to perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25x. We also discuss the effects of different optimization techniques on the overall execution time.

show abstract

A versatile software systolic execution model for GPU memory-bound kernels

Chen

Wahib

Takizawa

et al. 2019

View full text Add to dashboard Cite

This paper proposes a versatile high-performance execution model, inspired by systolic arrays, for memory-bound regular kernels running on CUDA-enabled GPUs. We formulate a systolic model that shifts partial sums by CUDA warp primitives for the computation. We also employ register files as a cache resource in order to operate the entire model efficiently. We demonstrate the effectiveness and versatility of the proposed model for a wide variety of stencil kernels that appear commonly in HPC, and also convolution kernels (increasingly important in deep learning workloads). Our algorithm outperforms the top reported state-of-the-art stencil implementations, including implementations with sophisticated temporal and spatial blocking techniques, on the two latest Nvidia architectures: Tesla V100 and P100. For 2D convolution of general filter sizes and shapes, our algorithm is on average 2.5× faster than Nvidia's NPP on V100 and P100 GPUs. CCS CONCEPTS• Computer systems organization → Systolic arrays; Multicore architectures.

show abstract

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads

Nguyen

Wahib

Takano

2018

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Mohamed Wahib

Scalable Kernel Fusion for Memory-Bound GPU Applications

AN5D: automated stencil framework for high-degree temporal blocking on GPUs

Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework

A versatile software systolic execution model for GPU memory-bound kernels

Hierarchical Distributed-Memory Multi-Leader MPI-Allreduce for Deep Learning Workloads

Contact Info

Product

Resources

About