Abstract. Finite-difference, stencil-based discretization approaches are widely used in the solution of partial differential equations describing physical phenomena. The Newton-Krylov iterative methods commonly used in stencil-based solutions generate matrices that exhibit diagonal sparsity patterns. To exploit these structures on modern GPUs, we extend the standard diagonal sparse matrix representation and define new matrix and vector data types in the PETSc parallel numerical toolkit. We create tunable CUDA implementations of the operations associated with these types after identifying a number of GPU-specific optimizations and tuning parameters for these operations. We discuss our implementation of GPU autotuning capabilities in the Orio framework and present performance results for several kernels, comparing them with vendor-tuned library implementations.
Redundant Multi-Threading (RMT) provides a potentially low-cost mechanism to increase GPU reliability by replicating computation at the thread level. Prior work has shown that RMT's high performance overhead stems not only from executing redundant threads but also from the synchronization between the original and redundant threads. The overhead of inter-thread synchronization can be especially significant if the synchronization is implemented using global memory. This work presents novel compiler techniques using fingerprinting and cross-lane operations to reduce synchronization overhead for RMT on GPUs. Fingerprinting combines multiple synchronization events into one event by hashing, and cross-lane operations enable thread-level synchronization via register-level communication. This work shows that fingerprinting yields a 73.5% reduction in GPU RMT overhead, while cross-lane operations reduce the overhead by 43%, compared with state-of-the-art GPU RMT solutions on real hardware.
Deep learning has become a common occurrence in the business lexicon. Its unprecedented success in recent years can be attributed to an abundance of data, the availability of gargantuan compute capabilities offered by GPUs, and the adoption of an open-source philosophy by researchers and industry. Deep neural networks can be decomposed into a series of different operators. MIOpen, AMD's open-source deep learning primitives library for GPUs, provides highly optimized implementations of such operators, shielding researchers from internal implementation details and hence accelerating the time to discovery. This paper introduces MIOpen and provides details about the internal workings of the library and supported features. MIOpen innovates on several fronts, such as implementing fusion to optimize for memory bandwidth and GPU launch overheads, providing an auto-tuning infrastructure to overcome the large design space of problem configurations, and implementing different algorithms to optimize convolutions for different filter and input sizes. MIOpen is one of the first libraries to publicly support the bfloat16 data type for convolutions, allowing efficient training at lower precision without loss of accuracy.
Abstract. Numerical solutions of nonlinear partial differential equations frequently rely on iterative Newton-Krylov methods, which linearize a finite-difference stencil-based discretization of a problem, producing a sparse matrix with regular structure. Knowledge of this structure can be used to exploit parallelism and locality of reference on modern cache-based multi- and manycore architectures, achieving high performance for computations underlying commonly used iterative linear solvers. In this paper we describe our approach to sparse matrix data structure design and our implementation of the kernels underlying iterative linear solvers in PETSc. We also describe autotuning of CUDA implementations based on high-level descriptions of the stencil-based matrix and vector operations.

Key words. structured grid, sparse matrix format, iterative solvers, autotuning, GPGPU, PETSc

AMS subject classifications. 65Y10, 65F50, 15A06, 68N19

1. Introduction. Many scientific applications rely on high-performance numerical libraries, such as Hypre [17], PETSc [5][6][7], SuperLU [19], and Trilinos [27], to provide accurate and fast solutions to problems modeled by nonlinear partial differential equations (PDEs). Thus, the bulk of the burden in achieving good performance and portability is placed on the library implementors, largely freeing computational scientists from low-level performance optimization and portability concerns. At the same time, the increasing availability of hybrid CPU/accelerator architectures is making the task of providing both portability and high performance in libraries and applications increasingly challenging. The latest Top500 list [2] contains thirty-nine supercomputing systems with GPGPUs, and Amazon has announced the availability of Cluster GPU Instances for Amazon EC2.
More and more researchers have access to GPU clusters instead of CPU clusters for large-scale computation problems in areas such as high-energy physics, scientific simulation, data mining, climate forecasting, and earthquake prediction. Relying entirely on compilers for code optimization does not produce satisfactory results, in part because the languages in which libraries are implemented (C, C++, Fortran) fail to expose sufficient information for aggressive optimizations, and in part because of the tension between software design and performance: a well-engineered, dynamically extensible library is typically much more difficult to optimize through traditional compiler approaches.