Abstract. We present a comparison of several modern C++ libraries providing high-level interfaces for programming multi- and many-core architectures on top of CUDA or OpenCL. The comparison focuses on the solution of ordinary differential equations and is based on odeint, a framework for solving systems of ordinary differential equations. Odeint is designed in a very flexible way and may easily be adapted for effective use of libraries such as MTL4, VexCL, or ViennaCL, employing CUDA or OpenCL technologies. We found that CUDA and OpenCL work equally well for problems of large size, while OpenCL has higher overhead for smaller problems. Furthermore, we show that modern high-level libraries allow effective use of the computational resources of many-core GPUs or multi-core CPUs without much knowledge of the underlying technologies.

Key words. GPGPU, OpenCL, CUDA, C++, Boost.odeint, MTL4, VexCL, ViennaCL

AMS subject classifications. 34-04, 65-04, 65Y05, 65Y10, 97N80

1. Introduction. Recently, general-purpose computing on graphics processing units (GPGPU) has acquired considerable momentum in the scientific community. This is confirmed both by the increasing number of GPGPU-related publications and by the GPU-based supercomputers in the TOP500 list. The major programming frameworks are NVIDIA CUDA and OpenCL. The former is a proprietary parallel computing architecture developed by NVIDIA for general-purpose computing on NVIDIA graphics adapters; the latter is an open, royalty-free standard for cross-platform parallel programming of modern processors and GPUs, maintained by the Khronos Group. By nature, the two frameworks have their distinctive pros and cons. CUDA has a more mature programming environment with a larger set of scientific libraries, but is available for NVIDIA hardware only. OpenCL is supported on a wide range of hardware, but its native API requires a much larger amount of boilerplate code from the developer.
Another problem with OpenCL is that it is generally difficult to achieve performance portability across different hardware architectures.

Both technologies provide scientists with the vast computational resources of modern GPUs at the price of a steep learning curve. Programmers need to familiarize themselves with a new programming language and, more importantly, with a new programming paradigm. However, the entry barrier may be lowered with the help of specialized libraries. The CUDA Toolkit includes several such libraries (BLAS implementations, fast Fourier transforms, Thrust, and others). OpenCL lacks standard libraries, but there are a number of third-party projects aimed at easing the development of both CUDA and OpenCL programs. This paper presents a comparison of several modern C++ libraries aimed at easing GPGPU development. We look at both the convenience and the performance of the libraries.
Abstract. This paper presents a case study on the applicability and usage of non-blocking collective operations. These operations provide the ability to overlap communication with computation and to avoid unnecessary synchronization. We introduce our NBC library, a portable low-overhead implementation of non-blocking collectives on top of MPI-1. We demonstrate the easy usage of the NBC library with the optimization of a conjugate gradient solver, requiring only minor changes to the traditional parallel implementation of the program. The optimized solver runs up to 34% faster and is able to overlap most of the communication. We show that, due to the overlap, there is no performance difference between Gigabit Ethernet and InfiniBand for our calculation.
Positive results from new object-oriented tools for scientific programming are reported. Using template classes, abstractions of matrix representations are available that subsume conventional row-major, column-major, either Z- or I-Morton order, as well as block-wise combinations of these. Moreover, the design of the Matrix Template Library (MTL) has been independently extended to provide recursators, supporting block-recursive algorithms and supplementing MTL's iterators. Data types modeling both concepts enable the programmer to implement iterative and recursive algorithms (or combinations of both) on all of the aforementioned matrix representations at once, for a wide family of important scientific operations. We illustrate the unrestricted applicability of our matrix recursator on matrix multiplication. The same generic block-recursive function, unaltered, is instantiated on different triplets of matrix types. Within a base block, either a library multiplication or user-provided, platform-specific code provides excellent performance. We achieve 77% of peak performance using hand-tuned base cases without explicit prefetching. This excellent performance becomes available over a wide family of matrix representations from a single program. The techniques generalize to other applications in linear algebra.
Although overlapping communication with computation is an important mechanism for achieving high performance in parallel programs, developing applications that actually achieve good overlap can be difficult. Existing approaches are typically based on manual or compiler-based transformations. This paper presents a pattern- and library-based approach to optimizing collective communication in parallel high-performance applications, based on using non-blocking collective operations to enable overlapping of communication and computation. Common communication and computation patterns in iterative SPMD computations are used to motivate the transformations we present. Our approach provides the programmer with the capability to optimize communication and computation separately, while automating the interaction between the two to achieve maximum overlap. Performance results with a model application show more than a 90% decrease in communication overhead, resulting in a 21% overall performance improvement.