Klaus Kofler scite author profile

Kofler, K.; Grasso, I.; Cosenza, B.; Fahringer, T.: An automatic input-sensitive approach for heterogeneous task partitioning. ABSTRACTUnleashing the full potential of heterogeneous systems, consisting of multi-core CPUs and GPUs, is a challenging task due to the difference in processing capabilities, memory availability, and communication latencies of different computational resources.In this paper we propose a novel approach that automatically optimizes task partitioning for different (input) problem sizes and different heterogeneous multi-core architectures. We use the Insieme source-to-source compiler to translate a single-device OpenCL program into a multi-device OpenCL program. The Insieme Runtime System then performs dynamic task partitioning based on an offline-generated prediction model. In order to derive the prediction model, we use a machine learning approach based on Artificial Neural Networks (ANN) that incorporates static program features as well as dynamic, input sensitive features. Principal component analysis have been used to further improve the task partitioning. Our approach has been evaluated over a suite of 23 programs and respectively achieves a performance improvement of 22% and 25% compared to an execution of the benchmarks on a single CPU and a single GPU which is equal to 87.5% of the optimal performance.

show abstract

Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

Thoman

Kofler

Studt

et al. 2011

View full text Add to dashboard Cite

Automatic Data Layout Optimizations for GPUs

Kofler

Cosenza

Fahringer

2015

View full text Add to dashboard Cite

Memory optimizations have became increasingly important in order to fully exploit the computational power of modern GPUs. The data arrangement has a big impact on the performance, and it is very hard for GPU programmers to identify a well-suited data layout. Classical data layout transformations include grouping together data fields that have similar access patterns, or transforming Array-of-Structures (AoS) to Structure-of-Arrays (SoA). This paper presents an optimization infrastructure to automatically determine an improved data layout for OpenCL programs written in AoS layout. Our framework consists of two separate algorithms: The first one constructs a graph-based model, which is used to split the AoS input struct into several clusters of fields, based on hardware dependent parameters. The second algorithm selects a good per-cluster data layout (e.g., SoA, AoS or an intermediate layout) using a decision tree. Results show that the combination of both algorithms is able to deliver higher performance than the individual algorithms. The layouts proposed by our framework result in speedups of up to 2.22, 1.89 and 2.83 on an AMD FirePro S9000, NVIDIA GeForce GTX 480 and NVIDIA Tesla k20m, respectively, over different AoS sample programs, and up to 1.18 over a manually optimized program.

show abstract

Kd-Tree Based N-Body Simulations with Volume-Mass Heuristic on the GPU

Kofler

Steinhauser

Cosenza

et al. 2014

View full text Add to dashboard Cite

N-body simulations represent an important class of numerical simulations in order to study a wide range of physical phenomena for which researchers demand fast and accurate implementations. Due to the computational complexity, simple brute-force methods to solve the longdistance interaction between bodies can only be used for smallscale simulations. Smarter approaches utilize neighbor lists, tree methods or other hierarchical data structures to reduce the complexity of the force calculations. However, such data structures have complex building algorithms which hamper their parallelization for GPUs.In this paper, we introduce a novel method to effectively parallelize N-body simulations for GPU architectures. Our method is based on an efficient, three-phase, parallel Kd-tree building algorithm and a novel volume-mass heuristic to reduce the simulation time and increase accuracy.Experiments demonstrate that our approach is the fastest monopole implementation with an accuracy that is comparable with state of the art implementations (GADGET-2). In particular, we are able to reach a simulation speed of up to 3 Mparticles/s on a single GPU for the force calculation, while still having a relative force error below 0.4% for 99% of the particles. We also show competitive performance with existing GPU implementations, while our competitor shows worse accuracy behavior as well as a higher energy error during time integration.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Klaus Kofler

Performance and Scalability of GPU-Based Convolutional Neural Networks

An automatic input-sensitive approach for heterogeneous task partitioning

Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

Automatic Data Layout Optimizations for GPUs

Kd-Tree Based N-Body Simulations with Volume-Mass Heuristic on the GPU

Contact Info

Product

Resources

About