2019
DOI: 10.1007/978-3-030-20656-7_12

GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications

Cited by 30 publications (7 citation statements)
References 15 publications

Citation statements:
“…Although GPU-CUDA math operations are double-precision capable (Whitehead and Fit-Florea 2011), increased peak performance (e.g. higher speed-up) is found when single-precision operations are used in their place which may differ in precision compared to CPU math operations (Laguna et al 2019). Our findings from table 1 and table 2 show that larger L-M initialization difference at the start of MCMC sampling results in larger S-scores in significantly different distributions.…”
Section: Discussion (mentioning)
confidence: 99%
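To make the single- versus double-precision claim in the excerpt concrete, here is a minimal CUDA sketch (the kernel, sizes, and launch configuration are illustrative assumptions, not taken from the cited works): the same axpy-style kernel instantiated for float and for double. On GPUs whose FP64 units are only a fraction of their FP32 units, the float instantiation reaches the higher peak throughput, at the cost of precision relative to CPU double-precision arithmetic.

// Minimal sketch with assumed sizes: the same axpy-style kernel instantiated
// for float and for double. Compile with nvcc; the data is left uninitialized
// because only the relative kernel throughput is of interest here.
#include <cuda_runtime.h>

template <typename T>
__global__ void axpy(int n, T a, const T* x, T* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float  *xf, *yf;
    double *xd, *yd;
    cudaMalloc(&xf, n * sizeof(float));
    cudaMalloc(&yf, n * sizeof(float));
    cudaMalloc(&xd, n * sizeof(double));
    cudaMalloc(&yd, n * sizeof(double));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    axpy<float ><<<grid, block>>>(n, 2.0f, xf, yf);   // single precision
    axpy<double><<<grid, block>>>(n, 2.0,  xd, yd);   // double precision
    cudaDeviceSynchronize();

    cudaFree(xf); cudaFree(yf); cudaFree(xd); cudaFree(yd);
    return 0;
}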
“…Decreasing the reliance on these massive datacenters and using the edge servers (with the surge of decentralized "datacenters") will allow pre-processing and selective forwarding of processed data sets to the cloud, not only making computing more efficient but also automatically improving data privacy because of the proximity of these servers to the client sites. Other techniques for energy optimization include optimization of DRAM refresh rates on the hardware side [89] and optimizing the mixing of low- and high-precision floating point operations for mixed precision settings using techniques as described recently in the GPUmixer [90].…”
Section: Energy-aware Computing (mentioning)
confidence: 99%
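As an illustration of the kind of per-operation mixing such tuning explores (an assumed sketch, not GPUMixer's actual output, algorithm, or API), a kernel can evaluate error-tolerant intermediate products in float while keeping the accumulation in double:

// Illustrative kernel only (not GPUMixer output): the products are computed
// in float, the running sum and the block reduction stay in double.
// Launch with 256 threads per block; atomicAdd on double needs sm_60+.
__global__ void mixed_dot(int n, const double* x, const double* y, double* out) {
    __shared__ double partial[256];
    int tid = threadIdx.x;
    double acc = 0.0;                                     // high-precision accumulator
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x) {
        float p = (float)x[i] * (float)y[i];              // low-precision product
        acc += (double)p;
    }
    partial[tid] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {        // tree reduction in shared memory
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, partial[0]);
}

Whether such a configuration is acceptable depends on the application's error tolerance, which is exactly what performance-driven tuning has to verify.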
“…If it is not possible to avoid temporary data storage or data usage, precision reduction becomes popular [9,23,24,27,29]. Machine learning pushes the introduction of precision reduction [17], but it is natural to exploit new native hardware formats with reduced memory footprint in scientific computations, too.…”
Section: Terminology (mentioning)
confidence: 99%
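As a hedged example of a native reduced-precision hardware format mentioned in the excerpt (the kernel name and usage are illustrative assumptions, not from the cited paper), CUDA's 16-bit __half type from <cuda_fp16.h> halves the memory footprint of stored data while the arithmetic can still be carried out in float after upcasting:

// Sketch with an assumed kernel: a field stored in CUDA's native 16-bit
// __half format occupies half the memory of a float array; the arithmetic
// is still done in float after an explicit upcast.
#include <cuda_fp16.h>

__global__ void scale_half(int n, float a, const __half* in, __half* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(in[i]);    // upcast for the arithmetic
        out[i] = __float2half(a * v);     // store back in the reduced format
    }
}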
“…data delivery. While the cores with their vector registers can yield an impressive number of computations per second and while there are many cores, we struggle to feed them with data [10,16,17,24,25,27].…”
(mentioning)
confidence: 99%
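A back-of-the-envelope arithmetic-intensity estimate (all machine numbers are assumed, illustrative values, not taken from the cited works) shows why streaming kernels are limited by data delivery rather than by the arithmetic units:

// Back-of-the-envelope estimate with assumed machine numbers, illustrating
// why an axpy-style kernel is limited by data delivery, not by the cores.
#include <cstdio>

int main() {
    // Per element, double-precision axpy (y = a*x + y): 2 flops,
    // 3 doubles moved (load x, load y, store y) = 24 bytes.
    const double intensity = 2.0 / 24.0;              // ~0.08 flop/byte

    // Hypothetical GPU: 7 Tflop/s FP64 peak, 1.5 TB/s memory bandwidth.
    const double balance = 7.0e12 / 1.5e12;           // ~4.7 flop/byte needed to saturate the cores

    std::printf("kernel: %.3f flop/byte, machine balance: %.1f flop/byte\n",
                intensity, balance);
    return 0;
}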