A memory efficient algorithm for adaptive multidimensional integration with multiple GPUs

Arumugam, Kamesh; Godunov, Alexander; Ranjan, Desh; Terzić, Balša; Zubair, Mohammad

doi:10.1109/hipc.2013.6799120

Cited by 8 publications

(7 citation statements)

References 12 publications

“…In the works of Arumugam et al, a parallel algorithm with a deterministic adaptive strategy for the multidimensional integration on GPUs only is presented. The authors focus their attention to the optimization techniques to implement a two‐step procedure: in the first step the algorithm generates a list of sub‐domains of the integration region that are then processed in the second step by using the GPU.…”

Section: Related Work

mentioning

confidence: 99%
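The two-step procedure described in this statement can be illustrated with a minimal Python sketch. It is not the authors' implementation: the midpoint rule, the error indicator, the `tol` and `max_regions` parameters, and all function names are assumptions chosen for brevity, and a plain loop stands in for the GPU pass of the second step.

```python
from itertools import product


def midpoint_estimate(f, lo, hi):
    """One-point midpoint rule on the box [lo, hi] (assumed rule, for illustration)."""
    volume = 1.0
    mid = []
    for a, b in zip(lo, hi):
        volume *= b - a
        mid.append(0.5 * (a + b))
    return volume * f(mid)


def split_all_axes(lo, hi):
    """Yield the 2^n child boxes obtained by halving every axis."""
    halves = [((a, 0.5 * (a + b)), (0.5 * (a + b), b)) for a, b in zip(lo, hi)]
    for combo in product(*halves):
        yield [c[0] for c in combo], [c[1] for c in combo]


def refined_estimate(f, lo, hi):
    """Sum of midpoint estimates over the 2^n children of a box."""
    return sum(midpoint_estimate(f, c_lo, c_hi) for c_lo, c_hi in split_all_axes(lo, hi))


def generate_subdomains(f, lo, hi, tol, max_regions=4096):
    """Step 1: adaptively build a list of sub-domains of the integration region."""
    pending = [(list(lo), list(hi))]
    accepted = []
    while pending:
        box_lo, box_hi = pending.pop()
        # Hypothetical error indicator: refined estimate minus coarse estimate.
        err = abs(refined_estimate(f, box_lo, box_hi) - midpoint_estimate(f, box_lo, box_hi))
        if err <= tol or len(pending) + len(accepted) + 1 >= max_regions:
            accepted.append((box_lo, box_hi))
        else:
            pending.extend(split_all_axes(box_lo, box_hi))
    return accepted


def integrate_subdomains(f, boxes):
    """Step 2: process every sub-domain independently (loop stands in for the GPU)."""
    return sum(refined_estimate(f, lo, hi) for lo, hi in boxes)
```

Because every sub-domain in the second step is evaluated independently of the others, that loop is the part that would map naturally onto data-parallel hardware.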

An adaptive algorithm for high‐dimensional integrals on heterogeneous CPU‐GPU systems

Laccetti; Lapegna; Mele; et al. 2018

Concurrency and Computation

Summary: In this paper, we introduce an adaptive procedure for the numerical computation of high‐dimensional integrals on HPC systems with heterogeneous nodes composed of multi‐core CPU and GPU devices. To this aim, we have integrated two different approaches: the first is in charge of a fair workload distribution among the threads running on the multi‐core CPU, while the second is in charge of an efficient execution of the computational kernels on the GPU. We tested the resulting algorithm on several test functions on a system where the nodes are provided with two Intel ten‐core CPUs and one NVIDIA GPU device.

“…n , along the outer dimension. Adaptive quadrature is traditionally used to compute such partitions, which, as illustrated in [3,4], is characterized by control-flow and memory access irregularities that lead to severe performance bottlenecks on GPU architectures.…”

Section: Forecasting Control-flow

mentioning

confidence: 99%

“…Once the partition is computed, integral estimate is calculated using Equation 14. However, in our proposed approach, a single unique partition per class that combines the partition of rp-integral at all grid points of that particular class is calculated using heuristics instead of using traditional adaptive quadrature methods on each point. The main motivations for calculating such unique partition for a group of points instead of individual grid-point is that it eliminates the need for adaptive quadrature or data-dependent control-flow on each integral evaluation, which, as illustrated in [3,4,5], is the main performance bottleneck for such adaptive computations on SIMD architectures. The procedure RP-IntegralPartition implements this heuristics approach, where for each class c ∈ C, it generates a unique partition P[1..P.length] that denotes a rp-integral partition along the outer integration domain (r-domain). Ideally, P should be a combination of the partitions generated by rp-integral at all p ∈ c. However, computing such partition per class instead of individual grid-point is computationally challenging due to the data-dependent, and irregular control-flow behavior of different rp-integrals.…”

mentioning

confidence: 99%

Efficient Machine Learning Approach for Optimizing Scientific Computing Applications on Emerging HPC Architectures

Arumugam 2017

Self Cite

Efficient parallel implementations of scientific applications on multi-core CPUs with accelerators such as GPUs and Xeon Phis are challenging. This requires exploiting the data parallel architecture of the accelerator along with the vector pipelines of modern x86 CPU architectures, load balancing, and efficient memory transfer between different devices. It is relatively easy to meet these requirements for highly structured scientific applications. In contrast, a number of scientific and engineering applications are unstructured. Getting performance on accelerators for these applications is extremely challenging because many of these applications employ irregular algorithms which exhibit data-dependent control-flow and irregular memory accesses. Furthermore, these applications are often iterative with dependency between steps, thus making it hard to parallelize across steps. As a result, parallelism in these applications is often limited to a single step. Numerical simulation of charged

“…Unfortunately, this approach is infeasible for higher dimensions as $d^n$ grows exponentially with $n$. For example if $n = 10$ and we need to split each dimension into $d = 20$ parts the number of sub-regions created would be $20^{10}$ which is roughly $10^{13}$. Moreover, uniform division of the integration region is not the best way to estimate the integral.…”

Section: Introduction

mentioning

confidence: 99%
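The arithmetic in this statement is easy to check directly. The short Python sketch below counts the sub-regions produced by a uniform split; the function name and the extra dimensions printed in the loop are illustrative assumptions, not taken from the cited paper.

```python
def uniform_subregion_count(parts_per_dimension, dimensions):
    """Number of sub-regions when every axis is split into the same number of parts."""
    return parts_per_dimension ** dimensions


# The case from the statement: d = 20 parts per axis, n = 10 dimensions.
count = uniform_subregion_count(20, 10)
print(count)  # 10240000000000, i.e. about 1.02e13

# Growth with the dimension n for a fixed d = 20 (illustrative values of n).
for n in (2, 4, 6, 8, 10):
    print(n, uniform_subregion_count(20, n))
```

Even at a single integrand evaluation per sub-region, a count near 10^13 is far beyond what a uniform grid can handle, which is the motivation for adaptive subdivision.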

“…We propose a new deterministic, parallel adaptive algorithm for multi-dimensional integration for massively parallel architectures. It is inspired by the Cuhre method of the Cuba library first introduced in [2] and its parallel GPU-adaptation [6] [10]. Unlike other parallel methods such as [6] [10], the proposed PAGANI algorithm does not utilize the common sequential scheme seen in adaptive integration.…”

Section: Introduction

mentioning

confidence: 99%

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical

Sakiotis; Arumugam; Paterno; et al. 2021

Preprint

Self Cite

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core utilization is difficult to achieve because the adaptive work-load can vary greatly across the integration space and is impossible to predict a priori. Existing parallel algorithms utilize sequential computations on independent processors, which results in bottlenecks due to the need for data redistribution and processor synchronization. Our algorithm employs a high-throughput approach in which all existing sub-regions are processed and sub-divided in parallel. Repeated sub-region classification and filtering improves upon a brute-force approach and allows the algorithm to make efficient use of computation and memory resources. A CUDA implementation shows orders of magnitude speedup over the fastest open-source CPU method and extends the achievable accuracy for difficult integrands. Our algorithm typically outperforms other existing deterministic parallel methods.
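The high-throughput idea in this abstract, where every existing sub-region is processed and subdivided in each pass and finished regions are filtered out, can be sketched in one dimension. This is not the PAGANI algorithm itself: the Simpson rule, the width-scaled acceptance test, the `max_iters` cap, and the function names are assumptions made for illustration, and ordinary Python loops stand in for the parallel kernels.

```python
def simpson(f, a, b):
    """Simpson's rule on the interval [a, b]."""
    m = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b))


def breadth_first_integrate(f, a, b, tol, max_iters=30):
    """Process all active intervals per pass, filter finished ones, split the rest."""
    active = [(a, b)]
    finished_sum = 0.0
    for _ in range(max_iters):
        if not active:
            break
        # Pass 1: evaluate every active interval (each evaluation is independent).
        results = []
        for lo, hi in active:
            mid = 0.5 * (lo + hi)
            coarse = simpson(f, lo, hi)
            fine = simpson(f, lo, mid) + simpson(f, mid, hi)
            results.append((lo, hi, fine, abs(fine - coarse)))
        # Pass 2: classify and filter. The acceptance test below is an assumed
        # criterion: the local error must fit its width-proportional share of tol.
        next_active = []
        for lo, hi, fine, err in results:
            if err <= tol * (hi - lo) / (b - a):
                finished_sum += fine  # finished: keep the estimate, drop the region
            else:
                mid = 0.5 * (lo + hi)
                next_active.append((lo, mid))
                next_active.append((mid, hi))
        active = next_active
    # Intervals still active after max_iters contribute their current estimate.
    finished_sum += sum(simpson(f, lo, hi) for lo, hi in active)
    return finished_sum
```

Filtering finished intervals keeps the active list, and therefore memory use, proportional to the part of the domain that still needs refinement rather than to the full subdivision tree.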
