2010
DOI: 10.1007/s00450-010-0107-3

A middleware for efficient stream processing in CUDA

Abstract: This paper presents a middleware capable of out-of-order execution of kernels and data transfers for efficient stream processing in the compute unified device architecture (CUDA). Our middleware runs on the CUDA-compatible graphics processing unit (GPU). Using the middleware, application developers can easily overlap kernel computation with data transfer between the main memory and the video memory. To maximize the efficiency of this overlap, our middleware performs out-of-order execution of commands…
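The overlap the abstract describes rests on CUDA streams: commands issued to different streams may execute concurrently, so one chunk's kernel can run while another chunk's transfers are in flight. The sketch below illustrates that underlying mechanism only; it is not the paper's middleware, and the kernel, sizes, and chunk count are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel; the paper's own kernels are not shown.
__global__ void scale(float *d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    const int kChunks = 4, kN = 1 << 20;
    float *h_buf, *d_buf;
    cudaStream_t s[kChunks];

    // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
    cudaHostAlloc((void **)&h_buf, kChunks * kN * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, kChunks * kN * sizeof(float));
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < kChunks; ++c) {
        float *h = h_buf + c * kN, *d = d_buf + c * kN;
        // Download, compute, and readback are queued per stream; the runtime
        // can overlap chunk c's kernel with chunk c+1's transfers.
        cudaMemcpyAsync(d, h, kN * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(kN + 255) / 256, 256, 0, s[c]>>>(d, kN, 2.0f);
        cudaMemcpyAsync(h, d, kN * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

The middleware's contribution, per the abstract, is choosing the *order* in which such commands are issued across streams; the hand-written round-robin above is the baseline it improves on.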

Year Published: 2011–2023

Publication Types

Select...
4
4

Relationship

3
5

Authors

Journals

Cited by 9 publications (7 citation statements)
References 7 publications
“…GPU-chariot extends our previous study [9] in order to reduce development efforts for multi-GPU systems by automating the abovementioned time-consuming tasks. Our framework allows for out-of-order execution of CUDA functions, and realizes efficient software pipelines in four stages: (1) CPU execution, (2) data download from the CPU to the GPU, (3) GPU execution, and (4) data readback from the GPU to the CPU.…”
Section: Introduction
confidence: 93%
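The four-stage pipeline described in this citation can be expressed as a single per-chunk command sequence if the CPU stage is enqueued into the stream itself. The fragment below is a minimal sketch of that idea, assuming `cudaLaunchHostFunc` (CUDA 10+) for the host stage; it is not GPU-chariot's actual API, and `square`, `Chunk`, and `enqueue_chunk` are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

struct Chunk { float *h; int n; };  // pinned host buffer and element count

// Stage 1 (CPU execution), run in-stream by cudaLaunchHostFunc; later
// commands in the same stream wait for it, while other streams proceed.
void cpu_prepare(void *arg) {
    Chunk *c = (Chunk *)arg;
    for (int i = 0; i < c->n; ++i) c->h[i] += 1.0f;
}

void enqueue_chunk(cudaStream_t s, Chunk *c, float *d) {
    size_t bytes = c->n * sizeof(float);
    cudaLaunchHostFunc(s, cpu_prepare, c);                        // 1. CPU execution
    cudaMemcpyAsync(d, c->h, bytes, cudaMemcpyHostToDevice, s);   // 2. download
    square<<<(c->n + 255) / 256, 256, 0, s>>>(d, c->n);           // 3. GPU execution
    cudaMemcpyAsync(c->h, d, bytes, cudaMemcpyDeviceToHost, s);   // 4. readback
}
```

Calling `enqueue_chunk` for successive chunks on distinct streams lets the four stages of different chunks overlap, which is the software-pipeline structure the quoted statement attributes to GPU-chariot.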
See 1 more Smart Citation
“…GPU-chariot extends our previous study [9] in order to reduce development efforts for multi-GPU systems by automating the abovementioned time-consuming tasks. Our framework allows for out-oforder execution of CUDA functions, and realizes efficient software pipelines in four stages: (1) CPU execution, (2) data download from the CPU to the GPU, (3) GPU execution, and (4) data readback from the GPU to the CPU.…”
Section: Introductionmentioning
confidence: 93%
“…Because StreamIt is a high-level language for stream applications, it relieves programmers from having to write and optimize their kernels to fully utilize the fast memory resources on the GPU chip. Similar frameworks [9]-[11] that can achieve significant acceleration over CPU-based implementations are available; however, the kernels they generate are for single-GPU systems. Thus, an automated framework that addresses multi-GPU systems and scales application performance according to the number of available GPUs is required.…”
Section: Introduction
confidence: 99%
“…Therefore, the input data size reaches 256 MB when v = 16. Since this overhead occurs on the CPU, we think that the overhead can be overlapped with kernel execution by using a stream processing technique [5], [13]. In addition to this overhead, both of the kernels reduce the performance to approximately 50%.…”
Section: A Performance Comparison With Previous Scheme
confidence: 99%
“…In recent years, the use of schedulers based on many-core or heterogeneous architectures, for general or for specific applications, has been widely studied [48,27]. S. Yamagiwa et al. [48] propose GPGPU streaming based on a distributed computing environment; S. Nakagawa et al. [27] provide a new middleware capable of out-of-order execution of work and data transfers using stream processing.…”
confidence: 99%
“…S. Yamagiwa et al. [48] propose GPGPU streaming based on a distributed computing environment; S. Nakagawa et al. [27] provide a new middleware capable of out-of-order execution of work and data transfers using stream processing. Other works [13,46] follow a similar strategy based on streaming to minimize data-transfer overhead.…”
confidence: 99%