2010
DOI: 10.1007/s00450-010-0107-3

A middleware for efficient stream processing in CUDA

Abstract: This paper presents a middleware capable of out-of-order execution of kernels and data transfers for efficient stream processing in the compute unified device architecture (CUDA). Our middleware runs on the CUDA-compatible graphics processing unit (GPU). Using the middleware, application developers can easily overlap kernel computation with data transfer between the main memory and the video memory. To maximize the efficiency of this overlap, our middleware performs out-of-order execution of commands…
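The overlap the abstract describes rests on CUDA streams: commands issued to different streams may execute concurrently, so one chunk's kernel can run while another chunk's transfers are in flight. The sketch below illustrates that underlying mechanism only; it is not the paper's middleware, and the kernel, sizes, and chunk count are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel; the paper's own kernels are not shown.
__global__ void scale(float *d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    const int kChunks = 4, kN = 1 << 20;
    float *h_buf, *d_buf;
    cudaStream_t s[kChunks];

    // Pinned host memory is required for cudaMemcpyAsync to overlap with kernels.
    cudaHostAlloc((void **)&h_buf, kChunks * kN * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, kChunks * kN * sizeof(float));
    for (int c = 0; c < kChunks; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < kChunks; ++c) {
        float *h = h_buf + c * kN, *d = d_buf + c * kN;
        // Download, compute, and readback are queued per stream; the runtime
        // can overlap chunk c's kernel with chunk c+1's transfers.
        cudaMemcpyAsync(d, h, kN * sizeof(float), cudaMemcpyHostToDevice, s[c]);
        scale<<<(kN + 255) / 256, 256, 0, s[c]>>>(d, kN, 2.0f);
        cudaMemcpyAsync(h, d, kN * sizeof(float), cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();

    for (int c = 0; c < kChunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

The middleware's contribution, per the abstract, is choosing the *order* in which such commands are issued across streams; the hand-written round-robin above is the baseline it improves on.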

Year Published: 2011–2023

Publication Types

Select...
4
4

Relationship

3
5

Authors

Journals

Cited by 9 publications (7 citation statements)
References 7 publications
“…GPU-chariot extends our previous study [9] in order to reduce development efforts for multi-GPU systems by automating the abovementioned time-consuming tasks. Our framework allows for out-of-order execution of CUDA functions, and realizes efficient software pipelines in four stages: (1) CPU execution, (2) data download from the CPU to the GPU, (3) GPU execution, and (4) data readback from the GPU to the CPU.…”
Section: Introduction
confidence: 93%
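The four-stage pipeline described in this citation can be expressed as a single per-chunk command sequence if the CPU stage is enqueued into the stream itself. The fragment below is a minimal sketch of that idea, assuming `cudaLaunchHostFunc` (CUDA 10+) for the host stage; it is not GPU-chariot's actual API, and `square`, `Chunk`, and `enqueue_chunk` are hypothetical names.

```cuda
#include <cuda_runtime.h>

__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= d[i];
}

struct Chunk { float *h; int n; };  // pinned host buffer and element count

// Stage 1 (CPU execution), run in-stream by cudaLaunchHostFunc; later
// commands in the same stream wait for it, while other streams proceed.
void cpu_prepare(void *arg) {
    Chunk *c = (Chunk *)arg;
    for (int i = 0; i < c->n; ++i) c->h[i] += 1.0f;
}

void enqueue_chunk(cudaStream_t s, Chunk *c, float *d) {
    size_t bytes = c->n * sizeof(float);
    cudaLaunchHostFunc(s, cpu_prepare, c);                        // 1. CPU execution
    cudaMemcpyAsync(d, c->h, bytes, cudaMemcpyHostToDevice, s);   // 2. download
    square<<<(c->n + 255) / 256, 256, 0, s>>>(d, c->n);           // 3. GPU execution
    cudaMemcpyAsync(c->h, d, bytes, cudaMemcpyDeviceToHost, s);   // 4. readback
}
```

Calling `enqueue_chunk` for successive chunks on distinct streams lets the four stages of different chunks overlap, which is the software-pipeline structure the quoted statement attributes to GPU-chariot.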
See 1 more Smart Citation
“…GPU-chariot extends our previous study [9] in order to reduce development efforts for multi-GPU systems by automating the abovementioned time-consuming tasks. Our framework allows for out-oforder execution of CUDA functions, and realizes efficient software pipelines in four stages: (1) CPU execution, (2) data download from the CPU to the GPU, (3) GPU execution, and (4) data readback from the GPU to the CPU.…”
Section: Introductionmentioning
confidence: 93%
“…Because StreamIt is a high-level language for stream applications, it relieves programmers from having to write and optimize their kernels to fully utilize the fast memory resources on the GPU chip. Similar frameworks [9]-[11] that can achieve significant acceleration over CPU-based implementations are available; however, the kernels they generate are for single-GPU systems. Thus, an automated framework that addresses multi-GPU systems and scales application performance according to the number of available GPUs is required.…”
Section: Introduction
confidence: 99%
“…Therefore, the input data size reaches 256 MB when v = 16. Since this overhead occurs on the CPU, we think that the overhead can be overlapped with kernel execution by using a stream processing technique [5], [13]. In addition to this overhead, both of the kernels reduce the performance to approximately 50%.…”
Section: A Performance Comparison With Previous Scheme
confidence: 99%
“…In recent years, the use of schedulers based on many-core or heterogeneous architectures, for general or for specific applications, has been widely studied [48,27]. S. Yamagiwa et al. [48] propose GPGPU streaming based on a distributed computing environment; S. Nakagawa et al. [27] provide a new middleware capable of out-of-order execution of work and data transfers using stream processing.…”
confidence: 99%
“…S. Yamagiwa et al. [48] propose GPGPU streaming based on a distributed computing environment; S. Nakagawa et al. [27] provide a new middleware capable of out-of-order execution of work and data transfers using stream processing. Other works [13,46] follow a similar strategy based on streaming to minimize data-transfer overhead.…”
confidence: 99%