2021
DOI: 10.1109/access.2021.3073955
Efficient Inter-Device Task Scheduling Schemes for Multi-Device Co-Processing of Data-Parallel Kernels on Heterogeneous Systems

Abstract: Heterogeneous systems consisting of multiple multi-core CPUs and many-core accelerators have recently come into wide use, and more and more parallel applications are being developed for such heterogeneous systems. To fully utilize multiple compute devices for cooperative and concurrent execution of data-parallel kernels on heterogeneous systems, a feedback-based dynamic and elastic task scheduling scheme is proposed, which provides better load balance, higher device utilization, and lower scheduling overhead…
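
To make the scheduling idea concrete, a minimal sketch of such a feedback-based dynamic chunking loop is given below in C++/OpenMP. It is illustrative only: the Device type, the chunk-sizing rule (roughly one millisecond of work per chunk), and the 1024-iteration base granularity are assumptions of this sketch, not details of the paper's actual policy.

    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <vector>
    #include <omp.h>

    // Illustrative device handle (hypothetical): runs one chunk of the
    // iteration space and returns the elapsed seconds.
    struct Device {
        double throughput = 1.0e6;  // iterations/second, refined by feedback

        double run_chunk(std::size_t first, std::size_t count) {
            auto t0 = std::chrono::steady_clock::now();
            volatile double sink = 0.0;
            for (std::size_t i = first; i < first + count; ++i)
                sink += static_cast<double>(i);  // stand-in for the real kernel
            auto t1 = std::chrono::steady_clock::now();
            return std::chrono::duration<double>(t1 - t0).count();
        }
    };

    // Feedback-based dynamic chunking: each device repeatedly claims a
    // chunk sized in proportion to its most recently measured speed, so
    // faster devices take more work and the load stays balanced.
    void co_execute(std::vector<Device>& devices, std::size_t total_iters) {
        std::atomic<std::size_t> next{0};
        #pragma omp parallel num_threads((int)devices.size())
        {
            Device& dev = devices[omp_get_thread_num()];
            for (;;) {
                std::size_t chunk = std::max<std::size_t>(
                    1024, static_cast<std::size_t>(dev.throughput * 1e-3));
                std::size_t first = next.fetch_add(chunk);
                if (first >= total_iters) break;
                std::size_t count = std::min(chunk, total_iters - first);
                double secs = dev.run_chunk(first, count);
                if (secs > 0.0) dev.throughput = count / secs;  // feedback
            }
        }
    }

Each device claims work in chunks proportional to its observed speed, which is the basic mechanism by which this family of schedulers balances load across devices of very different throughput.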

Cited by 13 publications (6 citation statements) · References 37 publications
“…For the sake of brevity, the details of the HCE runtime system and the inter-device scheduling policies are not described in this article; please refer to our previous work. 17,18 In short, with the help of HeteroPP, programmers only need to focus on writing data-parallel compute kernels using the extended OpenMP directives and clauses, and do not need to care about the complicated implementation details of multi-device co-processing of those kernels.…”
Section: Overall Design of HeteroPP
confidence: 99%
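
For context, the kind of data-parallel kernel that such directive-based programming targets can be expressed with stock OpenMP offload constructs, as in the sketch below. It uses only standard OpenMP 4.5+ directives; HeteroPP's actual extended directives and clauses are not reproduced here, since their syntax is not given in the excerpt.

    #include <cstddef>

    // A data-parallel SAXPY kernel offloaded with standard OpenMP device
    // directives. HeteroPP-style extensions would additionally let the
    // runtime split this iteration space across several devices.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }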
“…The runtime system is mainly composed of four components: device management, memory management, task scheduling, and transfer optimization, and each component provides a series of runtime application programming interfaces (APIs). The runtime system currently implements our previously proposed inter-device scheduling policies, 17,18 including FDETS (the feedback-based dynamic and elastic task scheduling policy), MFDETS (a modified FDETS that supports incremental data transfer), ADETS (the asynchronous dynamic and elastic task scheduling policy), and MADETS (a modified ADETS that supports three-way overlapping communication optimization). For the sake of brevity, the details of the HCE runtime system and the inter-device scheduling policies are not described in this article; please refer to our previous work. 17,18 …”
Section: Overall Design of HeteroPP
confidence: 99%
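
As an illustration of what the asynchronous, overlap-oriented variants aim at, the following double-buffered loop overlaps the transfer of one chunk with the computation of another using OpenMP's nowait and depend clauses. This is a sketch under assumed semantics, not the actual ADETS/MADETS implementation; the chunking scheme and the stand-in kernel are inventions of the example.

    #include <cstddef>

    // Pipelined offload sketch: while chunk k is being computed on the
    // device, chunk k+1 can already be in flight, approximating the
    // communication/computation overlap that asynchronous policies exploit.
    void pipelined(float* data, std::size_t n, std::size_t chunk) {
        for (std::size_t off = 0; off < n; off += chunk) {
            std::size_t len = (off + chunk < n) ? chunk : n - off;
            // Stage the chunk on the device asynchronously.
            #pragma omp target enter data map(to: data[off:len]) \
                    depend(out: data[off]) nowait
            // Compute on the staged chunk once its transfer has finished.
            #pragma omp target teams distribute parallel for \
                    depend(inout: data[off]) nowait
            for (std::size_t i = off; i < off + len; ++i)
                data[i] *= 2.0f;  // stand-in kernel
            // Copy the result back, again without blocking the host.
            #pragma omp target exit data map(from: data[off:len]) \
                    depend(in: data[off]) nowait
        }
        #pragma omp taskwait  // drain all outstanding transfers and kernels
    }

Because each chunk carries its own dependence chain (keyed on data[off]), transfers and kernels belonging to different chunks are free to overlap, while the stages within one chunk stay correctly ordered.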
“…In SIMD computing, several values are processed simultaneously by a single instruction, in contrast with the typical structure of CPUs. Consequently, GPUs require specific data structures and scheduling approaches to fully exploit their parallel capabilities [12, 18-23]. Owing to their smaller memory relative to the host, GPUs can only accommodate a subset of the complete graph as input.…”
Section: Introduction
confidence: 99%
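
The memory constraint mentioned here is commonly handled by streaming the graph through the device in partitions that fit its memory. The sketch below illustrates the idea for an edge list; the Edge layout, the degree-counting kernel, and the device_capacity parameter are assumptions of this illustration, not taken from the cited works.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Edge { int src, dst; };

    // Stream an edge list that exceeds device memory through the GPU in
    // fixed-size partitions, so each offloaded chunk fits on the device.
    void count_in_degrees(const std::vector<Edge>& edges,
                          std::vector<int>& degree,       // per-vertex result
                          std::size_t device_capacity) {  // edges per chunk
        int* deg = degree.data();
        std::size_t nv = degree.size();
        #pragma omp target enter data map(to: deg[0:nv])
        for (std::size_t off = 0; off < edges.size(); off += device_capacity) {
            std::size_t len = std::min(device_capacity, edges.size() - off);
            const Edge* chunk = edges.data() + off;
            // Only the current partition of the edge list is resident.
            #pragma omp target teams distribute parallel for map(to: chunk[0:len])
            for (std::size_t i = 0; i < len; ++i) {
                #pragma omp atomic
                deg[chunk[i].dst]++;
            }
        }
        #pragma omp target exit data map(from: deg[0:nv])
    }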