Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2015
DOI: 10.1145/2688500.2688505
VirtCL: a framework for OpenCL device abstraction and management

Abstract: The interest in using multiple graphics processing units (GPUs) to accelerate applications has increased in recent years. However, the existing heterogeneous programming models (e.g., OpenCL) abstract details of GPU devices at the per-device level and require programmers to explicitly schedule their kernel tasks on a system equipped with multiple GPU devices. Unfortunately, multiple applications running on a multi-GPU system may compete for some of the GPU devices while leaving other GPU devices unused. Moreo…
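For context on the abstract's point that plain OpenCL leaves kernel placement to the programmer, the sketch below shows the baseline host-side boilerplate needed just to enumerate the GPUs and pick one of them for a kernel launch. It is an illustrative fragment of standard OpenCL usage, not VirtCL's API; the function name run_on_chosen_gpu and the fixed device index are made up, and error handling is omitted.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstddef>
    #include <vector>

    // Illustrative only: with raw OpenCL the application enumerates the GPUs
    // itself and decides which device runs each kernel.
    void run_on_chosen_gpu(std::size_t chosen) {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);

        cl_uint num_gpus = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, nullptr, &num_gpus);
        std::vector<cl_device_id> gpus(num_gpus);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_gpus, gpus.data(), nullptr);

        // The programmer, not the runtime, picks the device. This manual choice
        // is the explicit scheduling decision the paper argues a framework
        // should take over, especially when several applications share the GPUs.
        cl_device_id dev = gpus[chosen];
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, nullptr);

        // ... build the program, create the kernel, set its arguments, then
        // clEnqueueNDRangeKernel(queue, ...) on the chosen device ...

        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
    }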

Cited by 35 publications (21 citation statements)
References 23 publications
“…
    auto kernel = file_read("binomial.cl");
    auto samples = 16777216; auto steps = 254;
    auto steps1 = steps + 1; auto lws = steps1;
    auto samplesBy4 = samples / 4;
    auto gws = lws * samplesBy4;
    vector<cl_float4> in(samplesBy4);
    vector<cl_float4> out(samplesBy4);

    binomial_init_setup(samplesBy4, in, out);
    // ... (listing lines 10-17 omitted in the excerpt) ...
    program.in(in);
    program.out(out);

    program.out_pattern(1, lws);

    program.kernel(kernel, "binomial_opts");
    program.arg(0, steps);   // positional by index
    program.arg(in);         // aggregate
    program.arg(out);
    program.arg(steps1 * sizeof(cl_float4),
                ecl::Arg::LocalAlloc);
    program.arg(4, steps * sizeof(cl_float4),
                ecl::Arg::LocalAlloc);

    engine.use(std::move(program));

    engine.run();

    // if (engine.has_errors())                // [Optional lines]
    //     for (auto& err : engine.get_errors())
    //         show or process errors

Listing 1: EngineCL API used in Binomial benchmark.…”
Section: Case 1: Using Only One Device
confidence: 99%
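A side note on the ecl::Arg::LocalAlloc arguments in Listing 1: in the underlying OpenCL C API, a __local kernel argument is set by passing only its size and a null data pointer to clSetKernelArg. The sketch below shows that equivalent, assuming the unnumbered local argument of the listing lands at index 3 and that the kernel object and the steps/steps1 values are already set up; the helper name set_binomial_local_args is made up.

    #include <CL/cl.h>

    // Raw-OpenCL equivalent of the two LocalAlloc arguments in Listing 1.
    // For local memory, clSetKernelArg receives the size in bytes and a NULL
    // pointer; the runtime allocates that much __local memory per work-group.
    void set_binomial_local_args(cl_kernel kernel, int steps1, int steps) {
        clSetKernelArg(kernel, 3, steps1 * sizeof(cl_float4), NULL);
        clSetKernelArg(kernel, 4, steps * sizeof(cl_float4), NULL);
    }

EngineCL's Arg::LocalAlloc presumably wraps this pattern so that argument indices and null pointers do not have to be tracked by hand.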
“…The experiments have been carried out using two different machines to validate both code portability and performance of EngineCL.

    auto kernel = file_read("nbody.cl");
    auto gpu_kernel = file_read("nbody.gpu.cl");
    auto phi_kernel_bin =
        file_read_binary("nbody.phi.cl.bin");
    auto bodies = 512000; auto del_t = 0.005f;
    auto esp_sqr = 500.0f; auto lws = 64;
    auto gws = bodies;
    vector<cl_float4> in_pos(bodies);
    vector<cl_float4> in_vel(bodies);
    vector<cl_float4> out_pos(bodies);
    vector<cl_float4> out_vel(bodies);

    nbody_init_setup(bodies, del_t, esp_sqr, in_pos,
                     in_vel, out_pos, out_vel);

    ecl::EngineCL engine;
    engine.use(ecl::Device(0, 0),
               ecl::Device(0, 1, phi_kernel_bin),
               ecl::Device(1, 0, gpu_kernel));

    engine.work_items(gws, lws);

    auto props = { 0.08, 0.3 };
    engine.scheduler(ecl::Scheduler::Static(props));

    ecl::Program program;
    program.in(in_pos);
    program.in(in_vel);
    program.out(out_pos);
    program.out(out_vel);

    program.kernel(kernel, "nbody");
    program.args(in_pos, in_vel, bodies, del_t,
                 esp_sqr, out_pos, out_vel);

    engine.program(std::move(program));

    engine.run();

Listing 2: EngineCL API used in NBody benchmark.…”
Section: System Setup
confidence: 99%
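The Scheduler::Static(props) call in Listing 2 passes two proportions for three devices, which suggests that the last device receives the remaining share of the NDRange. The sketch below only illustrates such a static split; it is not EngineCL's implementation, and the helper name split_work as well as the work-group-aligned rounding are assumptions.

    #include <cstddef>
    #include <vector>

    // Split a global work size among devices by fixed proportions, rounding
    // each share down to a multiple of the local work size and giving the
    // remainder to the last device.
    std::vector<std::size_t> split_work(std::size_t gws, std::size_t lws,
                                        const std::vector<double>& props) {
        std::vector<std::size_t> shares;
        std::size_t assigned = 0;
        for (double p : props) {
            std::size_t share = static_cast<std::size_t>(gws * p);
            share -= share % lws;            // keep each share work-group aligned
            shares.push_back(share);
            assigned += share;
        }
        shares.push_back(gws - assigned);    // last device takes the remainder
        return shares;
    }

    // Example: split_work(512000, 64, {0.08, 0.3}) assigns roughly 8%, 30% and
    // 62% of the 512000 work-items to the three devices of Listing 2.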
“…To exploit heterogeneous multi-core architectures containing cores of two different ISAs, such as CPU and GPU, data-parallel applications are potential candidates, and they can be developed using the open standard Open Computing Language (OpenCL) [2], [40], [41]. It also supports the exploitation of multiple devices, e.g., CPU and GPU, by a single program.…”
Section: OpenCL and FreeOCL
confidence: 99%
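To make the single-program, multiple-device point concrete, the fragment below shows one OpenCL host program requesting both a CPU and a GPU device from the same platform. It is a generic illustration (one platform, one device of each type, no error handling), not code from the cited paper.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>

    // One host program, two device types: kernels can later be enqueued on
    // either device, e.g. through one command queue per device.
    void get_cpu_and_gpu(cl_device_id* cpu, cl_device_id* gpu) {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, cpu, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, gpu, nullptr);
    }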
“…To achieve this, it is necessary to consider the behaviour of the kernels themselves. When the data-set of a kernel is divided into equally sized portions, or packages, it can be expected that each one will require the same execution time. This happens in well-behaved, regular kernels, but it is not always the case.…”
Section: Introduction
confidence: 99%
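To illustrate the equal-package assumption discussed above: cutting the index space into fixed-size packages is straightforward, but equal size only implies equal execution time when every work-item costs roughly the same, which irregular kernels violate. The function name and package size below are arbitrary.

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Split [0, total) into equally sized packages (half-open ranges). For
    // regular kernels each package takes about the same time; for irregular
    // kernels some packages finish much earlier, motivating dynamic balancing.
    std::vector<std::pair<std::size_t, std::size_t>>
    make_packages(std::size_t total, std::size_t package_size) {
        std::vector<std::pair<std::size_t, std::size_t>> packages;
        for (std::size_t begin = 0; begin < total; begin += package_size) {
            std::size_t end = begin + package_size < total ? begin + package_size
                                                           : total;
            packages.emplace_back(begin, end);
        }
        return packages;
    }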
“…You [35], Zhong [8] and Ashwin [36] do address both load balancing while abstracting the underlying system and data movement. Nevertheless, their focus is on task-parallelism instead of on the co-execution of a single data-parallel kernel.…”
confidence: 99%