MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL

Aji, Ashwin M.; Peña, Antonio J.; Balaji, Pavan; Feng, Wu-chun

doi:10.1016/j.parco.2016.05.006

Cited by 23 publications

(13 citation statements)

References 22 publications


“…To address the limitations and the idle-cycles introduced by the multi-devices in-order execution mode of OpenCL, a number of frameworks has been proposed. For instance, VirtCL [53], SnuCL [34], PySchedCL [21], FluidiCL [42], MultiCL [2], EngineCL [39] and SOCL [26] focus on single or multi-task level scheduling for standalone or partitioned OpenCL applications. A common denominator of all aforementioned frameworks is the fact that they solely focus on non-managed applications, thereby leaving the area of managed languages unexplored.…”

Section: OpenCL Execution Modes (mentioning)

confidence: 99%

Multiple-tasks on multiple-devices (MTMD): exploiting concurrency in heterogeneous managed runtimes

Papadimitriou, Markou, Fumero, et al. 2021

Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments


Modern commodity devices are nowadays equipped with a plethora of heterogeneous devices serving different purposes. Being able to exploit such heterogeneous hardware accelerators to their full potential is of paramount importance in the pursuit of higher performance and energy efficiency. Towards these objectives, the reduction of idle time of each device as well as the concurrent program execution across different accelerators can lead to better scalability within the computing platform. In this work, we propose a novel approach for enabling a Java-based heterogeneous managed runtime to automatically and efficiently deploy multiple tasks on multiple devices. We extend TornadoVM with parallel execution of bytecode interpreters to dynamically and concurrently manage and execute arbitrary tasks across multiple OpenCL-compatible devices. In addition, in order to achieve an efficient device-task allocation, we employ a machine learning approach with a multiple-classification architecture of Extra-Trees-Classifiers. Our proposed solution has been evaluated over a suite of 12 applications split into three different groups. Our experimental results showcase performance improvements of up to 83% compared to all tasks running on the single best device, while reaching up to 91% of the oracle performance.



“…

    auto kernel = file_read("binomial.cl");
    auto samples = 16777216; auto steps = 254;
    auto steps1 = steps + 1; auto lws = steps1;
    auto samplesBy4 = samples / 4;
    auto gws = lws * samplesBy4;
    vector<cl_float4> in(samplesBy4);
    vector<cl_float4> out(samplesBy4);

    binomial_init_setup(samplesBy4, in, out);

    // [lines 10-17 of the original listing are not part of this excerpt]

    program.in(in);
    program.out(out);

    program.out_pattern(1, lws);

    program.kernel(kernel, "binomial_opts");
    program.arg(0, steps); // positional by index
    program.arg(in); // aggregate
    program.arg(out);
    program.arg(steps1 * sizeof(cl_float4),
                ecl::Arg::LocalAlloc);
    program.arg(4, steps * sizeof(cl_float4),
                ecl::Arg::LocalAlloc);

    engine.use(std::move(program));

    engine.run();

    // if (engine.has_errors()) // [Optional lines]
    //   for (auto& err : engine.get_errors())
    //     show or process errors

Listing 1: EngineCL API used in Binomial benchmark.…”

Section: Case 1: Using Only One Device (mentioning)

confidence: 99%
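The size arithmetic at the top of Listing 1 can be checked in isolation. The following is a minimal sketch, not EngineCL code: `BinomialSizes` and `binomial_sizes` are hypothetical names introduced here, and the 16-byte size stands in for `sizeof(cl_float4)` (four 32-bit floats) so the snippet compiles without the OpenCL headers. Reading `samplesBy4` as the number of work-groups is an interpretation of the listing, not something the excerpt states.

```cpp
#include <cstdint>

// Hypothetical container for the values derived in Listing 1.
struct BinomialSizes {
    std::int64_t lws;          // work-items per work-group (steps + 1)
    std::int64_t samples_by4;  // samples / 4, one cl_float4 per four samples
    std::int64_t gws;          // total work-items (lws * samples_by4)
    std::int64_t local_bytes;  // steps1 * sizeof(cl_float4), assuming 16 bytes
};

// Recomputes the sizes exactly as the listing does, using 64-bit integers
// so that larger inputs than the one in the listing cannot overflow.
constexpr BinomialSizes binomial_sizes(std::int64_t samples, std::int64_t steps) {
    return BinomialSizes{
        steps + 1,
        samples / 4,
        (steps + 1) * (samples / 4),
        (steps + 1) * 16  // 16 = assumed sizeof(cl_float4)
    };
}
```

For the values in the listing (`samples = 16777216`, `steps = 254`), this gives a local size of 255 and a global size of 1069547520 work-items, which still fits in the 32-bit `int` that `auto` deduces in the original code.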

“…The experiments have been carried out using two different machines to validate both code portability and performance of EngineCL.

    auto kernel = file_read("nbody.cl");
    auto gpu_kernel = file_read("nbody.gpu.cl");
    auto phi_kernel_bin =
        file_read_binary("nbody.phi.cl.bin");
    auto bodies = 512000; auto del_t = 0.005f;
    auto esp_sqr = 500.0f; auto lws = 64;
    auto gws = bodies;
    vector<cl_float4> in_pos(bodies);
    vector<cl_float4> in_vel(bodies);
    vector<cl_float4> out_pos(bodies);
    vector<cl_float4> out_vel(bodies);

    nbody_init_setup(bodies, del_t, esp_sqr, in_pos,
                     in_vel, out_pos, out_vel);

    ecl::EngineCL engine;
    engine.use(ecl::Device(0, 0),
               ecl::Device(0, 1, phi_kernel_bin),
               ecl::Device(1, 0, gpu_kernel));

    engine.work_items(gws, lws);

    auto props = { 0.08, 0.3 };
    engine.scheduler(ecl::Scheduler::Static(props));

    ecl::Program program;
    program.in(in_pos);
    program.in(in_vel);
    program.out(out_pos);
    program.out(out_vel);

    program.kernel(kernel, "nbody");
    program.args(in_pos, in_vel, bodies, del_t,
                 esp_sqr, out_pos, out_vel);

    engine.program(std::move(program));

    engine.run();

Listing 2: EngineCL API used in NBody benchmark.…”

Section: System Setup (mentioning)

confidence: 99%
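Listing 2 passes two proportions, 0.08 and 0.3, to a static scheduler while registering three devices. The excerpt does not explain how those proportions are applied, so the following is only a sketch of one plausible static split: it assumes each listed proportion yields a chunk rounded down to a multiple of the local work size, and that the last device takes whatever remains. `static_split` is a hypothetical helper and not part of the EngineCL API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not EngineCL API): divide `gws` work-items among
// props.size() + 1 devices. Each listed proportion receives its share,
// rounded down to a whole number of work-groups of size `lws`; the final
// device receives the remainder so that the chunks always sum to `gws`.
std::vector<std::size_t> static_split(std::size_t gws, std::size_t lws,
                                      const std::vector<double>& props) {
    std::vector<std::size_t> chunks;
    std::size_t assigned = 0;
    for (double p : props) {
        // Share of the global range, truncated to whole work-groups.
        std::size_t items = static_cast<std::size_t>(p * static_cast<double>(gws));
        std::size_t chunk = (items / lws) * lws;
        chunks.push_back(chunk);
        assigned += chunk;
    }
    chunks.push_back(gws - assigned);  // remainder for the last device
    return chunks;
}
```

Called as `static_split(512000, 64, {0.08, 0.3})`, the sketch returns three chunks of roughly 8%, 30% and 62% of the bodies. Which physical device receives which chunk in EngineCL is not stated in the excerpt.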

EngineCL: Usability and Performance in Heterogeneous Computing

Nozal, Bosque, Beivide, 2020

Future Generation Computer Systems


Heterogeneous systems have become one of the most common architectures today, thanks to their excellent performance and energy consumption. However, due to their heterogeneity they are very complex to program and even more to achieve performance portability on different devices. This paper presents EngineCL, a new OpenCL-based runtime system that outstandingly simplifies the co-execution of a single massive data-parallel kernel on all the devices of a heterogeneous system. It performs a set of low level tasks regarding the management of devices, their disjoint memory spaces and scheduling the workload between the system devices while providing a layered API. EngineCL has been validated in two compute nodes (HPC and commodity system), that combine six devices with different architectures. Experimental results show that it has excellent usability compared with OpenCL; a maximum 2.8% of overhead compared to the native version under loads of less than a second of execution and a tendency towards zero for longer execution times; and it can reach an average efficiency of 0.89 when balancing the load.


“…Introduced as an open standard, OpenCL is also designed for programming heterogeneous parallel systems. Some extensions exist [29] to enable the average OpenCL programmer to focus on the algorithm design rather than scheduling and to automatically gain performance without sacrificing programmability. After coding and running programs, it's important to evaluate the efficiency, the scalability and the portability of the code by using performance metrics for parallel programs (Def.…”

Section: Fundamental Basis For Parallelization (mentioning)

confidence: 99%

Concurrent computation of topological watershed on shared memory parallel machines

2017


The watershed transform is considered as the most appropriate method for image segmentation in the field of mathematical morphology. In the following paper, we present an adapted topological watershed algorithm suited for a rapid and effective implementation on Shared Memory Parallel Machine (SMPM). The introduced algorithm allows a parallel watershed computing while preserving the given topology. No prior minima extraction is needed, nor the use of any sorting step or hierarchical queue. The strategy that guides the parallel watershed computing, labeled SDM-Strategy (equivalent to Split-Distributes and Merge), is also presented. Experimental analyses such as execution time, performance enhancement, cache consumption, efficiency and scalability are also presented and discussed.

