“…Then, the last step of the profiling phase, Step C, is performed. At this point, the acceleration (speedup) between the two offloading modes is computed, which determines the best strategy for the device being profiled. Values higher than 1.0 indicate that the device benefits from workload-splitting strategies, which increase throughput by exploiting multiple command queues, overlap between computation and communication, and appropriate interleaving of management and computation, as demonstrated in previous studies [17,19,21–24]. Conversely, values lower than 1.0 indicate that the device is penalized by device management and chunk synchronization, by sharing the CPU with the simulator itself or other tasks, or by very short execution times, for which generating multiple chunks is usually counterproductive.…”
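
The Step C decision can be illustrated with a minimal sketch. The excerpt does not give the formula, so the sketch assumes the acceleration is the ratio of the time measured with a single whole-workload offload to the time measured with the workload split into chunks, which makes values above 1.0 favor splitting. The names `offload_speedup`, `choose_strategy`, `t_single` and `t_split` are hypothetical, and treating a value of exactly 1.0 as "no benefit" is also an assumption.

```python
def offload_speedup(t_single: float, t_split: float) -> float:
    """Acceleration between the two offloading modes.

    Assumed definition (not stated in the excerpt): time of the
    single-offload run divided by time of the chunked run, so that
    a result above 1.0 means splitting the workload was faster.
    """
    if t_single <= 0 or t_split <= 0:
        raise ValueError("profiled times must be positive")
    return t_single / t_split


def choose_strategy(t_single: float, t_split: float) -> str:
    """Pick the offloading mode for the profiled device.

    Returns "split" when the speedup exceeds 1.0, otherwise "single".
    A speedup of exactly 1.0 falls back to "single" (an assumption).
    """
    return "split" if offload_speedup(t_single, t_split) > 1.0 else "single"


if __name__ == "__main__":
    # Hypothetical profiled times in seconds, for illustration only.
    for t_single, t_split in [(2.0, 1.0), (1.0, 2.0)]:
        s = offload_speedup(t_single, t_split)
        print(f"speedup={s:.2f} -> {choose_strategy(t_single, t_split)}")
```

In this sketch a device whose chunked run takes half the time of the single offload gets a speedup of 2.0 and is assigned the splitting strategy, while a device whose chunked run is slower keeps the single offload.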