Task parallel programming model + hardware acceleration = performance advantage

Dallou, Tamer; Lucas, Divino César Soares; Araújo, Guido; Morais, Lucas; Barbosa, Eduardo Ferreira; Frank, Michael; Bagley, Richard; Sayana, Raj

doi:10.1109/hotchips.2016.7936235

Cited by 2 publications

(2 citation statements)

References 0 publications


“…The earliest solutions consisted of processor extensions for improving scheduling of dependence-less tasks. Then, as StarSs and later OpenMP 4.0 introduced tasks with data dependencies [11,17], new architectures were proposed for reducing task graph management overhead, and HDL implementations of several of these architectures were conceived [7,9,22,23,24]. Kumar et al [15] developed hardware task queues that could be used for accelerating the dynamic scheduling of tasks with only parent/child dependencies.…”

Section: Related Work (mentioning)

confidence: 99%

Adding Tightly-Integrated Task Scheduling Acceleration to a RISC-V Multi-core Processor

Morais, Silva, Goldman et al. 2019

Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Self Cite


Task Parallelism is a parallel programming model that provides code annotation constructs to outline tasks and describe how their pointer parameters are accessed, so that they can be executed in parallel and asynchronously by a runtime capable of inferring and honoring their data dependence relationships. It is supported by several parallelization frameworks, such as OpenMP and StarSs.

Overhead related to automatic dependence inference and to the scheduling of ready-to-run tasks is a major performance-limiting factor of Task Parallel systems. To amortize this overhead, programmers usually trade the higher parallelism that could be leveraged from finer-grained work partitions for the higher runtime efficiency of coarser-grained work partitions. Such problems are even more severe for systems with many cores, as the task spawning frequency required to keep cores from starving grows linearly with their number.

To mitigate these problems, researchers have designed hardware accelerators to improve runtime performance. Nevertheless, the high CPU-accelerator communication overheads of these solutions have hampered their gains.

We thus propose a RISC-V based architecture that minimizes communication overhead between the HW Task Scheduler and the CPU by allowing Task Scheduling software to interact directly with the former through custom instructions. Empirical evaluation of the architecture is made possible by an FPGA prototype featuring an eight-core Linux-capable Rocket Chip implementing such instructions.

To evaluate the prototype performance, we both (1) adapted Nanos, a mature Task Scheduling runtime, to benefit from the new task-scheduling-accelerating instructions; and (2) developed Phentos, a new HW-accelerated lightweight Task Scheduling runtime. Our experiments show that task parallel programs using Nanos-RV (the Nanos version ported to our system) are on average 2.13 times faster than those being serviced by baseline Nanos, while programs running on Phentos are 13.19 times faster, considering geometric means. Using eight cores, Nanos-RV is able to deliver speedups with respect to serial execution of up to 5.62 times, while Phentos produces speedups of up to 5.72 times.
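The dependence inference that the abstract attributes to the runtime can be illustrated with a minimal sketch. It is not the algorithm used by Nanos, Phentos, or the hardware scheduler; it only assumes that each task declares the addresses it reads and writes, and the function name `infer_dependences` and the task tuples are hypothetical.

```python
from collections import defaultdict


def infer_dependences(tasks):
    """Infer which earlier tasks each task must wait for.

    `tasks` is a list of (name, inputs, outputs) tuples in spawn order,
    where inputs/outputs are lists of address-like keys (assumed format).
    Returns a dict mapping each task name to a set of predecessor names.
    """
    last_writer = {}             # address -> last task that wrote it
    readers = defaultdict(list)  # address -> tasks that read it since that write
    deps = {}
    for name, ins, outs in tasks:
        wait = set()
        for addr in ins:
            if addr in last_writer:
                wait.add(last_writer[addr])   # read-after-write
        for addr in outs:
            if addr in last_writer:
                wait.add(last_writer[addr])   # write-after-write
            wait.update(readers[addr])        # write-after-read
        deps[name] = wait
        # Record this task's accesses for the tasks spawned after it.
        for addr in ins:
            readers[addr].append(name)
        for addr in outs:
            last_writer[addr] = name
            readers[addr] = []
    return deps


# Hypothetical task graph: A produces x, B and C read x, D joins them.
example = [
    ("A", [], ["x"]),
    ("B", ["x"], ["y"]),
    ("C", ["x"], ["z"]),
    ("D", ["y", "z"], ["x"]),
]
deps = infer_dependences(example)
```

In this example B and C depend only on A and may run in parallel, while D must wait for all three. Doing this bookkeeping for every spawned task is the kind of per-task work that the abstract identifies as overhead.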


“…Nonetheless, as the analysis of Section 3 shall demonstrate, the performance of these systems is severely degraded when they are used to serve task applications generating fine-granularity tasks, that is, tasks with execution times in the range from 1 to 100 µs. Accelerator-based Task Scheduling systems aim to improve Task Scheduling performance by implementing several scheduling actions in an FPGA-based accelerator, which interacts with task applications through the API provided by a lightweight SW Runtime [Yazdanpanah et al 2015, Dallou et al 2013, Wang et al 2013, Bamnote and Nerkar 2015, Dallou et al 2016]. Such organization is depicted in Fig.…”

Section: Software-based Task Scheduling (Sw-ts) (mentioning)

confidence: 99%

Using Petri-Net Modelling to Support the Case for HW-Assisted Task Scheduling

Morais, Goldman, Araújo 2017

Anais Do XVIII Simpósio Em Sistemas Computacionais De Alto Desempenho (SSCAD 2017)

Self Cite


Given the pervasiveness of multi-core processors in systems from various domains, the need for efficient parallelization tools has only increased during the last decade. Among the paradigms built to answer this demand, Task Parallelism stands out as a highly productive tool for leveraging data parallelism with minimal code changes. Nonetheless, its current supporting runtimes cannot efficiently execute workloads involving tasks in the fine-grained 1-100 µs range, limiting its applicability. By performing a thorough Petri-Net-based analysis of task parallel systems with several degrees of HW assistance, we show that the development of native CPU support for Task Parallelism is key to efficiently serving these challenging workloads.
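A back-of-the-envelope calculation suggests why the 1-100 µs range is demanding. The sketch below is not taken from the paper's Petri-Net model; it assumes an idealized steady state in which every core is always busy and a single scheduler dispatches tasks one at a time. The function names and the eight-core figure are illustrative assumptions.

```python
def required_spawn_rate(cores, task_time_us):
    """Tasks per second needed to keep `cores` busy with tasks of this length."""
    return cores * 1_000_000 / task_time_us


def per_task_budget_us(cores, task_time_us):
    """Scheduling time available per task (in µs) if dispatch is fully
    serialized through one scheduler (an assumption of this sketch)."""
    return task_time_us / cores


# Illustrative eight-core machine at both ends of the 1-100 µs range.
for t_us in (100, 1):
    rate = required_spawn_rate(8, t_us)
    budget = per_task_budget_us(8, t_us)
    print(f"{t_us} µs tasks: {rate:.0f} tasks/s, {budget} µs per task")
```

Under these assumptions, 100 µs tasks leave 12.5 µs of scheduling time per task, while 1 µs tasks leave only 0.125 µs, so the per-task budget shrinks in proportion to both task length and core count.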
