Task Superscalar: An Out-of-Order Task Pipeline

Etsion, Yoav; Cabarcas, Felipe; Rico, Alejandro; Ramírez, Alex; Badía, Rosa M.; Ayguadé, Eduard; Labarta, Jesús; Valero, Mateo

doi:10.1109/micro.2010.13

Cited by 108 publications

(105 citation statements)

References 19 publications

“…We evaluate our proposals using an in-house trace-driven simulator, based on the methodology of [9], that models a multicore CPU connected to a discrete GPU through a PCIe bus. The simulator performs a coarse-grained modeling of the CPU, tracing the execution of our benchmarks on an Intel Core i7 930 chip.…”

Section: Methods (mentioning)

confidence: 99%

Enabling preemptive multiprogramming on GPUs

Tanasic

Gelado

Cabezas

et al. 2014

SIGARCH Comput. Archit. News

Self Cite

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems are usually running multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.

“…The Task Superscalar [4] architecture was the first one to address this problem, proposing a decoupled model in which different finite state machines (modules) manage the most cumbersome functionalities of the runtime. The first implementation of the Task Superscalar architecture, the Hardware Task Superscalar, has already demonstrated high potential [5].…”

Section: Introduction (mentioning)

confidence: 99%

Picos: A hardware runtime architecture support for OmpSs

Yazdanpanah

Álvarez

Jiménez-González

et al. 2015

Future Generation Computer Systems

Self Cite

OmpSs is a programming model that provides a simple and powerful way of annotating sequential programs to exploit heterogeneity and task parallelism based on runtime data dependency analysis, dataflow scheduling and out-of-order task execution; it has greatly influenced Version 4.0 of the OpenMP standard. The current implementation of OmpSs achieves those capabilities with a pure-software runtime library: Nanos++. Therefore, although powerful and easy to use, the performance benefits of exploiting fine-grained (pico) task parallelism are limited by the software runtime overheads. To overcome this handicap we propose Picos, an implementation of the Task Superscalar (TSS) architecture that provides hardware support to the OmpSs programming model. Picos is a novel hardware dataflow-based task scheduler that dynamically analyses inter-task dependencies and identifies task-level parallelism at run-time. In this paper, we describe the Picos Hardware Design and the latencies of the main functionality of its components, based on the synthesis of their VHDL design. We have implemented a full cycle-accurate simulator based on those latencies to perform a design exploration of the characteristics and number of its components in a reasonable amount of time. Finally, we present a comparison of the Picos and Nanos++ runtime performance scalability with a set of real benchmarks. With Picos, a programmer can achieve ideal scalability using aggressive parallel strategies with a large number of fine-granularity tasks.

“…Dataflow computing offers a simple way to achieve high-performance, and high degree of concurrency and speculation, by means of implicit synchronization [2], [3]. Architectural exploitation of dataflow principles have been investigated in several research works [4]-[11]. Generally, dataflow-inspired execution models split the applications into a large set of threads [14], [25].…”

Section: Introduction (mentioning)

confidence: 99%

Dataflow Support in x86_64 Multicore Architectures through Small Hardware Extensions

Mondelli

Scionti

et al. 2015

2015 Euromicro Conference on Digital System Design

The path towards future high performance computers requires architectures able to efficiently run multi-threaded applications. In this context, dataflow-based execution models can improve the performance by limiting the synchronization overhead, thanks to a simple producer-consumer approach. This paper advocates the ISE of standard cores with a small hardware extension for efficiently scheduling the execution of threads on the basis of dataflow principles. A set of dedicated instructions allows the code to interact with the scheduler. Experimental results demonstrate that the combination of dedicated scheduling units and a dataflow execution model improves the performance when compared with other techniques for code parallelization (e.g., OpenMP, Cilk).
