Flexible architectural support for fine-grain scheduling

Sánchez, Daniel; Yoo, Richard M.; Kozyrakis, Christos

doi:10.1145/1735970.1736055

Cited by 34 publications

(31 citation statements)

References 52 publications


“…Task queue virtualization: Applications may create an unbounded number of tasks and schedule them for a future time. Swarm uses an overflow/underflow mechanism to give the illusion of unbounded hardware task queues [27,41,64]. When the per-tile task queue is nearly full, the task unit dispatches a special, non-speculative coalescer task to one of the cores.…”

Section: Handling Limited Queue Sizes (mentioning)

confidence: 99%
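
The overflow/underflow idea in the statement above can be illustrated with a minimal software sketch. This is not Swarm's hardware mechanism: the class `SpillingTaskQueue`, its parameters, and the policy of spilling the later-timestamp half of the queue are all assumptions made for illustration. It only shows how a bounded queue plus a memory-backed store can look unbounded to the application.

```python
import heapq
import itertools


class SpillingTaskQueue:
    """Bounded priority queue that spills to an unbounded backing store.

    Hypothetical illustration only; names and policy are assumptions.
    """

    def __init__(self, capacity, spill_threshold):
        assert 0 < spill_threshold <= capacity
        self.capacity = capacity
        self.spill_threshold = spill_threshold
        self.hw = []        # stands in for the fixed-size task queue (min-heap)
        self.spilled = []   # stands in for memory-backed overflow storage (min-heap)
        self._seq = itertools.count()  # tie-breaker so tasks are never compared

    def enqueue(self, timestamp, task):
        heapq.heappush(self.hw, (timestamp, next(self._seq), task))
        if len(self.hw) >= self.spill_threshold:
            self._coalesce()

    def _coalesce(self):
        # Overflow path: keep the earliest half, move the rest to memory.
        ordered = sorted(self.hw)
        keep = len(ordered) // 2
        self.hw = ordered[:keep]  # a sorted list is already a valid min-heap
        for item in ordered[keep:]:
            heapq.heappush(self.spilled, item)

    def dequeue(self):
        # Underflow path: pull a spilled task back in when it is the earliest,
        # or when the bounded queue has run dry.
        if self.spilled and (not self.hw or self.spilled[0] < self.hw[0]):
            heapq.heappush(self.hw, heapq.heappop(self.spilled))
        if not self.hw:
            return None
        timestamp, _, task = heapq.heappop(self.hw)
        return timestamp, task
```

With a capacity of 4, enqueuing six tasks never leaves more than four entries in `hw`, yet every task is still returned in timestamp order.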

A scalable architecture for ordered parallelism

Jeffrey, Subramanian, Yan et al. 2015

Proceedings of the 48th International Symposium on Microarchitecture

Self Cite


We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51-122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3-18×.



“…Delegation schemes divide shared data among threads and send updates to the corresponding thread, using shared-memory queues [11] or active messages [55,61]. Delegation is common in architectures that combine shared memory and message passing [55,64] and in NUMA-aware data structures [11,12].…”

Section: Software Techniques (mentioning)

confidence: 99%

“…Delegation is common in architectures that combine shared memory and message passing [55,64] and in NUMA-aware data structures [11,12]. Delegation is the software counterpart to RMOs, and is subject to the same tradeoffs: it reduces data movement and synchronization, but incurs global traffic and serialization.…”

Section: Software Techniques (mentioning)

confidence: 99%
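
A minimal sketch of delegation over shared-memory queues, as described in the two statements above: each owner thread holds one shard of the data, and every update is sent to the owner's inbox rather than applied in place. The function `run_delegation`, the hash-based sharding, and the `None` shutdown message are assumptions for illustration, not taken from the cited work.

```python
import queue
import threading


def run_delegation(updates, num_owners=2):
    """Apply (key, delta) updates by delegating each to its owner thread."""
    inboxes = [queue.Queue() for _ in range(num_owners)]
    shards = [dict() for _ in range(num_owners)]  # shard i is touched only by owner i

    def owner(i):
        while True:
            msg = inboxes[i].get()
            if msg is None:  # hypothetical shutdown signal
                return
            key, delta = msg
            shards[i][key] = shards[i].get(key, 0) + delta

    threads = [threading.Thread(target=owner, args=(i,)) for i in range(num_owners)]
    for t in threads:
        t.start()

    # Senders never touch the data; they only enqueue a message for the owner.
    for key, delta in updates:
        inboxes[hash(key) % num_owners].put((key, delta))

    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()

    merged = {}
    for shard in shards:
        merged.update(shard)
    return merged
```

No locks protect the shards because only one thread ever writes each one. The cost named in the statement is visible too: every update to a given key is serialized through a single owner's queue.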

Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems

Zhang, Horn, Sánchez 2015

Proceedings of the 48th International Symposium on Microarchitecture

Self Cite


We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model. We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.
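
The buffer-then-reduce behavior described in this abstract can be mimicked in software for a single commutative operation, addition. The sketch below is a loose analogy, not the coherence protocol itself; `CommutativeCounter` and its methods are hypothetical names.

```python
class CommutativeCounter:
    """Software analogy of update-only buffering for commutative additions."""

    def __init__(self, num_caches, initial=0):
        self.value = initial             # stands in for the shared copy of the line
        self.partial = [0] * num_caches  # per-cache buffered, coalesced deltas

    def add(self, cache_id, delta):
        # Update-only access: coalesce locally, no communication with other caches.
        self.partial[cache_id] += delta

    def read(self):
        # A read cannot be served from a partial buffer, so it triggers a
        # reduction of every buffered delta into the shared value.
        self.value += sum(self.partial)
        self.partial = [0] * len(self.partial)
        return self.value
```

Because addition is commutative and associative, the order in which caches issue `add` calls does not change what `read` returns.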


“…The growing popularity of task-based models has already motivated research into explicit hardware support for tasks. Carbon [13] and ADM [22] use hardware task queues to support fast task dispatch and stealing, whereas the Hyperprocessor [11] manages global dependencies using a universal register file.…”

Section: Related Work (mentioning)

confidence: 99%

A Dynamically Adaptable Hardware Transactional Memory

Lupon, Magklis, González 2010

2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture


We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline frontend that can be embedded into any manycore fabric and manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60 ns per task and dynamically uncover data dependencies among as many as ∼50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95-255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit manycore systems effectively, while simultaneously simplifying their programming model.
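
To make the idea of detecting inter-task dependencies from input/output annotations concrete, here is a minimal sketch. It is not the pipeline described in the abstract: `build_dependencies` is a hypothetical helper, and it conservatively orders read-after-write, write-after-write, and write-after-read hazards alike, which is an assumption of this example.

```python
def build_dependencies(tasks):
    """Derive ordering constraints from annotated inputs and outputs.

    tasks: list of (name, inputs, outputs) in program order.
    Returns {name: set of earlier task names it must wait for}.
    """
    last_writer = {}  # object -> most recent task that writes it
    readers = {}      # object -> tasks that read it since the last write
    deps = {}
    for name, inputs, outputs in tasks:
        wait = set()
        for obj in inputs:
            if obj in last_writer:
                wait.add(last_writer[obj])     # read-after-write
        for obj in outputs:
            if obj in last_writer:
                wait.add(last_writer[obj])     # write-after-write
            wait.update(readers.get(obj, ()))  # write-after-read
        for obj in inputs:
            readers.setdefault(obj, set()).add(name)
        for obj in outputs:
            last_writer[obj] = name
            readers[obj] = set()
        deps[name] = wait
    return deps
```

Tasks whose dependency sets do not include each other, directly or transitively, can run in parallel even though a single sequential thread generated them.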
