Abstract-The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability, and potential for optimization. However, with the expected increase in core counts, fine-grained tasking is required to exploit the available parallelism, which increases the overheads introduced by the runtime system. This work presents the Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU while still performing task scheduling in software. At a lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability, and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x lower area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.
I. INTRODUCTION

The end of Dennard scaling [1] and the subsequent stagnation of CPU clock frequencies have caused a dramatic increase in the core counts of multi-cores [2]. To fully exploit these large core counts in an efficient way, the hardware and the software stack must collaborate to avoid performance problems such as load imbalance or memory bandwidth exhaustion, while improving energy efficiency.

The growing complexity of multi-cores has brought sophisticated software mechanisms that aim at optimally managing parallel workloads. One of the most widespread approaches is task-based programming models, such as OpenMP 4.0 [3], which apply a data-flow execution model to orchestrate the execution of parallel tasks while respecting their control and data dependences. These programming models are a very appealing solution for programming complex multi-cores due to their benefits in performance, programmability, cross-platform flexibility, and potential for applying generic optimizations at the runtime system level [4]-[9].

A key aspect of this execution model is the granularity of the tasks. Fine-grain parallelism exposes large degrees of