Exploiting heterogeneous parallelism on a multithreaded multiprocessor

Alverson, Gail A.; Alverson, Robert; Callahan, David; Koblenz, Brian; Porterfield, Allan; Smith, Burton

doi:10.1145/143369.143408

Cited by 57 publications

(43 citation statements)

References 12 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For both the Alignment and SparseLU benchmarks, BOTS provides two different source files: one in which computation starts with a single initial task and another in which tasks are generated in a loop. 1 Other BOTS benchmarks are not presented here: UTS and FFT use of very fine-grained tasks without cutoffs, yielding poor performance on all run times, and floorplan raises compilation issues in ROSE.…”

Section: Discussionmentioning

confidence: 99%

Scheduling task parallelism on multi-socket multicore systems

Olivier

Porterfield

Wheeler

et al. 2011

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers

View full text Add to dashboard Cite

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run time system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well between a parent task and its newly created child tasks.We extended the open-source Qthreads threading library to implement our scheduler, accepting OpenMP programs through the ROSE compiler.We also present a comprehensive performance study of diverse OpenMP task parallel benchmarks, comparing seven different task parallel run time scheduler implementations on current generation multi-socket multicore systems: our hierarchical work stealing scheduler, a fully-distributed work stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the original Qthreads fullydistributed scheduler. In addition, we compare our results against OpenMP implementations from Intel and GCC. Hierarchical scheduling in Qthreads is competitive on all benchmarks. On several benchmarks, hierarchical scheduling in Qthreads demonstrates speedup and absolute performance superior to both the Intel and GCC OpenMP run time systems.

show abstract

Section: Discussionmentioning

confidence: 99%

Scheduling task parallelism on multi-socket multicore systems

Olivier

Porterfield

Wheeler

et al. 2011

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers

View full text Add to dashboard Cite

show abstract

“…In the early 90s the Tera Corporation, starting from the experience acquired with the Horizon machine, built the Tera MTA (Alverson et al, 1990(Alverson et al, , 1992. The MTA design consisted of 256 processors sharing 64 GB of memory organised as a distributed NUMA architecture.…”

Section: Multithreaded Architecturesmentioning

confidence: 99%

“…It is not possible to co-locate data and computation on the same node and data caches cannot be efficiently exploited. Multithreaded architectures, such as the Tera MTA (Alverson et al, 1990(Alverson et al, , 1992Snavely et al, 1998), are designed to mask memory long latency by rapidly switching between concurrent threads; hence, they are ideally suited for irregular applications.…”

Section: Introductionmentioning

confidence: 99%

Compiling irregular applications for reconfigurable systems

Halstead

Villarreal²,

Najjar

2014

IJHPCN

View full text Add to dashboard Cite

Algorithms that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures, particularly when memory access latency is variable. Many common data structures, including graphs, trees, and linked-lists, exhibit these irregular memory access patterns. While FPGA-based code accelerators have been successful on applications with regular memory access patterns, they have not been further explored for irregular memory access patterns. Multithreading has been shown to be an effective technique in masking long latencies. We describe the compiler generation of concurrent hardware threads for FPGAs with the objective of masking the memory latency caused by irregular memory access patterns. The CHAT compiler extends the ROCCC toolset to generate customised state information for each dynamically generated thread. Initial results show a speed-up of 50x.

show abstract

“…In the early 90's the Tera Corporation, building upon the experience acquired with the Horizon machine, built the Tera MTA [2,1]. The MTA design consisted of 256 processors sharing 64 GB of memory organized as a distributed NUMA architecture.…”

Section: Multithreaded Architecturesmentioning

confidence: 99%

Exploring irregular memory accesses on FPGAs

Halstead

Villarreal²,

Najjar

2011

Proceedings of the 1st Workshop on Irregular Applications: Architectures and Algorithms

View full text Add to dashboard Cite

Algorithms that exhibit irregular memory access patterns are known to show poor performance on multiprocessor architectures, particularly when memory access latency is variable. Many common data structures, including graphs, trees, and linked-lists, exhibit these irregular memory access patterns. While FPGA-based code accelerators have been successful on applications with regular memory access patterns, they have not been further explored for irregular memory access patterns. Multithreading has been shown to be an e↵ective technique in masking long latencies. We describe the compiler generation of concurrent hardware threads for FPGAs with the objective of masking the memory latency caused by irregular memory access patterns. We extend the ROCCC compiler to generate customized state information for each dynamically generated thread.

show abstract

Exploiting heterogeneous parallelism on a multithreaded multiprocessor

Cited by 57 publications

References 12 publications

Scheduling task parallelism on multi-socket multicore systems

Scheduling task parallelism on multi-socket multicore systems

Compiling irregular applications for reconfigurable systems

Exploring irregular memory accesses on FPGAs

Contact Info

Product

Resources

About