Toward high-throughput algorithms on many-core architectures

Orozco, Daniel; Garcia, Elkin; Khan, Rishi; Livingston, Kelly; Gao, Guang R.

doi:10.1145/2086696.2086728

Cited by 17 publications

(7 citation statements)

References 11 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This works reasonably well on a system with a low core count, but becomes cumbersome as the number of cores on a chip increases [24], [36]. This is in part due to the competition for shared resources that is becoming fiercer between cores as their number increases.…”

Section: Introductionmentioning

confidence: 90%

Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models

Chen

Zuckerman

et al. 2013

2013 IEEE International Symposium on Parallel &Amp; Distributed Processing, Workshops and PHD Forum

Self Cite

View full text Add to dashboard Cite

Abstract-The codelet model is a fine-grain dataflow-inspired program execution model that balances the parallelism and overhead of the runtime system. It plays an important role in terms of performance, scalability, and energy efficiency in exascale studies such as the DARPA UHPC project and the DOE X-Stack project.As an important application, the Fast Fourier Transform (FFT) has been deeply studied in fine-grain models, including the codelet model. However, the existing work focuses on how fine-grain models achieve more balanced workload comparing to traditional coarse-grain models. In this paper, we make an important observation that the flexibility of execution order of tasks in fine-grain models improves utilization of memory bandwidth as well. We use the codelet model and the FFT application as a case study to show that a proper execution order of tasks (or codelets) can significantly reduce memory contention and thus improve performance. We propose an algorithm that provides a heuristic guidance of the execution order of the codelets to reduce memory contention.We implemented our algorithm on the IBM Cyclops-64 architecture. Experimental results show that our algorithm improves up to 46% performance compared to a state-of-the-art coarsegrain implementation of the FFT application on Cyclops-64.

show abstract

Section: Introductionmentioning

confidence: 90%

Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models

Chen

Zuckerman

et al. 2013

2013 IEEE International Symposium on Parallel &Amp; Distributed Processing, Workshops and PHD Forum

Self Cite

View full text Add to dashboard Cite

show abstract

“…Unfortunately, however, the Gottlieb queue has been proven to be non-linearizable [2] due to the counters, which can cause the queue to appear empty or full spuriously. Orozco et al [13] present two related array queues called the Circular Buffer Queue (CB-Queue) and the High-Throughput Queue (HT-Queue). The CB-queue merges the Gottlieb queue's two counters per side into one and in so doing offers linearizability.…”

Section: Related Workmentioning

confidence: 99%

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Scogland

Feng

2015

Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering

View full text Add to dashboard Cite

As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly-efficient concurrent algorithms and data structures can become bottlenecks, unless they are designed from the ground up with throughput as their primary goal.In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi-and many-core architectures, (2) the design of a high-throughput, linearizable, blocking, concurrent FIFO queue for many-core architectures that avoids the bottlenecks and pitfalls common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to scale to as much as three orders of magnitude (1000×) faster than lock-free and combining queues on GPU platforms and two times (2×) faster on CPU devices. These results deliver critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can increase throughput.

show abstract

“…Unfortunately however, the Gottlieb queue has been proven to be non-linearizable [21]. Orozco et al [79] present two related array queues called the Circular Buffer Queue (CB-queue) and the High-Throughput…”

Section: Concurrent Data Structuresmentioning

confidence: 99%

Runtime Adaptation for Autonomic Heterogeneous Computing

Scogland

Feng

2014

2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

View full text Add to dashboard Cite

Heterogeneity is increasing across all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other coprocessors into everything from cell phones to supercomputers. More quietly it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life, efficiently managing heterogeneous compute resources is becoming a critical, and ever more complex, task. The focus of this dissertation is to lay the foundation for an autonomic system for heterogeneous computing, employing runtime adaptation to improve performance portability and performance consistency while maintaining or increasing programmability. We investigate heterogeneity arising from a myriad of factors, grouped into the dimensions of locality and capability. This work has resulted in runtime schedulers capable of automatically detecting and mitigating heterogeneity in physically homogeneous systems through MPI and adaptive coscheduling for physically heterogeneous accelerator based systems as well as a synthesis of the two to address multiple levels of heterogeneity as a coherent whole. We also discuss our current work towards the next generation of fine-grained scheduling and synchronization across heterogeneous platforms in the design of a highly-scalable and portable concurrent queue for many-core systems. Each component addresses aspects of the urgent need for automated management of the extreme and ever expanding complexity introduced by heterogeneity. I have also had the good fortune to collaborate with both Lawrence Livermore National Laboratory and Argonne National Laboratory. Experiences as an intern at each lab has had a significant effect on the final shape of my dissertation. Special thanks go to Dr. Pavan Balaji (again), Dr. Bronis de Supinski (again) and Dr. Barry Rountree. These collaborations proved to be major turning points for me, both in my research and my life as a whole.

show abstract

Toward high-throughput algorithms on many-core architectures

Cited by 17 publications

References 11 publications

Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models

Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Runtime Adaptation for Autonomic Heterogeneous Computing

Contact Info

Product

Resources

About