Programming Strategies for Irregular Algorithms on the Emu Chick

Hein, Eric R.; Eswar, Srinivas; Yaşar, Abdurrahman; Li, Jiajia; Young, Jeffrey; Conte, Thomas M.; Çatalyürek, Ümit V.; Vuduc, Richard; Riedy, Jason; Uçar, Bora

doi:10.1145/3418077

Cited by 5 publications

(2 citation statements)

References 48 publications

“…Although there are multiple vectors for growth and improved infrastructure with the Rogues Gallery, the testbed has already led to some early successes. These positive outcomes include several published academic papers [10,11,14,29,35], support for PhD thesis research, ongoing collaborations with external academic, industry, and government users, and a job offer from one of the rogue startups for at least one of our PhD students at Georgia Tech. We look forward to new infrastructure developments, student-focused activities like the Rogues Gallery VIP class, and further collaborations with other post-Moore architecture and system evaluation labs to help drive the next phase of the Rogues Gallery's evolution.…”

Section: Discussion (mentioning)

confidence: 99%

Wrangling Rogues: Managing Experimental Post-Moore Architectures

Powell, Riedy, Young, et al. 2018

Preprint

Self Cite

The Rogues Gallery is a new experimental testbed that is focused on tackling rogue architectures for the post-Moore era of computing. While some of these devices have roots in the embedded and high-performance computing spaces, managing current and emerging technologies poses challenges for system administration that are not always foreseen in traditional data center environments. We present an overview of the motivations and design of the initial Rogues Gallery testbed and cover some of the unique challenges that we have seen and foresee with upcoming hardware prototypes for future post-Moore research. Specifically, we cover networking, identity management, scheduling of resources, and tools and sensor access aspects of the Rogues Gallery along with techniques we have developed to manage these new platforms. We argue that current tools like the Slurm scheduler can support new rogues without major infrastructure changes.

“…Similarly, the Emu Chick architecture [9] divides a DRAM channel into multiple "Narrow-Channel DIMMs" (NCDIMMs) which allow for more fine-grained access for irregular applications. Recent tests of BFS and small-scale graph analytics applications [11] shows promise for the NCDIMM approach with stable performance for different real-world sparse matrix multiply inputs and comparable performance with x86 platforms for BFS on balanced (Erdös-Rényi) graphs. However, performance of graph analytics on the Emu Chick is currently limited not by the memory subsystem but by data layout and workload imbalance issues that create thread migration hotspots across the Chick's distributed nodes.…”

Section: Related Work (mentioning)

confidence: 99%

Performance Impact of Memory Channels on Sparse and Irregular Algorithms

Green, Fox, Young, et al. 2019

2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3)

Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessarily the raw bandwidth or even the latency of memory requests. Instead, we show that performance is proportional to the number of memory channels available to handle small data transfers with limited spatial locality. Using several widely used graph frameworks, including Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), we evaluate key graph analytics kernels using two unique memory hierarchies, DDR-based and HBM/MCDRAM. Our results show that the differences in the peak bandwidths of several Pascal-generation GPU memory subsystems aren't reflected in the performance of various analytics. Furthermore, our experiments on CPU and Xeon Phi systems demonstrate that the number of memory channels utilized can be a decisive factor in performance across several different applications. For CPU systems with smaller thread counts, the memory channels can be underutilized while systems with high thread counts can oversaturate the memory subsystem, which leads to limited performance. Finally, we model the potential performance improvements of adding more memory channels with narrower access widths than are found in current platforms, and we analyze performance trade-offs for the two most prominent types of memory accesses found in graph algorithms, streaming and random accesses.
