2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2017.116

Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging

Cited by 15 publications (6 citation statements); references 26 publications.

Citation statements (ordered by relevance):
“…The latter condition is to ensure long-term fairness, which avoids starvation of threads from other nodes. Before the server becomes a normal thread, it first releases the global and local locks (lines 30–31). If the MCS queue has other waiting server threads, the current server thread hands over ownership of the global lock to the very next server thread, which then proceeds to handle its local requests.…”
Section: Implementation Details
Citation type: mentioning (confidence: 99%)
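The handoff described in this excerpt builds on the classic MCS queue lock, where a releasing thread passes ownership directly to its queued successor. The sketch below is a minimal, generic MCS lock in C11 atomics, included only to illustrate that successor handoff; it is not the pLock global/local hierarchy or the server-thread logic of the cited work, and all names are illustrative.

```c
/* Minimal MCS queue lock (C11 atomics). Release hands ownership directly
 * to the very next node in the queue, as in the handoff the excerpt
 * describes. Generic sketch only, not the pLock hierarchy. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;               /* true while this waiter must spin */
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;       /* last node in the queue, NULL if free */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    /* enqueue ourselves at the tail */
    mcs_node_t *pred = atomic_exchange(&lock->tail, me);
    if (pred != NULL) {
        /* a predecessor exists: link in and spin on our own flag only */
        atomic_store(&pred->next, me);
        while (atomic_load(&me->locked))
            ;
    }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *succ = atomic_load(&me->next);
    if (succ == NULL) {
        /* no visible successor: try to reset the queue to empty */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
            return;                   /* queue drained, lock is free */
        /* a successor is enqueueing; wait until it links itself */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    /* hand ownership directly to the very next waiter in the queue */
    atomic_store(&succ->locked, false);
}
```

Because each waiter spins only on its own node's flag, the handoff touches a single flag owned by the successor, which is what makes direct queue handoff attractive on shared-memory multicores.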
“…MCSTP [19], MCSCR [1], and CST [23] address the preemption issue by employing a sleep-and-wakeup approach. pLock [30] is a variant of an explicit inter-core message passing (EMP)-based lock [31] augmented with chaining and hierarchical features. The basic concept of an EMP-based lock is to use a dedicated core as a server, and the other cores as clients that request the lock from the server.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
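To make the client–server idea concrete, below is a minimal software sketch of an EMP-style lock, with all names assumed for illustration: a dedicated server thread owns the lock state, clients send it ACQUIRE/RELEASE messages, and each client spins on a private grant flag until the server replies. The hardware per-core receive queues are emulated with a mutex-protected list, so this shows the protocol, not the hardware mechanism or the actual pLock implementation.

```c
/* Software emulation of an EMP-style client-server lock (illustrative). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum msg_type { MSG_ACQUIRE, MSG_RELEASE };

typedef struct request {
    enum msg_type   type;
    atomic_bool    *grant;            /* client spins on this until granted */
    struct request *next;
} request_t;

/* emulated server receive queue (stand-in for the hardware queue) */
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cv  = PTHREAD_COND_INITIALIZER;
static request_t *q_head, *q_tail;

static void send_to_server(request_t *r) {
    pthread_mutex_lock(&q_mtx);
    r->next = NULL;
    if (q_tail) q_tail->next = r; else q_head = r;
    q_tail = r;
    pthread_cond_signal(&q_cv);
    pthread_mutex_unlock(&q_mtx);
}

static request_t *server_recv(void) {
    pthread_mutex_lock(&q_mtx);
    while (!q_head) pthread_cond_wait(&q_cv, &q_mtx);
    request_t *r = q_head;
    q_head = r->next;
    if (!q_head) q_tail = NULL;
    pthread_mutex_unlock(&q_mtx);
    return r;
}

/* client side: one request struct and grant flag per caller */
void emp_lock(request_t *req, atomic_bool *grant) {
    atomic_store(grant, false);
    req->type  = MSG_ACQUIRE;
    req->grant = grant;
    send_to_server(req);
    while (!atomic_load(grant))       /* wait for the server's grant */
        ;
}

void emp_unlock(request_t *req) {
    req->type = MSG_RELEASE;
    send_to_server(req);
}

/* server loop: grants the lock in FIFO order, parking requests while held */
void *lock_server(void *arg) {
    (void)arg;
    bool held = false;
    request_t *wait_head = NULL, *wait_tail = NULL;
    for (;;) {
        request_t *r = server_recv();
        if (r->type == MSG_ACQUIRE) {
            if (!held) { held = true; atomic_store(r->grant, true); }
            else {                    /* lock busy: park the requester */
                r->next = NULL;
                if (wait_tail) wait_tail->next = r; else wait_head = r;
                wait_tail = r;
            }
        } else if (wait_head) {       /* RELEASE: hand off to next waiter */
            request_t *n = wait_head;
            wait_head = n->next;
            if (!wait_head) wait_tail = NULL;
            atomic_store(n->grant, true);
        } else {
            held = false;             /* RELEASE with no waiters */
        }
    }
    return NULL;
}
```

The point of the design is that all contention is serialized at the server, so clients never bounce a shared lock word between caches; they only exchange short request/grant messages with the server core.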
“…The architecture requires a receive queue per core to support the proposed protocol, as seen in Figure 1. The size of each core's receive queue is determined empirically by conducting a study similar to the one presented in Dogan et al. (2017). All workloads are run, and a counter in the simulator records the maximum utilization of the receive queues at any given time for each workload.…”
Section: Explicit Messaging Hardware Overhead
Citation type: mentioning (confidence: 99%)
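A sketch of how such an occupancy counter might look in a simulator is shown below: each core's receive queue keeps a current depth and a high-water mark, and the largest mark observed across all cores and workloads gives the empirically chosen queue size. Structure and function names are illustrative, not taken from the cited simulator.

```c
/* Track the high-water mark of a simulated receive queue. */
#include <stddef.h>

typedef struct {
    size_t depth;       /* current number of queued messages */
    size_t max_depth;   /* high-water mark observed so far */
} rq_stats_t;

static inline void rq_on_enqueue(rq_stats_t *s) {
    if (++s->depth > s->max_depth)
        s->max_depth = s->depth;      /* new maximum utilization */
}

static inline void rq_on_dequeue(rq_stats_t *s) {
    if (s->depth > 0)
        s->depth--;
}

/* After running all workloads, the largest max_depth over all cores and
 * workloads is taken as the receive-queue size to provision in hardware. */
```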
“…Ham et al. [23] proposed the domain-specific Graphicionado, which exploits data-structure-centric datapath specialization and memory-subsystem specialization. Dogan et al. [22] proposed a shared-memory multi-core architecture. By introducing hardware-level messaging instructions into the ISA, this design can accelerate synchronization primitives and move computation toward data more efficiently.…”
Section: Graph Acceleration Architecture
Citation type: mentioning (confidence: 99%)
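As an illustration of "moving computation toward data" with explicit messages, the single-threaded sketch below sends a short update message to the core that owns a vertex instead of performing a remote read-modify-write; the owning core then applies the update locally. Per-core queues are emulated with plain arrays, and the names and partitioning scheme are assumptions for the example, not the instructions described in the cited paper.

```c
/* Emulated "move computation to data": updates travel as messages to the
 * owning core, which applies them locally. Illustrative sketch only. */
#include <assert.h>
#include <stdio.h>

#define NUM_CORES    4
#define NUM_VERTICES 16
#define QUEUE_CAP    64

typedef struct { int vertex; int delta; } update_msg_t;

static update_msg_t queues[NUM_CORES][QUEUE_CAP];
static int queue_len[NUM_CORES];
static int vertex_value[NUM_VERTICES];   /* vertex data, partitioned by owner */

static int owner_of(int v) { return v % NUM_CORES; }

/* sender side: emit an explicit message instead of touching remote data */
static void send_update(int vertex, int delta) {
    int core = owner_of(vertex);
    assert(queue_len[core] < QUEUE_CAP);  /* fixed-size queue in this sketch */
    queues[core][queue_len[core]++] = (update_msg_t){ vertex, delta };
}

/* receiver side: the owning core drains its queue and updates locally,
 * so no remote read-modify-write or per-vertex lock is needed */
static void drain_queue(int core) {
    for (int i = 0; i < queue_len[core]; i++)
        vertex_value[queues[core][i].vertex] += queues[core][i].delta;
    queue_len[core] = 0;
}

int main(void) {
    send_update(5, 3);                    /* e.g., relax two edges into vertex 5 */
    send_update(5, 2);
    send_update(6, 1);
    for (int c = 0; c < NUM_CORES; c++)
        drain_queue(c);
    printf("vertex 5 = %d, vertex 6 = %d\n", vertex_value[5], vertex_value[6]);
    return 0;
}
```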
“…These hardware-level works need external devices for acceleration; thus, their overhead is larger than that of CGAcc, because CGAcc is deployed in the HMC, and the HMC can be treated as the main memory of a computer system. Some software-level works optimized graph processing by enriching the instruction set architecture [22] or customizing the compiler [23]. Software-level works cannot make full use of the hardware.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)