Architectural Support for the Stream Execution Model on General-Purpose Processors

Proceedings of the 7th ACM International Conference on Computing Frontiers

Zampetakis

et al. 2010

Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds -the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a mechanism for software configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, memory barriers for explicitly-selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated the on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.

Section: Related Work and Contributionsmentioning

confidence: 99%

On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Kavadias

Proceedings of the 7th ACM International Conference on Computing Frontiers

Zampetakis

et al. 2010

“…Most pertinent to this work is Streamware [14] and the related architectural support [15] which target stream processing. They propose a software programmable Stream-LoadStore (SLS) hardware unit, that resembles EBP, and is used by the runtime software.…”

Section: Related Workmentioning

confidence: 99%

Prefetching and cache management using task lifetimes

Papaefstathiou

Proceedings of the 27th International ACM Conference on International Conference on Supercomputing

Nikolopoulos

et al. 2013

Task-based dataflow programming models and runtimes emerge as promising candidates for programming multicore and manycore architectures. These programming models analyze dynamically task dependencies at runtime and schedule independent tasks concurrently to the processing elements. In such models, cache locality, which is critical for performance, becomes more challenging in the presence of finegrain tasks, and in architectures with many simple cores.This paper presents a combined hardware-software approach to improve cache locality and offer better performance is terms of execution time and energy in the memory system. We propose the explicit bulk prefetcher (EBP) and epoch-based cache management (ECM) to help runtimes prefetch task data and guide the replacement decisions in caches. The runtime software can use this hardware support to expose its internal knowledge about the tasks to the architecture and achieve more efficient task-based execution. Our combined scheme outperforms HW-only prefetchers and state-of-the-art replacement policies, improves performance by an average of 17%, generates on average 26% fewer L2 misses, and consumes on average 28% less energy in the components of the memory system.

“…Streaming hardware support, for general purpose systems using caches, was considered in [4,9,10,27]. In [10] cache control bits are used for best-effort avoidance of replacements and scatter-gather enhancements of the L2 controller are proposed for a uniprocessor system.…”

Section: Related Work and Contributionsmentioning

confidence: 99%

“…In [10] cache control bits are used for best-effort avoidance of replacements and scatter-gather enhancements of the L2 controller are proposed for a uniprocessor system. Streamware [9] exploits the compiler to avoid replacements of streaming data mapped to processor caches, for codes amenable to stream processing.…”

Section: Related Work and Contributionsmentioning

confidence: 99%

Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs

Kavadias

Zampetakis

et al. 2011

Int J Parallel Prog

Per-core scratchpad memories (or local stores) allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces, appropriate for scalable multicores, that combine the best of two worlds -the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized network interface (NI) functions. This paper presents our architecture, which provides local and remote scratchpad access, to either individual words or multiword blocks through RDMA copy. Furthermore, we introduce event responses, as a technique that enables software configurable communication and synchronization primitives. We present three event response mechanisms that expose NI functionality to software, for multiword transfer initiation, completion notifications for software selected sets of arbitrary size transfers, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and measure the logic overhead over a cache-only design for basic NI functionality to be less than 20%. We also evaluate the on-chip communication performance on the prototype, as well as the 123Int J Parallel Prog performance of synchronization functions with simulation of CMPs with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks, and efficient exploitation of the hardware bandwidth.