Interleaving granularity on high bandwidth memory architecture for CMPs

Cabarcas, Felipe; Rico, Alejandro; Etsion, Yoav; Ramírez, Alex

doi:10.1109/icsamos.2010.5642060

Cited by 5 publications

(7 citation statements)

References 24 publications

(24 reference statements)

“…There are many arbiters belonging to the class of LR servers, such as TDM; Round-Robin and its variants Weighted Round-Robin (WRR) [Katevenis et al 1991] and Deficit Round-Robin (DRR) [Shreedhar and Varghese 1996]; and priority-based arbiters with a rate regulator, such as Credit-Controlled Static Priority (CCSP) [Akesson et al 2008] and Priority Based Scheduler (PBS) [Steine et al 2009]. The LR abstraction enables modeling of many different arbiters and is compatible with a variety of formal analysis frameworks, such as dataflow analysis [Sriram and Bhattacharyya 2000] or network calculus [Cruz 1991]. …”

Section: LR Servers (mentioning)

confidence: 98%

“…Experimental Setup. The experimental setup consists of the optimization problem model implemented in the CPLEX optimization tool [CPLEX 2014]; implementation of our proposed heuristic, the First-fit and Interleave-all algorithms in C++, for a TDM arbiter; and a synthetic use-case generator. For a fair comparison with the heuristic, the First-fit and Interleave-all algorithms are also run with different TDM frame sizes to determine the optimal frame size with the lowest overallocation of rate (considering discretization of rate) and which satisfies the condition that the sum of rates allocated to all requestors in each channel is less than or equal to one.…”

Section: Optimal Heuristic and Existing Mapping Algorithms: Perform… (mentioning)

confidence: 99%
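
The frame-size search described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' C++ implementation: the function names (`slots_needed`, `first_fit`, `best_frame_size`) and the use of exact fractions for rates are assumptions made for the example. A requestor's rate is rounded up to a whole number of TDM slots, First-fit places it on the first channel with enough free slots (so the allocated rates on a channel never sum to more than one), and the frame size with the lowest total overallocation is kept.

```python
import math
from fractions import Fraction


def slots_needed(rate, frame_size):
    """Smallest whole number of TDM slots covering `rate` (a fraction of one channel)."""
    return math.ceil(rate * frame_size)


def first_fit(rates, num_channels, frame_size):
    """Map each requestor to the first channel with enough free slots.

    Returns a list of channel indices, or None if some requestor does not fit.
    Keeping allocated slots <= frame_size per channel is the same as keeping
    the sum of allocated rates on that channel <= 1.
    """
    free = [frame_size] * num_channels
    mapping = []
    for rate in rates:
        need = slots_needed(rate, frame_size)
        for ch in range(num_channels):
            if free[ch] >= need:
                free[ch] -= need
                mapping.append(ch)
                break
        else:
            return None  # no channel can host this requestor
    return mapping


def overallocation(rates, frame_size):
    """Total rate allocated beyond what was requested, due to slot discretization."""
    return sum(Fraction(slots_needed(r, frame_size), frame_size) - r for r in rates)


def best_frame_size(rates, num_channels, candidate_sizes):
    """Try each candidate frame size; keep the feasible one with least overallocation."""
    best = None
    for f in candidate_sizes:
        mapping = first_fit(rates, num_channels, f)
        if mapping is None:
            continue
        over = overallocation(rates, f)
        if best is None or over < best[1]:
            best = (f, over, mapping)
    return best


# Hypothetical use case: four requestors, two channels.
rates = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4), Fraction(1, 2)]
result = best_frame_size(rates, num_channels=2, candidate_sizes=[2, 3, 4, 12])
```

Exact fractions are used so that rounding up to whole slots is not disturbed by floating-point error; a production tool would more likely work in integer bandwidth units.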

“…These requirements must be guaranteed at design time to reduce the verification effort, which is made possible using real-time memory controllers [Paolieri et al 2013; Akesson and Goossens 2011a; Reineke et al 2011; Shah et al 2012; Bayliss and Constantinides 2012; Wu et al 2013; Li et al 2014; Kim et al 2014] that bound the memory access time by employing predictable arbiters, such as Time Division Multiplexing (TDM) and Round-Robin. Moreover, real-time memory controllers can be analyzed using shared resource abstractions, such as the Latency-Rate (LR) server model [Stiliadis and Varma 1998], which can be used in formal performance analysis based on, for example, network calculus [Cruz 1991] or dataflow analysis [Sriram and Bhattacharyya 2000].…”

Section: Introduction (mentioning)

confidence: 99%

A Real-Time Multichannel Memory Controller and Optimal Mapping of Memory Clients to Memory Channels

Gomony

Åkesson

Goossens

2015

ACM Trans. Embed. Comput. Syst.

Ever-increasing demands for main memory bandwidth and memory speed/power tradeoff led to the introduction of memories with multiple memory channels, such as Wide IO DRAM. Efficient utilization of a multichannel memory as a shared resource in multiprocessor real-time systems depends on mapping of the memory clients to the memory channels according to their requirements on latency, bandwidth, communication, and memory capacity. However, there is currently no real-time memory controller for multichannel memories, and there is no methodology to optimally configure multichannel memories in real-time systems. As a first work toward this direction, we present two main contributions in this article: (1) a configurable real-time multichannel memory controller architecture with a novel method for logical-to-physical address translation and (2) two design-time methods to map memory clients to the memory channels, one an optimal algorithm based on an integer programming formulation of the mapping problem, and the other a fast heuristic algorithm. We demonstrate the real-time guarantees on bandwidth and latency provided by our multichannel memory controller architecture by experimental evaluation. Furthermore, we compare the performance of the mapping problem formulation in a solver and the heuristic algorithm against two existing mapping algorithms in terms of computation time and mapping success ratio. We show that an optimal solution can be found in 2 hours using the solver and in less than 1 second with less than 7% mapping failure using the heuristic for realistically sized problems. Finally, we demonstrate configuring a Wide IO DRAM in a high-definition (HD) video and graphics processing system to emphasize the practical applicability and effectiveness of this work. ACM Reference Format: Manil Dev Gomony, Benny Akesson, and Kees Goossens. 2015. A real-time multichannel memory controller and optimal mapping of memory clients to memory channels.

“…In a 64-core CMP, each group of 8 cores has access to main memory via a dedicated memory controller, whereas in a 256-core CMP, each group of 16 cores has a dedicated memory controller. We have considered memory interleaving in our architecture and adapted its specific implementation from prior work [Cabarcas et al 2010]. A node (N) is defined as an entity consisting of 1 and 4 cores for the 64-core and 256-core CMPs, respectively.…”

Section: Ultranoc Architecture and Terminology (mentioning)

confidence: 99%

SWIFTNoC

Chittamuru

Desai

Pasricha

2017

J. Emerg. Technol. Comput. Syst.

On-chip communication is widely considered to be one of the major performance bottlenecks in contemporary chip multiprocessors (CMPs). With recent advances in silicon nanophotonics, photonics-based network-on-chip (NoC) architectures are being considered as a viable solution to support communication in future CMPs as they can enable higher bandwidth and lower power dissipation compared to traditional electrical NoCs. In this article, we present SwiftNoC, a novel reconfigurable silicon-photonic NoC architecture that features improved multicast-enabled channel sharing, as well as dynamic re-prioritization and exchange of bandwidth between clusters of cores running multiple applications, to increase channel utilization and system performance. Experimental results show that SwiftNoC improves throughput by up to 25.4× while reducing latency by up to 72.4% and energy-per-bit by up to 95% over state-of-the-art solutions. CCS Concepts: Networks → Network on chip; Computer systems organization → Multicore architectures; Hardware → Photonic and optical interconnect; Emerging optical and photonic technologies

“…Parallel applications are usually very sensitive to synchronization latency and, therefore, hardware mechanisms are critical for CMPs; Castell is not an exception, as it is shown in Chapter 7. For an architecture with hundreds of cores, the accesses to shared resources can become a bottleneck if the synchronization mechanism is slow.…”

Section: Synchronization Module (mentioning)

confidence: 99%

Castell: a heterogeneous cmp architecture scalable to hundreds of processors

Cabarcas Jaramillo

Technology improvements and power constraints have led multicore architectures to dominate microprocessor designs over uniprocessors. At the same time, accelerator-based architectures have shown that heterogeneous multicores are very efficient and can provide high throughput for parallel applications, but with a high programming effort. We propose Castell, a scalable chip multiprocessor architecture that can be programmed like a uniprocessor and provides the high throughput of accelerator-based architectures. Castell relies on task-based programming models that simplify software development. These models use a runtime system that dynamically finds, schedules, and adds hardware-specific features to parallel tasks. One of these features is DMA transfers to overlap computation and data movement, which is known as double buffering. This feature allows applications on Castell to tolerate large memory latencies and lets us design the memory system focusing on memory bandwidth. In addition to providing programmability and the design of the memory system, we have used a hierarchical NoC and added a synchronization module. The NoC design distributes memory traffic efficiently to allow the architecture to scale. The synchronization module is a consequence of the large performance degradation of applications for large synchronization latencies. Castell is mainly an architecture framework that enables the definition of domain-specific implementations, fine-tuned to a particular problem or application. So far, Castell has been successfully used to propose heterogeneous multicore architectures for scientific kernels, video decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW). It has also been used to explore a number of architecture optimizations such as enhanced DMA controllers, and architecture support for task-based programming models.
