Achieving predictable performance through better memory controller placement in many-core CMPs

Abts, Dennis; Jerger, Natalie Enright; Kim, John; Gibson, Dan; Lipasti, Mikko H.

doi:10.1145/1555815.1555810

Cited by 63 publications

(89 citation statements)

References 24 publications


“…Figure 10 shows the latency results for each broadcasting scheme when using a different core to initiate the broadcast. Physical placement in the network can have an impact on both latency and congestion [11]; a broadcasting core can produce a hot-spot in the network. Therefore, it is interesting to evaluate the impact of source placement.…”

Section: Discussion (mentioning)

confidence: 99%

Performance analysis of broadcasting algorithms on the Intel Single-Chip Cloud Computer

Matienzo

Jerger

2013

2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)


Abstract: Efficient broadcasting is essential for good performance on distributed or multiprocessor systems. Broadcasts are commonly used to implement message passing synchronization primitives, such as barriers, and also appear frequently in the set up stage of scientific applications. The Intel Single-Chip Cloud Computer (SCC), an experimental processor, uses synchronous message passing to facilitate communication between its 48 cores. RCCE, the SCC's message passing library, implements broadcasting in a traditional way: sending n−1 unicast messages, where n is the number of cores participating in the broadcast. This implementation can hinder performance as the number of cores participating in the broadcast increases and if the data being sent to each core is large. Also in the RCCE implementation, the broadcasting core is blocked from doing any useful work until all cores receive the broadcast. This paper explores several broadcasting schemes that take advantage of the resources of the SCC and the RCCE library. For example, we explore a scheme that propagates a broadcast to multiple cores in parallel and a scheme that parallelizes off-chip memory accesses which would otherwise need to be done sequentially. Our best broadcast scheme achieves a 35× speedup over the RCCE implementation. We also demonstrate that our improved broadcasting substantially reduces the time spent on communication in some benchmarks. While the broadcast schemes presented in this paper are implemented specifically for the SCC, they provide insight into the more general problem of broadcast communication and could be adapted to other types of distributed and multiprocessor systems.
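The abstract above contrasts a root that sends n−1 unicast messages one after another with a scheme that lets cores forward the broadcast in parallel. The sketch below only counts communication steps for the two approaches. It assumes a binomial-tree schedule as a stand-in for the parallel scheme; the abstract does not say which schedule the authors use, and the function names are hypothetical. It does not model the SCC, RCCE, or message sizes, so it says nothing about the reported 35× speedup.

```python
def unicast_broadcast_steps(n: int) -> int:
    """Number of sequential sends when the root unicasts to every other core."""
    return max(n - 1, 0)


def tree_broadcast_schedule(n: int) -> list[list[tuple[int, int]]]:
    """Binomial-tree schedule (an assumed illustration, not the paper's scheme).

    In each round, every core that already holds the data forwards it to one
    core that does not. Returns one list of (sender, receiver) pairs per round.
    """
    rounds = []
    have = 1  # cores 0..have-1 hold the data; core 0 is the root
    while have < n:
        sends = [(src, src + have) for src in range(have) if src + have < n]
        rounds.append(sends)
        have += len(sends)
    return rounds


if __name__ == "__main__":
    n = 48  # core count of the SCC, taken from the abstract
    schedule = tree_broadcast_schedule(n)
    print("sequential unicast steps:", unicast_broadcast_steps(n))
    print("tree rounds:", len(schedule))
```

Under these assumptions the root is also freed after its last send in the tree schedule rather than after every core has been served individually.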


“…With a simple location-based mapping, most processors access only one memory controller. The pin pressure on current large-scale chips make more than four memory controllers a challenge to support [2,59]. A 2D mesh on-chip network is used with dimension-order routing (DOR) and four-stage input-buffered routers [17].…”

Section: Methods (mentioning)

confidence: 99%
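The statement above mentions dimension-order routing (DOR) on a 2D mesh. As a minimal sketch of that idea, the function below routes a packet fully along one dimension before turning into the other. Resolving X before Y and the `(x, y)` coordinate convention are assumptions made for illustration; the quoted text does not specify either, and `xy_route` is a hypothetical name.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the hop-by-hop path from src to dst under X-then-Y routing."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Travel along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Then travel along the Y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because every packet between the same pair of nodes takes the same path, where the memory controllers sit on the mesh determines which links carry their traffic.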

“…Memory bandwidth is not scaling rapidly enough to satisfy the increasing number of processors, making the performance of a wide variety of applications constrained by memory bandwidth [70,66,59,12,32,18,19,28,30,2,55]. In fact, current projections state that chip pins increase by 10% every year whereas on-chip processors double every 18 months [59].…”

Section: Introduction (mentioning)

confidence: 99%
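The projection quoted above (pins growing 10% per year, on-chip processors doubling every 18 months) can be turned into a rough pins-per-processor trend. The sketch below simply compounds the two quoted rates from a common starting point; the function name and the normalization to 1.0 at year zero are assumptions for illustration.

```python
def pins_per_core_ratio(years: float,
                        pin_growth: float = 0.10,
                        core_doubling_years: float = 1.5) -> float:
    """Pins per processor after `years`, relative to 1.0 at year zero."""
    pins = (1 + pin_growth) ** years            # 10% compound growth per year
    cores = 2 ** (years / core_doubling_years)  # doubling every 18 months
    return pins / cores


if __name__ == "__main__":
    for t in (0, 3, 6):
        print(t, round(pins_per_core_ratio(t), 4))
```

With these rates, three years yields 1.331 times the pins but 4 times the processors, so each processor has roughly a third of the pin share it started with.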

Collective Memory Transfers for Multi-Core Chips

Williams

Shalf

2013


“…Kilo-NOC [8] uses also two kinds of routers, QoS-enabled and not QoS-enabled, to provide low cost, scalable and energy-efficient QoS guarantees in a network. Prior works have also investigated co-designing the NoC with caches [9] and memory controllers [10]. In particular, work in [9] examined heterogeneous wires with varying width, latency and energy, and proposed mapping coherence messages with differing latency and bandwidth characteristics onto the different wires.…”

Section: International Conference On Computer Science and Service Sys (mentioning)

confidence: 99%

A case of area- and energy-efficient heterogeneous mesh network-on-chip

Yan

Lai

Lin

2014

Proceedings of the 3rd International Conference on Computer Science and Service System


Abstract: In this paper, based on observations from performance analysis of mesh networks, we propose an area- and energy-efficient heterogeneous mesh network that redistributes and reconfigures scarce network resources (buffers, links, and ports) to enhance network performance. Experimental results show that the proposed network can improve saturation by up to 16.7% and network latency by up to 35% while reducing router area by about 31.7%. Experimental results also show that diagonal links are an efficient design for the mesh network topology.
