NoCMsg: Scalable NoC-Based Message Passing

Zimmer, Christopher; Mueller, Frank

doi:10.1109/ccgrid.2014.19

Cited by 7 publications

(5 citation statements)

References 15 publications

“…Our study is based on native shared memory at the DRAM level that does not require load/stores to be rewritten, nor is it restricted to 8KB shared data. We recently ported NoCMsg to the Intel SCC [Patil 2014] and found that our flow control elimination in conjunction with contention-free communication patterns provide similar performance improvements on the Intel SCC compared to the Tilera [Zimmer and Mueller 2014;Yagna 2013].…”

Section: Related Work (mentioning)

confidence: 66%

NoCMsg

Zimmer

Mueller

2015

ACM Trans. Archit. Code Optim.

Self Cite

The number of cores of contemporary processors is constantly increasing and thus continues to deliver ever higher peak performance (following Moore's transistor law). Yet high core counts present a challenge to hardware and software alike. Following this trend, the network-on-chip (NoC) topology has changed from buses over rings and fully connected meshes to 2D meshes. This work contributes NoCMsg, a low-level message-passing abstraction over NoCs, which is specifically designed for large core counts in 2D meshes. NoCMsg ensures deadlock-free messaging for wormhole Manhattan-path routing over the NoC via a polling-based message abstraction and non-flow-controlled communication for selective communication patterns. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times by up to 86% for single packet messages and up to 40% for larger messages compared to other NoC-based message approaches. On the TilePro platform, NoCMsg outperforms shared memory abstractions by up to 93% as core counts and interprocess communication increase. Results for fully pipelined double-precision numerical codes show speedups of up to 64% for message passing over shared memory at 32 cores. Overall, we observe that shared memory scales up to about 16 cores on this platform, whereas message passing performs well beyond that threshold. These results generalize to similar NoC-based platforms. ACM Reference Format: Christopher Zimmer and Frank Mueller. 2015. NoCMsg: A scalable message-passing abstraction for network-on-chips. ACM Trans.
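
The abstract refers to Manhattan-path routing on a 2D mesh. The following is a minimal sketch of what such a path looks like, assuming dimension-ordered routing (X first, then Y) and row-major tile numbering. The function names `tile_to_xy`, `xy_route`, and `hops`, the routing order, and the 8x8 mesh width used in the example are illustrative assumptions; none of this is NoCMsg's or the hardware's actual interface.

```python
def tile_to_xy(tile, width):
    """Map a row-major tile id to (x, y) mesh coordinates (assumed numbering)."""
    return tile % width, tile // width


def xy_route(src, dst, width):
    """Return the tile ids visited from src to dst, moving along X first, then Y.

    This is an assumed dimension-ordered Manhattan path, shown only to
    illustrate the idea; it does not model wormhole flits or flow control.
    """
    x, y = tile_to_xy(src, width)
    dst_x, dst_y = tile_to_xy(dst, width)
    path = [src]
    while x != dst_x:          # travel horizontally until the column matches
        x += 1 if dst_x > x else -1
        path.append(y * width + x)
    while y != dst_y:          # then travel vertically until the row matches
        y += 1 if dst_y > y else -1
        path.append(y * width + x)
    return path


def hops(src, dst, width):
    """Number of links traversed, equal to the Manhattan distance."""
    return len(xy_route(src, dst, width)) - 1


if __name__ == "__main__":
    WIDTH = 8  # assumed 8x8 mesh for illustration
    print(xy_route(0, 10, WIDTH))
    print(hops(63, 0, WIDTH))
```

Because every message in this sketch turns at most once (from the X dimension to the Y dimension), the hop count between two tiles is simply the sum of their column and row distances.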

“…The User Dynamic Network (UDN) interconnect is the only one available for user-generated messages. We use the services of the NoCMsg [33] library. NoCMsg provides a deadlock free, scalable and efficient low-level message passing layer over UDN with an MPI like interface.…”

Section: Methods (mentioning)

confidence: 99%

“…With SSI, resources are aggregated to present a single view of the operating system environment while data access and communication are realized via shared memory over traditional bidirectional buses. This approach delivers some performance increases in the natural evolution from single core up to 16 cores, but it deteriorates rapidly when the number of cores increases further [33].…”

Section: Scalability Challenges Of Large-scale Manycores (mentioning)

confidence: 99%

“…Cache misses and coherence updates are also extremely expensive in SSI approaches on multicores since each core has its own cache that must be coherent with shared memory, as well as with other cores. Recent work by Baumann et al [6], Wentzlaff et al [29] and Zimmer et al [33] show that coherent shared memory may not scale well to large core counts. They instead promote the usage of scalable message passing for OS communication in large-scale manycores.…”

Section: Scalability Challenges Of Large-scale Manycores (mentioning)

confidence: 99%

“…We follow the core of the design principles postulated by Peter et al [24] for designing multi-core schedulers. We even go one level further and take a purely distributed message passing approach as the primary means of communication enabled by our adoption of NoCMsg [33] as our low-level messaging library.…”

Section: Related Work (mentioning)

confidence: 99%

Distributed Job Allocation for Large-Scale Manycores

Ramachandran

Mueller

2016

Lecture Notes in Computer Science

Self Cite

As today's manycore processors already feature over 64 cores and as tomorrow's are slated to contain 1000s, it is important to design operating system techniques that can efficiently cope with this scale of resource coordination. The current state-of-the-art in manycore processor architectures has evolved from traditional bus-based architectures over rings to mesh-based Network-on-Chip (NoC) interconnects. This implies an increasing potential for scalable message passing. However, contemporary operating systems heavily rely on single system images with shared memory constructs that may not scale well to large core counts. To address these challenges, we devise a distributed message passing only system comprised of so-called "pico-kernels" per core. They are controlled by dedicated "micro-kernels" topologically centered within a set of cores that cooperatively comprise the overall operating system in a peer-to-peer fashion. Such a system promotes rethinking and redesigning of various operating system services focusing on scalability as the primary design constraint. We consider the challenges of distributed allocation of jobs, each comprised of a set of tasks to be mapped to disjoint cores. A naive solution performing fragmented allocations may quickly escalate to deadlocks, where jobs hold and wait for cores in circular dependencies. To tackle these challenges, we propose a deadlock free distributed job allocation protocol. We have devised two policies for avoiding deadlocks, namely active cancellation and sequencer-based atomic broadcast. The protocol and the two policies have been implemented and evaluated on a Tilera TilePro64 processor with 64 cores on a single socket. Results show that for sparse job allocations active cancellation provides less job allocation overhead while for denser job allocations the sequencer-based atomic broadcast provides less overhead.
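
The abstract describes jobs that hold some cores while waiting for others, and names a sequencer-based policy as one way to avoid the resulting deadlock. The sketch below is a toy illustration of the general idea behind sequencer ordering, not the paper's protocol: a hypothetical `Sequencer` stamps each request with a global sequence number, and requests are then applied in that single order with all-or-nothing grants, so no job ever holds a partial allocation. All names (`Sequencer`, `stamp`, `allocate_in_order`) and the reject-on-conflict behavior are assumptions made for illustration.

```python
from itertools import count


class Sequencer:
    """Hypothetical central sequencer that assigns a global total order."""

    def __init__(self):
        self._next = count()

    def stamp(self, request):
        # Pair the request with the next global sequence number.
        return (next(self._next), request)


def allocate_in_order(stamped_requests, num_cores):
    """Apply requests in sequence order with all-or-nothing grants.

    Each request is (job_name, cores). A job either receives every core it
    asked for or none, so it never holds some cores while waiting for others.
    Rejecting a conflicting job is an assumption of this sketch.
    """
    owner = [None] * num_cores
    granted, rejected = [], []
    for _, (job, cores) in sorted(stamped_requests, key=lambda s: s[0]):
        if all(owner[c] is None for c in cores):
            for c in cores:
                owner[c] = job
            granted.append(job)
        else:
            rejected.append(job)
    return granted, rejected, owner


if __name__ == "__main__":
    seq = Sequencer()
    requests = [
        seq.stamp(("A", (0, 1))),
        seq.stamp(("B", (1, 0))),  # conflicts with A on both cores
        seq.stamp(("C", (2, 3))),
    ]
    # Arrival order may differ per core; the sequence numbers restore one order.
    print(allocate_in_order(list(reversed(requests)), num_cores=4))
```

In the naive fragmented case, job A could take core 0 while job B takes core 1, leaving each waiting on the other; applying whole requests in one agreed order removes that circular wait in this toy model.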
