MiSAR

Liang, Ching-Kai; Prvulovic, Milos

doi:10.1145/2749469.2750396

Cited by 13 publications

(7 citation statements)

References 24 publications

“…Most proposals focus on using dedicated hardware (accelerators, networks, new instructions) to speed-up synchronization primitives [1,6,25,41,47,55,61,62,69]. However, these proposals incur large area overheads.…”

Section: Related Work (mentioning)

confidence: 99%

DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations

Soria-Pardos, Armejach, Mück, et al. 2023

Proceedings of the 50th Annual International Symposium on Computer Architecture

With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near.

“…However, these approaches require support for Floating Operations in the cache hierarchy. Academia and industry have proposed a plethora of alternatives to near and far AMOs, introducing new synchronization instructions [3,6,47,55,61,62]. However, most ISAs are reticent to include instructions.…”

Section: Related Work (mentioning)

confidence: 99%

DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations

Soria-Pardos, Armejach, Mück, et al. 2023

Proceedings of the 50th Annual International Symposium on Computer Architecture

“…ElTantawy, et al [13] propose a hardware warp scheduling policy that reduces lock retries by de-prioritizing warps whose threads are spin waiting. In addition, hardware accelerated locks have also been proposed for CPUs [4,25,42,47].…”

Section: GPU Solutions (mentioning)

confidence: 99%

Fast Fine-Grained Global Synchronization on GPUs

Wang, Fussell, Lin 2019

Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Syst

This paper extends the reach of General Purpose GPU programming by presenting a software architecture that supports efficient fine-grained synchronization over global memory. The key idea is to transform global synchronization into global communication so that conflicts are serialized at the thread block level. With this structure, the threads within each thread block can synchronize using low-latency, high-bandwidth local scratchpad memory. To enable this architecture, we implement a scalable and efficient message passing library. Using Nvidia GTX 1080 Ti GPUs, we evaluate our new software architecture by using it to solve a set of five irregular problems on a variety of workloads. We find that on average, our solutions improve performance over carefully tuned state-of-the-art solutions by 3.6×.

CCS Concepts: Computer systems organization → Single instruction, multiple data; Software and its engineering → Mutual exclusion; Message passing.

“…synchronization, become expensive by default and can degrade performance of benchmarks that use it. Liang et al have demonstrated that poorly implementing synchronization can reduce performance by approximately 40% in widely used benchmarks despite representing a small fraction of the code [22]. In message passing systems, algorithms that make frequent use of collective communication patterns will see their performance significantly reduced when scaled.…”

Section: Impact of Multicast Performance on the Manycore Architecture (mentioning)

confidence: 99%

“…Some machines have provided advanced hardware support, such as the barrier network in Cray T3D [252], the collectives network in Blue Gene/L [253], and the fetch-and-Φ operations in the SGI Origin [254]. Also, there are multiple research proposals for advanced hardware support for synchronization (e.g., [22,60,62,255,256,257]). Yet as technology scaling delivers larger and larger manycore chips, these patterns are expected to remain costly to support within the chip.…”

Section: Wisync: An Architecture for Fast On-chip Synchronization (mentioning)

confidence: 99%

Broadcast-oriented wireless network-on-chip: fundamentals and feasibility

Abadal Cavallé

Recent years have seen the emergence and ubiquitous adoption of Chip Multiprocessors (CMPs), which rely on the coordinated operation of multiple execution units or cores. Successive CMP generations integrate a larger number of cores seeking higher performance with a reasonable cost envelope. For this trend to continue, however, important scalability issues need to be solved at different levels of design. Scaling the interconnect fabric is a grand challenge by itself, as new Network-on-Chip (NoC) proposals need to overcome the performance hurdles found when dealing with the increasingly variable and heterogeneous communication demands of manycore processors. Fast and flexible NoC solutions are needed to prevent communication from becoming a performance bottleneck, a situation that would severely limit the design space at the architectural level and eventually lead to the use of software frameworks that are slow, inefficient, or less programmable. The emergence of novel interconnect technologies has opened the door to a plethora of new NoCs promising greater scalability and architectural flexibility. In particular, wireless on-chip communication has garnered considerable attention due to its inherent broadcast capabilities, low latency, and system-level simplicity. Most of the resulting Wireless Network-on-Chip (WNoC) proposals have set the focus on leveraging the latency advantage of this paradigm by creating multiple wireless channels to interconnect far-apart cores. This strategy is effective as the complement of wired NoCs at moderate scales, but is likely to be overshadowed at larger scales by technologies such as nanophotonics unless bandwidth is unrealistically improved. This dissertation presents the concept of Broadcast-Oriented Wireless Network-on-Chip (BoWNoC), a new approach that attempts to foster the inherent simplicity, flexibility, and broadcast capabilities of the wireless technology by integrating one on-chip antenna and transceiver per processor core.
This paradigm is part of a broader hybrid vision where the BoWNoC serves latency-critical and broadcast traffic, tightly coupled to a wired plane oriented to large flows of data. By virtue of its scalable broadcast support, BoWNoC may become the key enabler of a wealth of unconventional hardware architectures and algorithmic approaches, eventually leading to a significant improvement of the performance, energy efficiency, scalability and programmability of manycore chips. The present work aims not only to lay the fundamentals of the BoWNoC paradigm, but also to demonstrate its viability from the electronic implementation, network design, and multiprocessor architecture perspectives. An exploration at the physical level of design validates the feasibility of the approach at millimeter-wave bands in the short term, and then suggests the use of graphene-based antennas in the terahertz band in the long term. At the link level, this thesis provides an insightful context analysis that is used, afterwards, to drive the design of a lightweight protocol that reliably serves broadcast traffic with substantial latency improvements over state-of-the-art NoCs. At the network level, our hybrid vision is evaluated putting emphasis on the flexibility provided at the network interface level, showing outstanding speedups for a wide set of traffic patterns. At the architecture level, the potential impact of the BoWNoC paradigm on the design of manycore chips is not only qualitatively discussed in general, but also quantitatively assessed in a particular architecture for fast synchronization. Results demonstrate that the impact of BoWNoC can go beyond simply improving the network performance, thereby representing a possible game changer in the manycore era.
