Tinsel: A Manythread Overlay for FPGA Clusters

Naylor, Matthew; Moore, Simon W.; Thomas, David B.

doi:10.1109/fpl.2019.00066

Cited by 21 publications

(24 citation statements)

References 14 publications

“…3) A single-threaded x86 version, running on an Intel i9-7940X PC. This is only intended as a simple baseline; here, we do not compare the performance of our research platform against conventional compute clusters (some such comparisons can be found in a previous paper [5]). For the PR application, we reuse the implementation from the GAP benchmark suite [11].…”

Section: Benchmark Applications and Graphs (mentioning)

confidence: 99%

“…This is the hypothesis of the POETS project (Partial Ordered Event Triggered Systems [4]), which forms the wider context for the work described in this paper. On the project, we have constructed a research platform consisting of a 48-FPGA cluster and a manycore RISC-V overlay called Tinsel [5] programmed on top. This serves both as a rapid prototyping environment for computer architecture research and, for certain applications, a genuine hardware accelerator.…”

Section: Introduction (mentioning)

confidence: 99%

“…This serves both as a rapid prototyping environment for computer architecture research and, for certain applications, a genuine hardware accelerator. For example, in previous work [5] we have shown the potential for significant performance improvements over a standard Xeon cluster for HPC applications written using the vertex-centric programming model popularised by Google's Pregel [3]. Below, we outline the design of the research platform, and its asynchronous message-passing primitives, before presenting our termination-detection extension in the next section.…”

Section: Introduction (mentioning)

confidence: 99%

Termination detection for fine-grained message-passing architectures

Naylor, Moore, Mokhov, et al. 2020

2020 IEEE 31st International Conference on Application-Specific Systems, Architectures and Processors (ASAP)

Self Cite

Barrier primitives provided by standard parallel programming APIs are the primary means by which applications implement global synchronisation. Typically these primitives are fully committed to synchronisation in the sense that, once a barrier is entered, synchronisation is the only way out. For message-passing applications, this raises the question of what happens when a message arrives at a thread that already resides in a barrier. Without a satisfactory answer, barriers do not interact with message-passing in any useful way. In this paper, we propose a new refutable barrier primitive that combines with message-passing to form a simple, expressive, efficient, well-defined API. It has a clear semantics based on termination detection, and supports the development of both globally-synchronous and asynchronous parallel applications. To evaluate the new primitive, we implement it in a prototype large-scale message-passing machine with 49,152 RISC-V threads distributed over 48 FPGAs. We show that hardware support for the primitive leads to a highly efficient implementation, capable of synchronisation rates that are an order of magnitude higher than what is achievable in software. Using the primitive, we implement synchronous and asynchronous versions of a range of applications, observing that each version can have significant advantages over the other, depending on the application. Therefore, a barrier primitive supporting both styles can greatly assist the development of parallel programs.
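
As a rough illustration of the termination-detection semantics this abstract describes, the toy Python model below may help. It is a sequential sketch under stated assumptions, not the paper's hardware primitive or API: the names `run_until_termination` and `ring_handler` are hypothetical, a thread with an empty inbox is treated as waiting in the barrier, an arriving message pulls it back out, and the barrier releases only when no thread has a pending message.

```python
from collections import deque


def run_until_termination(num_threads, initial_messages, handler):
    """Toy model of a refutable barrier (hypothetical, for illustration only).

    A thread whose inbox is empty is considered to be waiting in the barrier.
    A message arriving at such a thread refutes its wait: it handles the
    message and may send more. Termination is detected when every inbox is
    empty, i.e. all threads are idle and no message is pending.
    """
    inboxes = [deque() for _ in range(num_threads)]
    for dest, payload in initial_messages:
        inboxes[dest].append(payload)

    delivered = 0
    while any(inboxes):  # at least one thread still has a pending message
        for tid in range(num_threads):
            if inboxes[tid]:
                payload = inboxes[tid].popleft()
                delivered += 1
                # The handler returns a list of (destination, payload) sends.
                for dest, out in handler(tid, payload):
                    inboxes[dest].append(out)
    return delivered  # all threads idle, nothing pending: barrier releases


NUM_THREADS = 4


def ring_handler(tid, hops_left):
    """Forward a token around a ring until its hop budget is exhausted."""
    if hops_left > 0:
        return [((tid + 1) % NUM_THREADS, hops_left - 1)]
    return []


total = run_until_termination(NUM_THREADS, [(0, 3)], ring_handler)
print(total)
```

Here a single token with three hops left is injected at thread 0; the run ends once the token's hop budget is spent and every inbox is empty. A real implementation would also have to account for messages still in transit on the network, which this single-process sketch sidesteps.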

“…Instead, processors consisting of larger numbers of far simpler cores, communicating by message-passing or PGAS, can achieve more performance from a single chip, and scale more easily to large numbers of chips. This is the premise behind a number of recently developed manycore designs [1,2,3,4,5,6,7].…”

Section: Introduction (mentioning)

confidence: 99%

“…As part of a larger project, we have constructed a research platform consisting of a 48-FPGA cluster and a manycore RISC-V overlay programmed on top [7,9]. As well as providing a reconfigurable compute fabric, FPGAs also support a high degree of scalability due to advanced inter-chip networking capabilities.…”

Section: Introduction (mentioning)

confidence: 99%

General hardware multicasting for fine-grained message-passing architectures

Naylor, Moore, Thomas, et al. 2021

2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

Self Cite

Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model.

Practical Distributed Implementation of Very Large Scale Petri Net Simulations

Rafiev, Morris, Xia, et al. 2022

Transactions on Petri Nets and Other Models of Concurrency XVI

Self Cite

With the continued increase in the size and complexity of contemporary digital systems, there is a growing need for models of large size and high complexity, as well as methods of analyzing such models. This paper presents a method for simulating large-scale concurrent Petri net models using parallel distributed hardware platforms. By using the POETS architecture, our method allows the mapping of concurrent Petri net executions onto 49,152 parallel processing hardware threads to achieve orders-of-magnitude (45 to 220 times) improvements in simulation speed, compared to conventional simulation methods using single-processor systems. The presented method employs techniques including Petri net model partitioning, the use of max-step and locally-interleaving semantics, and the fair firing of transitions.
