The vector-thread architecture

Krashinsky, Ronny; Batten, Christopher; Hampton, Mark; Gerding, S.; Pharris, B.; Casper, Jared; Asanović, Krste

doi:10.1109/isca.2004.1310763

Cited by 78 publications

(62 citation statements)

References 10 publications

“…To evaluate data-parallel solutions, we used the Hwacha data-parallel accelerator with Rocket as its scalar control processor. The Hwacha data-parallel accelerator integrates ideas from both vector-thread [6,7] and conventional data-parallel processors to achieve high performance and energy efficiency. TFJ was used to generate optimized implementations for Rocket and Hwacha.…”

Section: Rocket-hwacha Vector Processor (mentioning)

confidence: 99%

Measuring the gap between programmable and fixed-function accelerators: A case study on speech recognition

Lee, Sheffield, Waterman, et al. 2013

2013 IEEE Hot Chips 25 Symposium (HCS)

“…In order to scale the number of cores in a CMP above this barrier, and into the numbers of cores proposed for tiled architectures [4,6,19,28,29], it is necessary to resort to scalable (i.e., point-to-point) interconnect types. Such interconnects are suitable not only because their peak bandwidth naturally scales with the number of cores, but also because, due to the short-length wires and low radix, their area overhead is a fixed, independent fraction of the number of cores.…”

Section: Current CMPs and Coherence Mechanisms (mentioning)

confidence: 99%

“…There have been several proposals for tiled CMP architectures [4,6,19,28,29]. Most of these have focused on novel execution paradigms to exploit ILP and DLP in single-threaded applications.…”

Section: Related Work (mentioning)

confidence: 99%

“…Either access latencies have to be significantly stretched or the area required by the interconnects has to be increased to the point of becoming impractical. Tiled CMPs [4,6,19,28,29] have been advocated as a possible alternative. Such systems are built from a relatively large number (≥ 32) of relatively simple cores plus a tightly integrated and lightweight point-to-point interconnect.…”

Section: Introduction (mentioning)

confidence: 99%

An OS-based alternative to full hardware coherence on tiled CMPs

Fensch, Cintra 2008

2008 IEEE 14th International Symposium on High Performance Computer Architecture

The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. This paper proposes a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware-maintained cache coherence. The proposed mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. It allows only some controlled migration and replication of data and provides a sufficient degree of flexibility in the mapping through an extra level of indirection between virtual pages and physical tiles. We evaluate the proposed tiled CMP architecture on the Splash-2 scientific benchmarks and ALPBench multimedia benchmarks against one with private caches and a distributed directory cache coherence mechanism. Experimental results show that the performance degradation is as little as 0%, and 16% on average, compared to the cache coherent architecture across all benchmarks for 16 and 32 processors.

“…Of course, if a longer clock period is employed a smaller number of larger tiles may be used. Such tile-based systems may implement arrays of homogeneous processor/cache tiles [9], [10], finer-grain computing fabrics [14] or networks of heterogeneous IP blocks. Such approaches provide highly reconfigurable platforms for a wide range of performance hungry applications.…”

Section: Introduction (mentioning)

confidence: 99%

The design and implementation of a low-latency on-chip network

Mullins, West, Moore

Asia and South Pacific Conference on Design Automation, 2006.

Abstract: Many of the issues that will be faced by the designers of multi-billion transistor chips may be alleviated by the presence of a flexible global communication infrastructure. In the short term, such a network will provide scalable chip-wide communication and ease the complexity of handling multi-cycle communications. In the long term, the network will become a primary tool for optimising power and data transfers and for scheduling computations. This paper details the design and implementation of a low-latency on-chip network. The network's speculative routers are in the best case able to route flits in a single clock cycle, helping to minimise on-chip communication latencies and maximise the effectiveness of buffering resources. Results from our 180nm test chip demonstrate an inter-router data transfer rate in excess of 16Gbit/s for each link. In the best case each router hop adds just 1 clock cycle to the final communication latency.
