2011
DOI: 10.1145/2019583.2019584

Spatial hardware implementation for sparse graph algorithms in GraphStep

Abstract: How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse graph domain a…
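The abstract describes a compute model in which all parallel synchronization and ordering is implicit in a global step structure. As a rough illustration only (a software sketch under assumed semantics, not the paper's spatial hardware implementation, and with function and parameter names invented here), a GraphStep-style step can be modeled as: every node with pending messages fires a user-supplied update, sends a message along its static out-edges, and a barrier separates one step from the next.

```python
# Minimal software sketch of a GraphStep-style compute model (illustrative,
# not the paper's hardware implementation). Each global "graph step" lets
# every active node update its state from incoming messages and send new
# messages along its out-edges; a barrier separates steps, so the user
# never manages synchronization, ordering, placement, or scheduling.

def graph_step(edges, state, inbox, update):
    """Run one barrier-synchronized graph step.

    edges:  dict node -> list of successor nodes (static, sparse graph)
    state:  dict node -> node state (mutated in place)
    inbox:  dict node -> list of messages received last step
    update: user function (state, messages) -> (new_state, out_message or None)
    Returns the inbox for the next step.
    """
    next_inbox = {n: [] for n in edges}
    for node, msgs in inbox.items():
        if not msgs:
            continue  # node received nothing: inactive this step
        state[node], out = update(state[node], msgs)
        if out is not None:
            for succ in edges[node]:  # fan out along static out-edges
                next_inbox[succ].append(out)
    return next_inbox
```

For example, graph reachability fits this model directly: seed the source node's inbox, and iterate steps until no messages remain.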

Cited by 18 publications (13 citation statements)
References 26 publications
“…Similarly, conventional processors perform poorly on irregular graph operations. The GraphStep architecture showed how to organize active computations around embedded memories in the FPGA to accelerate graph processing [88].…”
Section: E. Integrated Memory (citation type: mentioning; confidence: 99%)
“…The phenomenon here is closely related to the ones explored in DeHon [2015], and we similarly show that it is often better to distribute the data and computation than to centralize it in a single memory. GraphStep [Delorimier et al 2011] provides one concrete model for how applications might be defined to allow this form of parallelism tuning. In the remainder of this section, we explain and model the opposing communication energy effects and show how they give rise to this optimum energy point.…”
Section: Parallelism and Data Movement Energy (citation type: mentioning; confidence: 99%)
“…The application traffic is the complete set of communication messages between nodes during Bellman-Ford shortest path computations mapped onto finite number of NoC PEs. The original computation is based on a Barrier Synchronized Parallel model that divides computation into separate steps [3]. Since the overall run time for the whole computation depends on each step period, the maximum number of cycles to route a single step is an important metric for performance evaluation.…”
Section: A. Experimental Setup (citation type: mentioning; confidence: 99%)
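The quoted statement describes Bellman-Ford mapped onto a barrier-synchronized step structure, where total runtime is the number of steps times the worst-case step period. A minimal sketch of that structure (an assumption for illustration: function names, the message-count proxy for NoC traffic, and the graph encoding are all invented here, not taken from the cited experimental setup) looks like:

```python
# Illustrative sketch of barrier-synchronized Bellman-Ford (assumed, not
# the cited experimental setup). Each step, every node whose distance
# improved last step sends (dist + weight) along its out-edges; a barrier
# applies all updates together. Since runtime is steps x worst-case step
# period, the peak per-step message volume is the metric of interest.

import math

def bellman_ford_bsp(adj, source):
    """adj: dict node -> list of (successor, edge_weight) pairs."""
    dist = {n: math.inf for n in adj}
    dist[source] = 0
    active = {source}
    steps = 0
    max_msgs_per_step = 0
    while active:
        steps += 1
        msgs = []  # messages crossing the network this step
        for u in active:
            for v, w in adj[u]:
                msgs.append((v, dist[u] + w))
        max_msgs_per_step = max(max_msgs_per_step, len(msgs))
        active = set()
        for v, d in msgs:  # barrier: apply all of this step's updates at once
            if d < dist[v]:
                dist[v] = d
                active.add(v)
    return dist, steps, max_msgs_per_step
```

The returned `max_msgs_per_step` plays the role of the per-step routing load whose worst case bounds the step period in the statement above.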
“…multiprocessors [1], CoRAM [2], sparse graph processing [3], dynamic reconfigurable accelerators [4]). Natively, today's FPGAs provide high dedicated bandwidth with configured interconnect, but only modest dynamically shared bandwidth with hardwired buses [5].…”
Section: Introduction (citation type: mentioning; confidence: 99%)