A framework for FPGA acceleration of large graph problems: Graphlet counting case study

Betkaoui, Brahim; Thomas, David B.; Luk, Wayne; Pržulj, Nataša

doi:10.1109/fpt.2011.6132667

Cited by 42 publications

(21 citation statements)

References 16 publications

(20 reference statements)


“…For a nonexhaustive sample of the extensive body of work in this direction, including theoretical and engineering work, cf. [108,92,111,68,29,64,107,36,3,47,73,97,74,88,18,45,24,91,96,63,93,4,25,90]. Our work differs from these works in that we seek a proof-of-concept implementation for simultaneous delegatability and error-tolerance.…”

Section: Counting and Enumerating Subgraphs (mentioning)

confidence: 88%

Engineering a Delegatable and Error-Tolerant Algorithm for Counting Small Subgraphs

Kaski

2018

2018 Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX)

We study the problem of counting the number of occurrences of a given six-vertex pattern graph S in an n-vertex host graph H. We engineer an open-source GPU implementation of a distributed algorithm design of Björklund and Kaski [PODC 2016] where (i) the execution of the algorithm can be delegated [Goldwasser, Kalai, and Rothblum, J. ACM 2015] to produce a noninteractive probabilistically checkable proof of correctness, and (ii) the execution of the algorithm when preparing the proof tolerates a controllable number of adversarial errors. Experiments with NVIDIA Tesla K80 and Tesla P100 Accelerators demonstrate that the framework is practical for inputs of up to 512 vertices, with proof checking being several orders of magnitude more efficient than preparing the proof; however, proof preparation still carries at least one order of magnitude overhead compared with just solving the problem.

“…For example, [2] elaborates a framework for large graph manipulation in hardware. Graph data, which cannot be partitioned and locally processed, is stored lineally in off-chip memories.…”

Section: Related Work (mentioning)

confidence: 99%

FPGA acceleration of semantic tree reasoning algorithms

Barba, Santofimia, Dondo et al.

2015

Journal of Systems Architecture

“…While the prototypes described in [Canis et al, 2013; Chung et al, 2012; Cong and Xiao, 2013; Ismail and Shannon, 2011; Lysecky and Vahid, 2009; Pilato et al, 2012; Vassiliadis et al, 2004; Willenberg and Chow, 2013] implement the host and the kernels on the same chip (embedded hardwired or soft processor as the host), the implementation of [Benini et al, 2012; Betkaoui et al, 2011; Convey Computer, 2012; Ling et al, 2009; Putnam et al, 2014; Schumacher et al, 2012; Stuecheli, 2013; Voros et al, 2013] uses different chips for the host and the kernels.…”

Section: Communication Infrastructure (mentioning)

confidence: 99%

“…The P2012 architecture [Benini et al, 2012] • Shared memory: Shared memory is used in many commercial hardware accelerator systems for high performance computing. Intel proposes a system using a Front Side Bus (FSB) [Ling et al, 2009] • Crossbar: The research in [Betkaoui et al, 2011] proposed a framework for accelerating large graph problems. The target system includes graph processing elements (GPEs) connected with memory modules through a full crossbar.…”

Section: Communication Infrastructure (mentioning)

confidence: 99%

Hybrid Interconnect Design for Heterogeneous Hardware Accelerators

Pham-Quoc, Heisswolf, Werner et al.

2013

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013

Heterogeneous multicore systems are becoming increasingly important as the need for computation power grows, especially as we enter the big data era. As one of the main trends in heterogeneous multicore, hardware accelerator systems provide application-specific hardware circuits and are thus more energy efficient and have higher performance than general-purpose processors, while still providing a large degree of flexibility. However, system performance does not scale when increasing the number of processing cores, due to the communication overhead, which increases greatly with the number of cores. Although data communication is a primary anticipated bottleneck for system performance, the interconnect design for data communication among the accelerator kernels has not been well addressed in hardware accelerator systems. A simple bus or shared memory is usually used for data communication between the accelerator kernels. In this dissertation, we address the issue of interconnect design for heterogeneous hardware accelerator systems. Evidently, there are dependencies among computations, since data produced by one kernel may be needed by another kernel. Data communication patterns can be specific for each application and could lead to different types of interconnect. In this dissertation, we use detailed data communication profiling to design an optimized hybrid interconnect that provides the most appropriate support for the communication pattern inside an application while keeping the hardware resource usage for the interconnect minimal. Firstly, we propose a heuristic-based approach that takes application data communication profiling into account to design a hardware accelerator system with a custom interconnect. A number of solutions are considered, including crossbar-based shared local memory, direct memory access (DMA) supporting parallel processing, local buffers, and hardware duplication. This approach is mainly useful for embedded systems where the hardware resources are limited. Secondly, we propose an automated hybrid interconnect design using data communication profiling to define an optimized interconnect for accelerator kernels of a generic hardware accelerator system. The hybrid interconnect consists of a network-on-chip (NoC), shared local memory, or both. To minimize hardware resource usage for the hybrid interconnect, we also propose an adaptive mapping algorithm to connect the computing kernels and their local memories to the proposed hybrid interconnect. Thirdly, we propose a hardware accelerator architecture to support streaming image processing. In all presented approaches, we implement the approach using a number of benchmarks on relevant reconfigurable platforms to show their effectiveness. The experimental results show that our approaches not only improve system performance but also reduce overall energy consumption compared to the baseline systems.
