Overlays have shown significant promise for field-programmable gate arrays (FPGAs), as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden, resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain and show how to maintain its full programmability without paying the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted at deep neural network inference with only ~1% overhead to support the control and reprogramming logic, using a lightweight very long instruction word (VLIW) network. Additionally, we implement a sophisticated domain-specific graph compiler that compiles deep learning frameworks such as Caffe or TensorFlow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost the performance of both convolutional and recurrent neural networks (CNNs/RNNs): we demonstrate a 3× improvement on ResNet-101 and a 12× improvement for long short-term memory (LSTM) cells compared to naïve implementations. Finally, we describe how we can tailor our hardware overlay and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150, the fastest ever reported on comparable FPGAs.
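As a rough illustration of the kind of architecture-driven lowering described above, the following Python sketch fuses adjacent operations in a toy graph and emits one wide instruction word per fused node. All names (Node, fuse_conv_relu, emit_vliw) and the encoding are hypothetical stand-ins, not the paper's actual compiler API:

# Minimal sketch of graph lowering for a VLIW-style overlay.
# The op fusion and instruction encoding are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Node:
    op: str        # e.g. "conv", "relu", "pool"
    inputs: list   # predecessor node indices

def fuse_conv_relu(graph):
    """Architecture-driven optimization: fuse conv+relu pairs so the
    overlay executes them in one pass, avoiding extra buffer traffic."""
    fused, skip = [], set()
    for i, n in enumerate(graph):
        if n.op == "conv" and i + 1 < len(graph) and graph[i + 1].op == "relu":
            fused.append(Node("conv_relu", n.inputs))
            skip.add(i + 1)
        elif i not in skip:
            fused.append(n)
    return fused

def emit_vliw(graph):
    """Emit one wide instruction word per node; each slot would target
    one functional unit of the overlay (hypothetical encoding)."""
    return [{"slot_compute": n.op, "slot_inputs": n.inputs} for n in graph]

program = emit_vliw(fuse_conv_relu(
    [Node("conv", []), Node("relu", [0]), Node("pool", [1])]
))

Fusion of this kind is one plausible way a compiler can trade graph-level restructuring for hardware efficiency without changing the overlay bitstream.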
Communications systems make heavy use of FPGAs: their programmability allows system designers to keep up with emerging protocols, and their high-speed transceivers enable high-bandwidth designs. While FPGAs are extensively used for packet parsing, inspection, and classification, they have seen less use as the switch fabric between network ports. However, recent work has proposed embedding a network-on-chip (NoC) as a new "hard" resource on FPGAs, and we show that by properly leveraging such a NoC one can create a very efficient yet still highly programmable network switch. We compare a NoC-based 16×16 network switch for 10-Gigabit Ethernet traffic to a recent innovative FPGA-based switch fabric design. The NoC-based switch not only consumes 5.8× less logic area but also reduces latency by 8.1×. We also show that using the FPGA's programmable interconnect to adjust the packet injection points into the NoC leads to significant performance improvements. A routing algorithm tailored to this application is shown to further improve switch performance and scalability. Overall, we show that an FPGA with a low-cost hard 64-node mesh NoC with 64-bit links can support a 16×16 switch with up to 948 Gbps of aggregate bandwidth, roughly matching the transceiver bandwidth on the latest FPGAs.
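To make the injection-point idea concrete, here is a hypothetical Python sketch that spreads 16 switch ports across the perimeter of an 8×8 (64-node) mesh and computes a deterministic dimension-ordered (XY) route. The placement and routing shown are illustrative assumptions, not the tailored algorithm from the paper:

# Minimal sketch: port-to-node injection mapping plus XY routing on a mesh.
MESH_W = 8  # 8x8 = 64-node mesh

def port_to_node(port):
    """Spread the 16 ports along the top and bottom mesh edges instead of
    packing them into one corner, reducing contention at injection points."""
    perimeter = ([(x, 0) for x in range(MESH_W)] +           # top edge
                 [(x, MESH_W - 1) for x in range(MESH_W)])   # bottom edge
    return perimeter[port]

def xy_route(src, dst):
    """Deterministic dimension-ordered route: travel in X first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    step = 1 if dx > sx else -1
    for x in range(sx, dx, step):
        hops.append((x + step, sy))
    step = 1 if dy > sy else -1
    for y in range(sy, dy, step):
        hops.append((dx, y + step))
    return hops

route = xy_route(port_to_node(0), port_to_node(9))

Because the FPGA's soft interconnect sits between the transceivers and the hard NoC, the port-to-node mapping can be changed per design, which is what makes tuning the injection points practical.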
We describe our novel commercial software-defined approach for large-scale interconnection networks of tensor streaming processor (TSP) elements. The system architecture includes the packaging, routing, and flow control of the interconnection network of TSPs. We describe the communication and synchronization primitives of a bandwidth-rich substrate for global communication. This scalable communication fabric provides the backbone for large-scale systems based on a software-defined Dragonfly topology, ultimately yielding a parallel machine learning system with the elasticity to support a variety of workloads, both training and inference. We extend the TSP's producer-consumer stream programming model to include global memory, which is implemented as logically shared but physically distributed on-chip SRAM. Each TSP contributes 220 MiB to the global memory capacity, with the maximum capacity limited only by the network's scale, i.e., the maximum number of endpoints in the system. The TSP acts as both a processing element (endpoint) and a network switch for moving tensors across the communication links. We describe a novel software-controlled networking approach that avoids the latency variation introduced by dynamic contention for network links. We describe the topology, routing, and flow control to characterize the performance of a network that serves as the fabric for a large-scale parallel machine learning system with up to 10,440 TSPs and more than 2 terabytes of global memory accessible in less than 3 microseconds of end-to-end system latency.
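As a sketch of what "logically shared but physically distributed" global memory could look like, the Python below maps a flat global address to a (TSP endpoint, local offset) pair under an assumed linear interleaving; the address map and names are assumptions, as the abstract does not specify them:

# Minimal sketch of a global-memory address map across TSP endpoints.
# Each TSP contributes 220 MiB of on-chip SRAM (per the abstract); the
# flat, contiguous layout below is an illustrative assumption.
TSP_SRAM_BYTES = 220 * 1024 * 1024  # 220 MiB per TSP

def locate(global_addr, num_tsps):
    """Map a flat global address to (endpoint id, local SRAM offset)."""
    endpoint = global_addr // TSP_SRAM_BYTES
    if endpoint >= num_tsps:
        raise ValueError("address beyond global memory capacity")
    return endpoint, global_addr % TSP_SRAM_BYTES

# A 10,440-TSP system would expose roughly 2.4e12 bytes (over 2 TB)
# of globally addressable SRAM under this layout:
capacity = 10_440 * TSP_SRAM_BYTES

With a static map like this, the compiler can schedule every tensor transfer at fixed cycles in advance, which is what allows the software-controlled network to avoid the latency variation of dynamic link contention.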