Three-dimensional (3D) deconvolution is widely used in many computer vision applications. However, most previous works have only focused on accelerating two-dimensional (2D) deconvolutional neural networks (DCNNs) on Field-Programmable Gate Arrays (FPGAs), while the acceleration of 3D DCNNs has not been well studied in depth as they have higher computational complexity and sparsity than 2D DCNNs. In this paper, we focus on the acceleration of both 2D and 3D sparse DCNNs on FPGAs by proposing efficient schemes for mapping 2D and 3D sparse DCNNs on a uniform architecture. Firstly, a pruning method is used to prune unimportant network connections and increase the sparsity of weights. After being pruned, the number of parameters of DCNNs is reduced significantly without accuracy loss. Secondly, the remaining non-zero weights are encoded in coordinate (COO) format, reducing the memory demands of parameters. Finally, to demonstrate the effectiveness of our work, we implement our accelerator design on the Xilinx VC709 evaluation platform for four real-life 2D and 3D DCNNs. After the first two steps, the storage required of DCNNs is reduced up to 3.9×. Results show that the performance of our method on the accelerator outperforms that of the our prior work by 2.5× to 3.6× in latency.
Abstract-Convolution operation is the most important and time consuming step in a convolution neural network model. In this work, we analyze the computing complexity of direct convolution and fast-Fourier-transform-based (FFT-based) convolution. We creatively propose CS-unit, which is equivalent to a combination of a convolutional layer and a pooling layer but more effective. Theoretical computing complexity of and some other similar operation is demonstrated, revealing an advantage on computation of CS-unit. Also, practical experiments are also performed and the result shows that CS-unit holds a real superiority on run time.
Per-flow traffic measurement has emerged as a critical but challenging task in data centers in recent years in the face of massive network traffic. Many approximate methods have been proposed to resolve the existing resource-accuracy trade-off in per-flow traffic measurement, one of which is the sketch-based method. However, sketches are affected by their high computational cost and low throughput; moreover, their measurement accuracy is hard to guarantee under the conditions of changing network bandwidth or flow size distribution. Recently, FPGAplatforms have been widely deployed in data centers, as they demonstrate a good fit for high-speed network processing. In this work, we aim to address the problem of per-flow traffic measurement from a hardware architecture perspective. We thus design SAPTM, a pipelined systolic array-like architecture for high-throughput per-flow traffic measurement on FPGA. We adopt memory-friendly D-left hashing in the design of SAPTM, which guarantees high space utilization during flow insertion and eviction, successfully addressing the challenge of tracking a high-speed data stream under limited memory resources on FPGA. Evaluations on the Xilinx VCU118 platform with real-world benchmarks demonstrate that SAPTM possesses high space utilization. Comparisons with state-of-the-art sketch-based solutions show that SAPTM outperforms comparison methods in terms of throughput by a factor of 14.1x–70.5x without any accuracy loss.
Network algorithms are building blocks of network applications. They are inspired by emerging commodity programmable switches and the Programming Protocol-Independent Packet Processors (P4) language. P4 aims to provide target-independent programming neglecting the architecture of underlying infrastructure. However, commodity programmable switches have tight programming restrictions due to limited resources and latency. In addition, manufacturers tailor P4 according to their architecture, putting more restrictions on it. These intrinsic and extrinsic restrictions dilute the goal of P4. This paper proposes P4 high-level programming (P4HLP) framework, a suite of toolchains that simplifies P4 programming. The paper highlights three aspects: (i) E-Domino, a high-level programming language that defines both stateless and stateful processing of data plane in C-style codes; (ii) P4HLPc, a compiler that automatically generates P4 programs from E-Domino programs, which removes the barrier between high-level programming and low-level P4 primitives; (iii) modular programming that organizes programs into reusable modules, to enable fast reconfiguration of commodity switches. Results show that P4HLPc is efficient and robust, thus is suitable for data plane high-level programming. Compared with P4, E-Domino saves at least 5.5× codes to express the data plane algorithm. P4HLPc is robust to policy change and topology change. The generated P4 programs achieve line-rate processing.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.