Shice Ni scite author profile

This paper presents a multi-channel memory based architecture for parallel processing of large-scale graph traversal on a field-programmable gate array (FPGA). By designing a multi-channel memory subsystem with two DRAM modules and two SRAM chips and developing an optimized pipelining structure for the processing elements, we achieve performance superior to that of a state-of-the-art, highly optimized BFS implementation using the same type of FPGA.
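For readers unfamiliar with the workload, the following is a minimal software sketch of level-synchronous breadth-first search (BFS), the traversal this abstract refers to. It is not the paper's FPGA architecture: the function name `bfs_levels` and the CSR-style arrays `row_ptr` and `col_idx` are assumptions chosen for illustration only.

```python
from typing import List


def bfs_levels(row_ptr: List[int], col_idx: List[int], source: int) -> List[int]:
    """Return the BFS level of every vertex, or -1 if it is unreachable.

    The graph is given in a CSR-style adjacency layout (an assumption of
    this sketch): the neighbours of vertex u are
    col_idx[row_ptr[u]:row_ptr[u + 1]].
    """
    num_vertices = len(row_ptr) - 1
    level = [-1] * num_vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Every vertex in the current frontier can be expanded independently,
        # which is the parallelism a hardware design would exploit.
        for u in frontier:
            for e in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[e]
                if level[v] == -1:  # first visit fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Directed edges: 0->1, 0->2, 1->3, 2->3; vertex 4 is isolated.
example_levels = bfs_levels([0, 2, 3, 4, 4, 4], [1, 2, 3, 3], 0)
```

The neighbour lookups into `col_idx` and the visited checks on `level` are irregular memory accesses, which is why memory bandwidth tends to dominate BFS performance.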


High performance sparse matrix-vector multiplication on FPGA

Zou

Dou

Guo

et al. 2013

IEICE Electron. Express


This paper presents the design and implementation of high performance sparse matrix-vector multiplication (SpMV) on a field-programmable gate array (FPGA). By proposing a new storage format that compresses the indexes of non-zero elements by exploiting the substructure of the sparse matrix, our SpMV implementation on a reconfigurable computing platform with a multi-channel memory subsystem obtains, using a single FPGA, performance similar to that of a highly optimized BFS implementation on a commercial heterogeneous system containing four FPGAs.
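As background, the sketch below shows SpMV over the standard compressed sparse row (CSR) format. It is a baseline for illustration only, not the compressed-index storage format the paper proposes, whose details the abstract does not give. The names `spmv_csr`, `values`, `col_idx`, and `row_ptr` are assumptions of this sketch.

```python
from typing import List


def spmv_csr(values: List[float], col_idx: List[int],
             row_ptr: List[int], x: List[float]) -> List[float]:
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  holds the non-zero entries row by row,
    col_idx holds the column index of each non-zero,
    row_ptr holds, for each row i, the range
            row_ptr[i]..row_ptr[i + 1] of its entries in values.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for i in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # One index load (col_idx[k]) accompanies every non-zero value.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]],  x = [1, 2, 3]
example_y = spmv_csr([1.0, 2.0, 3.0, 4.0, 5.0], [0, 2, 1, 0, 2],
                     [0, 2, 3, 5], [1.0, 2.0, 3.0])
```

In plain CSR every non-zero carries its own column index, so index traffic is a significant share of the memory bandwidth used; this is the cost that compressing the indexes aims to reduce.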


A Novel Memory-Scheduling Strategy for Large Convolutional Neural Network on Memory-Limited Devices

Shen

Dou

et al. 2019

Computational Intelligence and Neuroscience


Recently, machine learning, especially deep learning, has become a core technique widely used in many fields such as natural language processing, speech recognition, and object recognition. At the same time, more and more applications are moving to wearable and mobile devices. However, traditional deep learning methods such as the convolutional neural network (CNN) and its variants consume a lot of memory, which makes these powerful methods difficult to apply on memory-limited mobile platforms. To solve this problem, we present a novel memory-management strategy called mmCNN. With this method, a trained large CNN can be deployed on a platform of any memory size, such as a GPU, an FPGA, or a memory-limited mobile device. In our experiments, we run a feed-forward CNN process with extremely small memory sizes (as low as 5 MB) on a GPU platform. The results show that our method saves more than 98% of memory compared to a traditional CNN algorithm and further saves more than 90% compared to the state-of-the-art related work “vDNNs” (virtualized deep neural networks). Our work improves the computing scalability of lightweight applications and breaks the memory bottleneck of using deep learning methods on memory-limited devices.


Design and Implementation of the Parameterized Multi-Standard High-Throughput Radix-4 Viterbi Decoder on FPGA

Dou

Lei

et al. 2012

IEICE Trans. Commun.


Shice Ni

An Efficient Parallel SOVA-Based Turbo Decoder for Software Defined Radio on GPU

Parallel graph traversal for FPGA

High performance sparse matrix-vector multiplication on FPGA

A Novel Memory-Scheduling Strategy for Large Convolutional Neural Network on Memory-Limited Devices

Design and Implementation of the Parameterized Multi-Standard High-Throughput Radix-4 Viterbi Decoder on FPGA
