The implementation of Molecular Dynamics (MD) on FPGAs has received substantial attention. Previous work, however, has consisted of either proof-of-concept implementations of components, usually the range-limited force; full systems, but with much of the work shared by the host CPU; or prototype demonstrations, e.g., using OpenCL, that neither implement a whole system nor have competitive performance. In this paper, we present what we believe to be the first full-scale FPGA-based simulation engine, and show that its performance is competitive with a GPU (running Amber in an industrial production environment). The system features on-chip particle data storage and management, short- and long-range force evaluation, as well as bonded forces, motion update, and particle migration. Other contributions of this work include the exploration of numerous architectural trade-offs and an analysis of various mapping schemes between particles/cells and the on-chip compute units. The potential impact is that this system promises to be the basis for long-timescale Molecular Dynamics with a commodity cluster.
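As context for the particle/cell mapping schemes mentioned above, the sketch below shows a minimal cell-list decomposition of the kind that underlies range-limited force evaluation: particles are binned into cells whose edge is at least the cutoff, so every force partner lies in the home cell or one of its 26 neighbors. This is an illustrative software model only; the function names, box size, and cutoff are assumptions and do not come from the paper's FPGA design.

```python
import numpy as np

def build_cell_list(positions, box_len, cutoff):
    """Bin particles into cubic cells with edge >= cutoff, so all
    range-limited force partners lie in the same or a neighboring cell."""
    n_cells = max(1, int(box_len // cutoff))      # cells per dimension
    cell_size = box_len / n_cells
    cells = {}
    for idx, pos in enumerate(positions):
        idx3 = (pos // cell_size).astype(int) % n_cells
        key = (int(idx3[0]), int(idx3[1]), int(idx3[2]))
        cells.setdefault(key, []).append(idx)
    return cells, n_cells

def neighbor_cells(key, n_cells):
    """Yield the 27 cells (home plus neighbors, periodic) a cell pairs with."""
    cx, cy, cz = key
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                yield ((cx + dx) % n_cells, (cy + dy) % n_cells, (cz + dz) % n_cells)

# Example: 1000 random particles in a 40 Angstrom box with a 10 Angstrom cutoff
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 40.0, size=(1000, 3))
cells, n = build_cell_list(pos, box_len=40.0, cutoff=10.0)
home = next(iter(cells))
candidates = sum(len(cells.get(c, [])) for c in neighbor_cells(home, n))
print(f"cell {home}: {len(cells[home])} particles, {candidates} candidate partners")
```

On an FPGA, the key architectural question is how these cells and their particle lists are distributed across on-chip memories and force pipelines, which is the trade-off space the paper explores.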
Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that, to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we propose a framework called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, a novel workload and weight partitioning scheme balances both across nodes. Third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. Experimental results show that FPDeep scales well to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. With 6 transceivers per FPGA, FPDeep scales linearly up to 83 FPGAs. Energy efficiency is evaluated with respect to GOPs/J; FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.

Deep convolutional neural networks (CNNs) have revolutionized applications such as image classification and object recognition [1,2,3,4]. As there remains an open-ended demand for more complex networks and larger datasets, new computing solutions are critical.

Distributed synchronous stochastic gradient descent (SGD) has enabled large-scale CNN training by partitioning SGD mini-batches into smaller data batches that can then be processed in parallel, thereby accelerating CNN training [5]. A drawback of this method is scalability: to maintain high utilization as the number of nodes increases, each node must be allocated an ever larger workload. But larger mini-batches slow training convergence. Thus, while larger clusters provide increased computation capacity, the training time is not proportionally reduced [5].
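To make the model/layer-parallel partitioning idea concrete, here is a minimal sketch of one way contiguous layers can be assigned to FPGAs so that each pipeline stage carries roughly equal compute. It is a greedy approximation written only for illustration; it is not FPDeep's actual partitioning algorithm, and the per-layer FLOP figures are placeholders.

```python
def partition_layers(layer_flops, n_nodes):
    """Greedy, contiguous partition of per-layer compute across nodes so that
    each pipeline stage carries roughly equal work (illustrative only)."""
    target = sum(layer_flops) / n_nodes
    stages, current, acc = [], [], 0.0
    for i, flops in enumerate(layer_flops):
        current.append(i)
        acc += flops
        # close a stage once it reaches its share, leaving at least one
        # layer group for every remaining node
        if acc >= target and len(stages) < n_nodes - 1:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

# Example: rough per-layer GFLOP counts for a small VGG-like network
flops = [0.2, 3.7, 1.8, 3.7, 1.8, 3.7, 3.7, 1.8, 3.7, 3.7, 0.9, 0.1, 0.02]
for node, layers in enumerate(partition_layers(flops, n_nodes=4)):
    print(f"FPGA {node}: layers {layers}")
```

In a fine-grained pipeline the most heavily loaded stage bounds throughput, which is why this kind of balancing, together with the corresponding weight partitioning, matters for utilization.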
Background: Real-time analysis of patient data during medical procedures can provide vital diagnostic feedback that significantly improves the chances of success. With sensors becoming increasingly fast, frameworks such as Deep Neural Networks are required to perform calculations within the strict timing constraints of real-time operation. However, traditional computing platforms responsible for running these algorithms incur a large overhead due to communication protocols, memory accesses, and static (often generic) architectures. In this work, we implement a low-latency Multi-Layer Perceptron (MLP) processor using Field Programmable Gate Arrays (FPGAs). Unlike CPUs and Graphics Processing Units (GPUs), our FPGA-based design can directly interface with sensors, storage devices, display devices, and even actuators, thus reducing the delays of data movement between ports and compute pipelines. Moreover, the compute pipelines themselves are tailored specifically to the application, improving resource utilization and reducing idle cycles. We demonstrate the effectiveness of our approach using mass-spectrometry data sets for real-time cancer detection.

Results: We demonstrate that correct parameter sizing, based on the application, can reduce latency by 20% on average. Furthermore, we show that in an application with tightly coupled data-path and latency constraints, having a large amount of computing resources can actually reduce performance. Using mass-spectrometry benchmarks, we show that our proposed FPGA design outperforms both CPU and GPU implementations, with average speedups of 144x and 21x, respectively.

Conclusion: In our work, we demonstrate the importance of application-specific optimizations in order to minimize latency and maximize resource utilization for MLP inference. By directly interfacing with and processing sensor data at ultra-low latency, FPGAs can perform real-time analysis during procedures and provide diagnostic feedback that can be critical to achieving higher rates of successful patient outcomes.
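As a rough illustration of why design-parameter sizing matters for latency, the toy model below counts cycles for a layer-pipelined fully connected network given a per-layer MAC allocation. The model, layer dimensions, and pipeline-fill cost are illustrative assumptions; they are not the paper's design or measurements.

```python
import math

def mlp_latency_cycles(layer_dims, macs_per_layer, pipe_depth=8):
    """Toy cycle-count model for a layer-pipelined MLP accelerator.
    Each fully connected layer needs roughly ceil(n_in * n_out / MACs)
    cycles of multiply-accumulate work plus a fixed pipeline-fill cost;
    for a single input, total latency is the sum across the pipeline."""
    layers = zip(layer_dims[:-1], layer_dims[1:])
    return sum(math.ceil(n_in * n_out / macs) + pipe_depth
               for (n_in, n_out), macs in zip(layers, macs_per_layer))

# Example: a 100-64-32-2 MLP under two different per-layer MAC allocations
dims = [100, 64, 32, 2]
for macs in ([64, 32, 4], [128, 64, 8]):
    print(f"MACs per layer {macs}: {mlp_latency_cycles(dims, macs)} cycles")
```

A cycle count alone is not the whole story: in a tightly coupled datapath, adding resources can also lower the achievable clock frequency, which is one way over-provisioning can hurt end-to-end latency.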
Maximum Power Point Tracking (MPPT) is an essential element of a PV system. Bench power supplies provide a reasonable approximation of solar panel behavior and can be used for initial testing. However, detailed testing with actual solar panels is required to accurately establish the performance of MPPTs. This can be a difficult and time-consuming task, especially for medium and high power ratings, owing to constraining factors such as panel size, panel availability, testing area, and daylight hours. An alternative is the use of Solar Panel Emulators (SPEs), which mimic the behavior of a panel with reasonable accuracy. These emulators can be used to test circuits indoors for a variety of power ratings and I-V profiles while being significantly smaller in size. In this paper, we design and implement a medium-power SPE based on the diode approximation model. By minimizing component usage, the cost of the emulator is greatly reduced. The proposed circuit is simulated and then tested with loads of up to 60 W with satisfactory results.

Index Terms: diode model, implementation, low cost, solar panel emulator.
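For reference, the diode approximation model expresses the panel current as I = Iph - I0*(exp(V / (n*Ns*Vt)) - 1), where Vt is the per-cell thermal voltage and Ns the number of series-connected cells. The sketch below sweeps this curve and locates the maximum power point that an emulator in this power class would need to reproduce; all parameter values are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def panel_current(v, i_ph=3.5, i_0=1e-9, n=1.0, t_cell=298.0, n_series=36):
    """Ideal single-diode approximation of a PV panel:
    I = Iph - I0 * (exp(V / (n * Ns * Vt)) - 1), with Vt = k*T/q per cell."""
    k, q = 1.380649e-23, 1.602176634e-19
    v_t = k * t_cell / q                      # thermal voltage, ~25.7 mV at ~25 C
    return i_ph - i_0 * np.expm1(v / (n * n_series * v_t))

# Sweep the I-V curve and locate the maximum power point (MPP)
v = np.linspace(0.0, 22.0, 500)
i = np.clip(panel_current(v), 0.0, None)
p = v * i
mpp = int(np.argmax(p))
print(f"MPP: {v[mpp]:.1f} V, {i[mpp]:.2f} A, {p[mpp]:.1f} W")
```

With these assumed parameters the curve yields an open-circuit voltage of roughly 20 V and a maximum power point near 58 W, i.e., in the same class as the 60 W loads mentioned above.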