Chen Huang scite author profile

IEEE Embedded Syst. Lett.

Givargis

2011

Abstract: Models of physical systems, such as of human physiology or of chemical reactions, typically comprise numerous ordinary differential equations (ODEs). Today's designers commonly consider simulating physical models utilizing field-programmable gate arrays (FPGAs). This letter introduces a resource-efficient custom processor, the differential equation processing element (DEPE), specifically designed for efficient solution of ODEs on FPGAs, and also introduces its accompanying compilation tools. We show that a single DEPE on a Xilinx Virtex6 130T FPGA executes several physiological models faster than real-time while requiring only a few hundred FPGA lookup tables (LUTs). Experiments with a commercial high-level synthesis (HLS) tool show that while a single DEPE is 5x to 50x slower than HLS circuits, DEPE is 10x to 200x smaller. We show that a single DEPE is only 10x slower than a relatively massive and costly 3 GHz Pentium 4 desktop processor for ODE solving, and its speed is also competitive with a 700 MHz TI digital signal processor and a 450 MHz ARM9 processor. DEPE is 4x to 17x faster than a Xilinx MicroBlaze soft-core processor and 3x to 6x smaller. DEPE thus represents an excellent processing element for use by itself for small physical models, and in future parallel networks for larger models.

Index Terms: Custom processor, field-programmable gate array (FPGA), ordinary differential equation (ODE) solving, physical model simulation.
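As a rough illustration of the kind of computation an ODE processing element repeats, the sketch below advances a small system of ODEs with fixed-step forward Euler. The abstract does not say which integration method DEPE uses, so forward Euler, the function names (`euler_step`, `simulate`), and the two-equation example model (`decay_pair`) are all assumptions made for illustration.

```python
from typing import Callable, List

Derivative = Callable[[float, List[float]], List[float]]


def euler_step(f: Derivative, t: float, y: List[float], dt: float) -> List[float]:
    """Advance every state variable by one forward-Euler step of size dt."""
    dydt = f(t, y)
    return [yi + dt * di for yi, di in zip(y, dydt)]


def simulate(f: Derivative, y0: List[float], t_end: float, dt: float) -> List[float]:
    """Integrate from t = 0 to t_end with a fixed step and return the final state."""
    steps = round(t_end / dt)
    t, y = 0.0, list(y0)
    for _ in range(steps):
        y = euler_step(f, t, y, dt)
        t += dt
    return y


def decay_pair(t: float, y: List[float]) -> List[float]:
    # Hypothetical two-equation model (not from the paper):
    # y[0] decays exponentially and y[1] accumulates what y[0] loses.
    return [-y[0], y[0]]


final = simulate(decay_pair, [1.0, 0.0], t_end=1.0, dt=0.001)
print(final)
```

With this step size, `final[0]` should land close to exp(-1), and the two state variables should still sum to about 1. A simulation of this kind runs "faster than real-time" when computing the steps for one second of model time takes less than one second of wall-clock time.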

Dynamic coprocessor management for FPGA-enhanced compute platforms

2008

Various commercial programmable compute platforms have their processor architecture enhanced with field-programmable gate arrays (FPGAs). In a common usage scenario, an application loads custom processors into the FPGA to speed up application execution compared to processor-only execution. Transient applications, changing application workloads, and limited FPGA capacity have led to a new problem of operating-system-controlled dynamic management of the loading of coprocessors into the FPGAs for best overall performance or energy. We define the Dynamic Coprocessor Management problem and provide a mapping to an online optimization problem known as Metrical Task Systems (MTS). We introduce a robust heuristic, called the fading cumulative benefit (FCBenefit) heuristic, that outperforms other heuristics, including a previously developed one for MTS. For two distinct application sets, we generate numerous workloads and show that the FCBenefit heuristic provides the best results across all considered workloads. In our simulations, the heuristic's results were within 9% of the offline optimal for performance, and within 3% for energy. The heuristic may be applicable to a wide variety of dynamic architecture management problems.
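The abstract names the fading cumulative benefit heuristic but does not define it. The sketch below is therefore only one plausible reading of the name, not the authors' algorithm: each coprocessor keeps a score that fades by a constant factor at every step and grows by that coprocessor's benefit whenever its application runs, and the highest-scoring coprocessors stay loaded. The function name `choose_loaded`, the `fade` parameter, the benefit values, and the omission of reconfiguration cost are all assumptions.

```python
from typing import Dict, List, Set


def choose_loaded(
    requests: List[str],
    benefit: Dict[str, float],
    capacity: int,
    fade: float = 0.5,
) -> List[Set[str]]:
    """Return, for each step, the set of coprocessors kept in the FPGA.

    Assumed policy: scores fade geometrically, the requested coprocessor
    gains its benefit, and the top `capacity` scores stay loaded.
    Reconfiguration cost is ignored in this sketch.
    """
    score = {name: 0.0 for name in benefit}
    history = []
    for app in requests:
        for name in score:
            score[name] *= fade          # older benefit counts for less
        score[app] += benefit[app]       # benefit of the application running now
        ranked = sorted(score, key=lambda n: (-score[n], n))
        history.append(set(ranked[:capacity]))
    return history


# Hypothetical workload: room for one coprocessor, two competing applications.
history = choose_loaded(
    requests=["fft", "fft", "aes", "aes"],
    benefit={"fft": 4.0, "aes": 2.0},
    capacity=1,
)
print(history)
```

In this toy workload the `fft` coprocessor stays loaded through the first `aes` request and is only replaced on the second one, which shows how a fading score resists thrashing when the workload changes briefly.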

Automatic synthesis of physical system differential equation models to a custom network of general processing elements on FPGAs

ACM Trans. Embed. Comput. Syst.

Givargis

2013

Fast execution of physical system models has various uses, such as simulating physical phenomena or real-time testing of medical equipment. Physical system models commonly consist of thousands of differential equations. Solving such equations using software on microprocessor devices may be slow. Several past efforts implement such models as parallel circuits on special computing devices called Field-Programmable Gate Arrays (FPGAs), demonstrating large speedups due to the excellent match between the massive fine-grained local communication parallelism common in physical models and the fine-grained parallel compute elements and local connectivity of FPGAs. However, past implementation efforts were mostly manual or ad hoc. We present the first method for automatically converting a set of ordinary differential equations into circuits on FPGAs. The method uses a general Processing Element (PE) that we developed, designed to quickly solve a set of ordinary differential equations while using few FPGA resources. The method instantiates a network of general PEs, partitions equations among the PEs to minimize communication, generates each PE's custom program, creates custom connections among PEs, and maintains synchronization of all PEs in the network. Our experiments show that the method generates a 400-PE network on a commercial FPGA that executes four different models on average 15x faster than a 3 GHz Intel processor, 30x faster than a commercial 4-core ARM, 14x faster than a commercial 6-core Texas Instruments digital signal processor, and 4.4x faster than an NVIDIA 336-core graphics processing unit. We also show that the FPGA-based approach is reasonably cost effective compared to using the other platforms. The method yields 2.1x faster circuits than a commercial high-level synthesis tool that uses the traditional method for converting behavior to circuits, while using 2x fewer lookup tables and 2x fewer hardcore multiplier (DSP) units, though 3.5x more block RAM because the PEs are programmable. Furthermore, the method does not just generate a single fastest design, but generates a range of designs that trade off size and performance by using different numbers of PEs.
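The step of partitioning equations among PEs to minimize communication can be pictured with a small sketch. The abstract does not describe the partitioning algorithm, so the greedy placement below is a stand-in chosen for brevity, and the names `cross_pe_edges`, `greedy_partition`, and the four-equation dependency map are hypothetical. Each equation goes to the PE that already holds the most equations it exchanges values with, subject to a per-PE capacity.

```python
from typing import Dict, List


def cross_pe_edges(deps: Dict[str, List[str]], assign: Dict[str, int]) -> int:
    """Count dependencies whose producer and consumer sit on different PEs."""
    return sum(
        1 for eq, srcs in deps.items() for s in srcs if assign[eq] != assign[s]
    )


def greedy_partition(
    deps: Dict[str, List[str]], num_pes: int, capacity: int
) -> Dict[str, int]:
    """Place each equation on the non-full PE holding most of its neighbours."""
    if num_pes * capacity < len(deps):
        raise ValueError("not enough PE capacity for all equations")
    assign: Dict[str, int] = {}
    load = [0] * num_pes

    for eq in deps:
        def affinity(pe: int) -> int:
            # Already-placed equations on this PE that eq reads from ...
            reads = sum(1 for s in deps[eq] if assign.get(s) == pe)
            # ... plus already-placed equations on this PE that read eq.
            read_by = sum(
                1 for other, srcs in deps.items()
                if eq in srcs and assign.get(other) == pe
            )
            return reads + read_by

        candidates = [pe for pe in range(num_pes) if load[pe] < capacity]
        # Prefer high affinity, then the lighter PE, then the lower index.
        best = max(candidates, key=lambda pe: (affinity(pe), -load[pe], -pe))
        assign[eq] = best
        load[best] += 1
    return assign


# Hypothetical model: equations a and b depend on each other, as do c and d.
deps = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
greedy = greedy_partition(deps, num_pes=2, capacity=2)
round_robin = {"a": 0, "b": 1, "c": 0, "d": 1}
print(cross_pe_edges(deps, greedy), cross_pe_edges(deps, round_robin))
```

On this toy input the greedy placement keeps each coupled pair on one PE, so no dependency crosses PEs, whereas the round-robin assignment splits both pairs. A production partitioner would more likely use an established graph-partitioning method; the sketch only shows the objective being minimized.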

Synthesis of custom networks of heterogeneous processing elements for complex physical system emulation

Miller

et al. 2012

Physical system models that consist of thousands of ordinary differential equations can be synthesized to field-programmable gate arrays (FPGAs) for highly parallelized, real-time physical system emulation. Previous work introduced synthesis of custom networks of homogeneous processing elements, consisting of processing elements that are either all general differential equation solvers or all custom solvers tailored to solve specific equations. However, a complex physical system model may contain different types of equations, such that using only general solvers or only custom solvers does not provide all of the possible speedup. We introduce methods to synthesize a custom network of heterogeneous processing elements for emulating physical systems, where each element is either a general or custom differential equation solver. We show average speedups of 45x over a 3 GHz single-core desktop processor, and of 11x and 20x over a 3 GHz four-core desktop and a 763 MHz NVIDIA graphical processing unit, respectively. Compared to a commercial high-level synthesis tool including regularity extraction, the networks of heterogeneous processing elements were on average 10.8x faster. Compared to homogeneous networks of general and single-type custom processing elements, heterogeneous networks were on average 7x and 6x faster, respectively.

Transmuting coprocessors

2009