2009 International Conference on Field Programmable Logic and Applications (FPL 2009)
DOI: 10.1109/fpl.2009.5272287

A fast parallel matrix multiplication reconfigurable unit utilized in face recognitions systems

Abstract: In this paper we present a reconfigurable device which significantly improves the execution time of the most computationally intensive functions of three of the most widely used face recognition algorithms; those tasks multiply very large dense matrices. The presented architecture utilizes numerous digital signal processing units (DSPs) organized in a parallel manner within a state-of-the-art FPGA device. In order to accelerate those functions we have implemented a "blocked" matrix multiplication algorithm which …
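The abstract names a "blocked" (tiled) matrix multiplication scheme mapped onto parallel DSP units. As a point of reference only, here is a minimal software sketch of the general blocked-multiplication idea; the tile size BLOCK, the row-major layout, and the float arithmetic are illustrative assumptions, not details taken from the paper.

```c
#include <stddef.h>

#define BLOCK 32  /* tile size; an assumption, not the block size used in the paper */

/* Blocked (tiled) multiplication of row-major n x n matrices: C += A * B.
 * Each BLOCK x BLOCK tile of C is computed from the corresponding tiles of
 * A and B; this is the general scheme an FPGA design would map onto on-chip
 * buffers and parallel DSP multiply-accumulate units. */
void matmul_blocked(size_t n, const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK && i < n; ++i)
                    for (size_t k = kk; k < kk + BLOCK && k < n; ++k) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BLOCK && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```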

Cited by 13 publications (8 citation statements); references 11 publications.

Citation statements:
“…They concluded that for the smaller data size the FPGA was faster and the GPU was faster at the larger data size. Sotiropoulos et al designed an FPGA matrix-matrix multiplication architecture [3] and compared its performance to a standard CPU implementation. This comparison was only for specifically sized matrices and did not discuss their CPU implementation.…”
Section: Related Work
confidence: 99%
“…To evaluate our approach, we compare the matrix-matrix multiplication against two existing approaches [45] and [46]. These approaches implement a blocked matrix multiplication algorithm with fixed-point arithmetic on FPGAs.…”
Section: Comparison with Existing Approaches
confidence: 99%
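The statement above describes the cited designs as blocked matrix multiplication with fixed-point arithmetic on FPGAs. A minimal sketch of a fixed-point multiply-accumulate in an assumed Q16.16 format follows; the format and the helper names (fx_mul, fx_dot) are hypothetical and are not taken from [45] or [46].

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t q16_16_t;  /* assumed Q16.16 format: 16 integer, 16 fractional bits */

/* Multiply two Q16.16 values: widen to 64 bits, then shift the extra
 * fractional bits back out. */
static inline q16_16_t fx_mul(q16_16_t a, q16_16_t b)
{
    return (q16_16_t)(((int64_t)a * (int64_t)b) >> 16);
}

/* Fixed-point dot product: the multiply-accumulate a DSP slice would
 * perform for one output element of a blocked product. */
q16_16_t fx_dot(const q16_16_t *a, const q16_16_t *b, size_t n)
{
    q16_16_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += fx_mul(a[i], b[i]);
    return acc;
}
```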
“…Compared to [45], our approach is up to four times faster, since our approach extracts higher parallelism by exploiting MapReduce and pipelining. Compared to [46], our approach is slower by a factor of 1.2 to 3.8, because Sotiropoulos and Papaefstathiou [46] use double buffering to pipeline data input/output with computation. As matrix sizes increase, time spent on data loading/unloading increases and thus the performance difference between our approach and [46] increases.…”
Section: Comparison with Existing Approaches
confidence: 99%
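The statement above attributes the speed advantage of [46] to double buffering, i.e., overlapping the transfer of the next data block with computation on the current one. A schematic ping-pong sketch of that pattern follows; load_tile, compute_tile, and the tile size are hypothetical placeholders, and on the actual FPGA the load and the computation proceed concurrently rather than back-to-back as in this sequential sketch.

```c
#include <stddef.h>

#define TILE_ELEMS (64 * 64)   /* assumed tile size; not taken from the cited designs */

/* Hypothetical stand-ins: in the FPGA designs these would be a DMA transfer
 * from external memory and the DSP-array tile multiplication; here they
 * only mark the call sites. */
static void load_tile(float *dst, size_t tile_idx)      { (void)dst; (void)tile_idx; }
static void compute_tile(const float *tile, size_t idx) { (void)tile; (void)idx; }

/* Ping-pong (double-buffered) tile processing: while tile t is consumed
 * from one buffer, tile t+1 is staged into the other, so data transfer
 * can be hidden behind computation. */
void process_all_tiles(size_t num_tiles)
{
    static float buf[2][TILE_ELEMS];

    if (num_tiles == 0)
        return;
    load_tile(buf[0], 0);                        /* prefetch the first tile */
    for (size_t t = 0; t < num_tiles; ++t) {
        if (t + 1 < num_tiles)
            load_tile(buf[(t + 1) & 1], t + 1);  /* stage the next tile */
        compute_tile(buf[t & 1], t);             /* consume the current tile */
    }
}
```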
“…Some designs focus on implementing image processing operations which have not been accomplished on the FPGA platform yet (Kokufuta and Maruyama, 2009). New design methodologies of image processing algorithms are proposed (Plavec et al, 2009), while existing algorithms are accelerated by implementing computing intensive routines in FPGA resources (Sotiropoulos and Papaefstathiou, 2009). Comparison of speed-up factor for various implementation platforms, i.e., GPU, CPU (GPP) and the FPGA, is also considered (Asano et al, 2009;Claus et al, 2009).…”
Section: Introduction
confidence: 99%