2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines
DOI: 10.1109/fccm.2012.13
Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems

Abstract: We describe new multi-ported cache designs suitable for use in FPGA-based processor/parallel-accelerator systems, and evaluate their impact on application performance and area. The baseline system comprises a MIPS soft processor and custom hardware accelerators with a shared memory architecture: an on-FPGA L1 cache backed by off-chip DDR2 SDRAM. Within this general system model, we evaluate traditional cache design parameters (cache size, line size, associativity). In the parallel accelerator context, we…

Cited by 49 publications (36 citation statements); References 18 publications.
“…= 0), one of the two BRAM ports is dedicated to communicate with the host (the same situation is reported in [Choi et al, 2012]). Therefore, a crossbar is used to share the local memories of the two kernels HW i and HW j .…”
Section: Modeling Shared Local Memory
confidence: 87%
“…Research in [Choi et al, 2012] proposed a multi-ported cache design for communication among multiple accelerator kernels in an FPGA-based accelerator system. However, this proposal is system-dependent, since it assumes that the on-chip memory can run at 2× the speed of the system clock (the clock for kernels).…”
Section: Hardware Level Optimization
confidence: 99%
“…Recent work has also explored the design space of the cache micro-architecture [15][16][17][18][19]. Matthews et al [17] explore the efficiency in terms of speed-up versus area increase of parallel coherent L1 caches with respect to size, associativity and replacement rule in an FPGA-based soft multi-core processor.…”
Section: Related Work
confidence: 99%
“…Matthews et al [17] explore the efficiency in terms of speed-up versus area increase of parallel coherent L1 caches with respect to size, associativity and replacement rule in an FPGA-based soft multi-core processor. Similarly, Choi et al [18] compare different configurations of cache size, line size and associativity of shared on-chip caches, in addition to two approaches for increasing the number of access ports of the shared cache. FCache [16] and LEAP Coherent Memories [15] target the micro-architecture of coherency mechanisms for shared memory systems in FPGAs.…”
Section: Related Work
confidence: 99%
“…Multipumping is also widely used in memories to "mimic" the availability of extra memory ports. Choi et al's work in [4] found that multi-pumped caches had the best performance and area for FPGA processor/parallel-accelerator systems. A Xilinx white paper [16] describes how multi-pumping can improve the throughput of a DSP block in isolation, outside of the HLS context.…”
Section: Related Work
confidence: 99%
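The multi-pumping idea referenced in the excerpts above can be illustrated with a minimal cycle-level sketch (an assumption for illustration, not code from any of the cited papers): a memory with one physical port, clocked at 2× the system clock, services two requests per system cycle and thus appears dual-ported to the processor and accelerator.

```python
# Minimal sketch (hypothetical model, not from the paper): multi-pumping,
# where a single-ported memory run at 2x the system clock mimics a
# dual-ported memory by serving one request per fast-clock phase.

class MultiPumpedMemory:
    """One physical port, clocked at twice the system clock."""

    def __init__(self, size):
        self.data = [0] * size

    def system_cycle(self, requests):
        """Service up to two (op, addr, value) requests in one system
        cycle -- one per fast-clock phase. Returns a result per request
        (the read value, or None for a write)."""
        assert len(requests) <= 2, "only two fast-clock phases per system cycle"
        results = []
        for op, addr, value in requests:
            if op == "read":
                results.append(self.data[addr])
            else:  # write
                self.data[addr] = value
                results.append(None)
        return results


# Two accessors (e.g., soft processor and accelerator) share one system cycle.
mem = MultiPumpedMemory(16)
mem.system_cycle([("write", 3, 42), ("write", 7, 7)])
print(mem.system_cycle([("read", 3, 0), ("read", 7, 0)]))  # [42, 7]
```

The model makes the system-dependence noted in [Choi et al, 2012] concrete: the trick only works if the physical memory can close timing at twice the kernel clock frequency.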