In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information. We implemented a prototype tool, called MNEMOSYNE, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently released benchmark suites (PERFECT and CORTEXSUITE). With our approach, we reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings between 17% and 55% compared to the case where the PLMs are designed separately.

Index Terms-Hardware accelerator, high-level synthesis (HLS), memory design, multibank architecture.

I. INTRODUCTION

SYSTEM-ON-CHIP (SoC) architectures increasingly feature hardware accelerators to achieve energy-efficient high performance [1]. Complex applications leverage these specialized components to improve the execution of selected computational kernels [2], [3]. For example, hardware accelerators for machine learning applications are increasingly used to identify underlying relations in massive unstructured data [4]-[6]. Many of these algorithms first build an internal model by analyzing very large data sets; then, they leverage