Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015
DOI: 10.7873/date.2015.1033

Inter-Tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators

Abstract: The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on interconnect. In this paper we drastically reduce accelerator com…
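
The following is a minimal sketch of the general inter-tile reuse idea the title refers to: when consecutive tiles of a stencil-like kernel overlap, the overlapping elements can be kept in on-chip memory instead of being re-fetched. All names (N, TILE, HALO, on_chip) and the 1D 3-point stencil are illustrative assumptions, not the paper's actual kernel or algorithm.

```c
/* Illustrative sketch of inter-tile reuse (not the paper's algorithm):
 * a 1D 3-point stencil where consecutive tiles share HALO input elements.
 * in[] has N + HALO elements, out[] has N; N is a multiple of TILE. */
#include <stddef.h>

#define N    1024   /* problem size (assumed)            */
#define TILE  128   /* tile width held on chip (assumed) */
#define HALO    2   /* overlap shared by adjacent tiles  */

void stencil_tiled(const float *in, float *out)
{
    float on_chip[TILE + HALO];   /* local (on-chip) buffer */

    for (size_t t = 0; t < N; t += TILE) {
        if (t == 0) {
            /* First tile: fetch the full window from off-chip memory. */
            for (size_t i = 0; i < TILE + HALO; i++)
                on_chip[i] = in[i];
        } else {
            /* Inter-tile reuse: keep the HALO elements already on chip
             * and fetch only TILE new elements instead of TILE + HALO. */
            for (size_t i = 0; i < HALO; i++)
                on_chip[i] = on_chip[TILE + i];
            for (size_t i = 0; i < TILE; i++)
                on_chip[HALO + i] = in[t + HALO + i];
        }
        /* Compute the tile entirely from the local buffer. */
        for (size_t i = 0; i < TILE; i++)
            out[t + i] = on_chip[i] + on_chip[i + 1] + on_chip[i + 2];
    }
}
```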

Cited by 12 publications (12 citation statements) | References 12 publications (14 reference statements)
“…(2) Data tiling: A number of data tiling techniques for efficient memory accesses are reported. In [5], the tiling operation of 2D data for an embedded hardware accelerator is presented. When an application code has nested loops, the memory transfers of 2D (rectangular) data can be reduced using the loop-tiled operation and its scheduling.…”
Section: Related Work (mentioning)
confidence: 99%
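
As a concrete illustration of the tiling the snippet above describes, the sketch below transfers one rectangular block of a 2D array into an on-chip buffer and processes it there, so each element is fetched across the interconnect once per tile and then reused locally. The sizes and names (W, H, BX, BY, copy_in, tile_buf) are assumptions for illustration, not taken from [5].

```c
/* Hedged sketch of loop-tiled 2D data transfer; names and sizes are assumed,
 * not taken from [5]. Requires H % BY == 0 and W % BX == 0. */
#define W   512    /* image width   (assumed) */
#define H   512    /* image height  (assumed) */
#define BX   64    /* tile width    (assumed) */
#define BY   64    /* tile height   (assumed) */

static float tile_buf[BY][BX];   /* on-chip buffer for one rectangular tile */

/* Copy one BY x BX rectangle starting at (row, col) from off-chip memory. */
static void copy_in(const float src[H][W], int row, int col)
{
    for (int y = 0; y < BY; y++)
        for (int x = 0; x < BX; x++)
            tile_buf[y][x] = src[row + y][col + x];
}

/* Per-tile mean subtraction: each element is fetched from off-chip memory
 * once but read from the on-chip buffer twice. */
void process_tiled(const float img[H][W], float out[H][W])
{
    for (int row = 0; row < H; row += BY)
        for (int col = 0; col < W; col += BX) {
            copy_in(img, row, col);

            float sum = 0.0f;
            for (int y = 0; y < BY; y++)
                for (int x = 0; x < BX; x++)
                    sum += tile_buf[y][x];
            float mean = sum / (float)(BX * BY);

            for (int y = 0; y < BY; y++)
                for (int x = 0; x < BX; x++)
                    out[row + y][col + x] = tile_buf[y][x] - mean;
        }
}
```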
“…In [7], page table walk overheads are reduced by exploiting the shared pages among the accelerators. Our work differs from [5]- [7] in that we present an address layout transformation taking a virtual memory mapping into account.…”
Section: Related Work (mentioning)
confidence: 99%
“…To determine a tile size, they only enumerated 100 tile sizes with different power-of-two values on each loop dimension. Unlike such limited enumeration, Peemen et al. [8] propose to construct a specific cost model considering both data reuse and loop transformation for a given loop. Then, they use a bounded enumeration with this model, but its search time can still grow quickly when the search space expands with the increase of available on-chip memory.…”
Section: Related Work (mentioning)
confidence: 99%
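
To make the bounded-enumeration idea concrete, the sketch below enumerates candidate tile sizes for a K x K stencil under an on-chip capacity budget and keeps the candidate with the lowest value of an assumed cost model (off-chip words moved per output element). The model, the search bounds, and all names are illustrative assumptions; they are not the cost model of Peemen et al. [8].

```c
/* Hedged sketch: bounded enumeration of (Tx, Ty) tile sizes for a 2D stencil
 * with a K x K window. The cost model and the names are assumptions. */
#include <stdio.h>

#define K        3        /* stencil window size (assumed)              */
#define ONCHIP   16384    /* on-chip buffer capacity in words (assumed) */

int main(void)
{
    int best_tx = 0, best_ty = 0;
    double best_cost = 1e30;

    /* Bounded search: tile edges from K up to 256, in steps of 8. */
    for (int ty = K; ty <= 256; ty += 8) {
        for (int tx = K; tx <= 256; tx += 8) {
            /* Footprint of one tile including the (K - 1) halo. */
            int footprint = (tx + K - 1) * (ty + K - 1);
            if (footprint > ONCHIP)
                continue;                  /* does not fit on chip */

            /* Assumed cost: off-chip words transferred per output element. */
            double cost = (double)footprint / (double)(tx * ty);
            if (cost < best_cost) {
                best_cost = cost;
                best_tx = tx;
                best_ty = ty;
            }
        }
    }
    printf("best tile %dx%d, %.3f words/output\n", best_tx, best_ty, best_cost);
    return 0;
}
```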
“…To exploit data locality, recent research [6], [7], [8] has suggested applying loop transformations to gather data-related iterations into a loop tile. The data elements accessed by these iterations are close to each other in terms of addressing distance, and therefore they can be packed into the on-chip memory.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, [3, 10-12, 22] focus on 2D-convolvers, which play the roles of both compute modules and data caches. Meanwhile, [18, 19] use FMA units for computation. The key differences between these approaches are the order of data transfer and the choice of memory organization.…”
Section: Related Work (mentioning)
confidence: 99%
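
The 2D-convolver organization mentioned above is commonly built around line buffers that serve both as a small data cache and as the feed for the multiply-accumulate window. The sketch below shows that structure in software form; the 3x3 window, the names (line_buf, window, convolve_pixel), and the single-pass streaming order are assumptions for illustration, not the design of any of the cited works.

```c
/* Hedged sketch: streaming 3x3 convolver whose line buffers double as a
 * small cache, so each pixel is read from off-chip memory only once.
 * Outputs are valid once two full rows and two columns have been streamed. */
#define W  640                 /* image width (assumed) */

static float line_buf[2][W];   /* two previous rows kept on chip */
static float window[3][3];     /* sliding 3x3 compute window     */

float convolve_pixel(float pixel, int x, const float coeff[3][3])
{
    /* Shift the window left and insert the new column from the line buffers. */
    for (int r = 0; r < 3; r++) {
        window[r][0] = window[r][1];
        window[r][1] = window[r][2];
    }
    window[0][2] = line_buf[0][x];
    window[1][2] = line_buf[1][x];
    window[2][2] = pixel;

    /* Update the line buffers: the new pixel becomes history for later rows. */
    line_buf[0][x] = line_buf[1][x];
    line_buf[1][x] = pixel;

    /* Multiply-accumulate over the window (maps to FMA units in hardware). */
    float acc = 0.0f;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            acc += window[r][c] * coeff[r][c];
    return acc;
}
```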