Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015
DOI: 10.7873/date.2015.1033

Inter-Tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators

Abstract: The adoption of High-Level Synthesis (HLS) tools has significantly reduced accelerator design time. A complex scaling problem that remains is the data transfer bottleneck. To scale up performance, accelerators require huge amounts of data and are often limited by interconnect resources. In addition, the energy spent by the accelerator is often dominated by the transfer of data, either in the form of memory references or data movement on interconnect. In this paper we drastically reduce accelerator com…
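
The following is a minimal sketch of the general inter-tile reuse idea the title refers to: when consecutive tiles of a stencil-like kernel overlap, the overlapping elements can be kept in on-chip memory instead of being re-fetched. All names (N, TILE, HALO, on_chip) and the 1D 3-point stencil are illustrative assumptions, not the paper's actual kernel or algorithm.

```c
/* Illustrative sketch of inter-tile reuse (not the paper's algorithm):
 * a 1D 3-point stencil where consecutive tiles share HALO input elements.
 * in[] has N + HALO elements, out[] has N; N is a multiple of TILE. */
#include <stddef.h>

#define N    1024   /* problem size (assumed)            */
#define TILE  128   /* tile width held on chip (assumed) */
#define HALO    2   /* overlap shared by adjacent tiles  */

void stencil_tiled(const float *in, float *out)
{
    float on_chip[TILE + HALO];   /* local (on-chip) buffer */

    for (size_t t = 0; t < N; t += TILE) {
        if (t == 0) {
            /* First tile: fetch the full window from off-chip memory. */
            for (size_t i = 0; i < TILE + HALO; i++)
                on_chip[i] = in[i];
        } else {
            /* Inter-tile reuse: keep the HALO elements already on chip
             * and fetch only TILE new elements instead of TILE + HALO. */
            for (size_t i = 0; i < HALO; i++)
                on_chip[i] = on_chip[TILE + i];
            for (size_t i = 0; i < TILE; i++)
                on_chip[HALO + i] = in[t + HALO + i];
        }
        /* Compute the tile entirely from the local buffer. */
        for (size_t i = 0; i < TILE; i++)
            out[t + i] = on_chip[i] + on_chip[i + 1] + on_chip[i + 2];
    }
}
```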

Cited by 12 publications (12 citation statements) | References 12 publications (14 reference statements)
“…(2) Data tiling: A number of data tiling techniques for efficient memory accesses are reported. In [5], the tiling operation of 2D data for an embedded hardware accelerator is presented. When an application code has nested loops, the memory transfers of 2D (rectangular) data can be reduced using the loop-tiled operation and its scheduling.…”
Section: Related Work (mentioning)
confidence: 99%
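
As a concrete illustration of the tiling the snippet above describes, the sketch below transfers one rectangular block of a 2D array into an on-chip buffer and processes it there, so each element is fetched across the interconnect once per tile and then reused locally. The sizes and names (W, H, BX, BY, copy_in, tile_buf) are assumptions for illustration, not taken from [5].

```c
/* Hedged sketch of loop-tiled 2D data transfer; names and sizes are assumed,
 * not taken from [5]. Requires H % BY == 0 and W % BX == 0. */
#define W   512    /* image width   (assumed) */
#define H   512    /* image height  (assumed) */
#define BX   64    /* tile width    (assumed) */
#define BY   64    /* tile height   (assumed) */

static float tile_buf[BY][BX];   /* on-chip buffer for one rectangular tile */

/* Copy one BY x BX rectangle starting at (row, col) from off-chip memory. */
static void copy_in(const float src[H][W], int row, int col)
{
    for (int y = 0; y < BY; y++)
        for (int x = 0; x < BX; x++)
            tile_buf[y][x] = src[row + y][col + x];
}

/* Per-tile mean subtraction: each element is fetched from off-chip memory
 * once but read from the on-chip buffer twice. */
void process_tiled(const float img[H][W], float out[H][W])
{
    for (int row = 0; row < H; row += BY)
        for (int col = 0; col < W; col += BX) {
            copy_in(img, row, col);

            float sum = 0.0f;
            for (int y = 0; y < BY; y++)
                for (int x = 0; x < BX; x++)
                    sum += tile_buf[y][x];
            float mean = sum / (float)(BX * BY);

            for (int y = 0; y < BY; y++)
                for (int x = 0; x < BX; x++)
                    out[row + y][col + x] = tile_buf[y][x] - mean;
        }
}
```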
“…In [7], page table walk overheads are reduced by exploiting the shared pages among the accelerators. Our work differs from [5]- [7] in that we present an address layout transformation taking a virtual memory mapping into account.…”
Section: Related Work (mentioning)
confidence: 99%
“…To determine a tile size, they only enumerated 100 tile sizes with different power-of-two values on each loop dimension. Unlike such limited enumeration, Peemen et al. [8] propose to construct a specific cost model considering both data reuse and loop transformation for a given loop. Then, they use a bounded enumeration with this model, but its search time can still grow quickly when the search space expands with the increase of available on-chip memory.…”
Section: Related Work (mentioning)
confidence: 99%
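
To make the bounded-enumeration idea concrete, the sketch below enumerates candidate tile sizes for a K x K stencil under an on-chip capacity budget and keeps the candidate with the lowest value of an assumed cost model (off-chip words moved per output element). The model, the search bounds, and all names are illustrative assumptions; they are not the cost model of Peemen et al. [8].

```c
/* Hedged sketch: bounded enumeration of (Tx, Ty) tile sizes for a 2D stencil
 * with a K x K window. The cost model and the names are assumptions. */
#include <stdio.h>

#define K        3        /* stencil window size (assumed)              */
#define ONCHIP   16384    /* on-chip buffer capacity in words (assumed) */

int main(void)
{
    int best_tx = 0, best_ty = 0;
    double best_cost = 1e30;

    /* Bounded search: tile edges from K up to 256, in steps of 8. */
    for (int ty = K; ty <= 256; ty += 8) {
        for (int tx = K; tx <= 256; tx += 8) {
            /* Footprint of one tile including the (K - 1) halo. */
            int footprint = (tx + K - 1) * (ty + K - 1);
            if (footprint > ONCHIP)
                continue;                  /* does not fit on chip */

            /* Assumed cost: off-chip words transferred per output element. */
            double cost = (double)footprint / (double)(tx * ty);
            if (cost < best_cost) {
                best_cost = cost;
                best_tx = tx;
                best_ty = ty;
            }
        }
    }
    printf("best tile %dx%d, %.3f words/output\n", best_tx, best_ty, best_cost);
    return 0;
}
```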
“…To exploit data locality, recent research [6], [7], [8] has suggested applying loop transformations to gather data-related iterations into a loop tile. The data elements accessed by these iterations are close to each other in terms of addressing distance, and therefore they can be packed into the on-chip memory.…”
Section: Introduction (mentioning)
confidence: 99%
“…For example, [3, 10-12, 22] focus on 2D-convolvers, which play the roles of both compute modules and data caches. Meanwhile, [18, 19] use FMA units for computation. The key differences between these approaches are the order of data transfer and the choice of memory organization.…”
Section: Related Work (mentioning)
confidence: 99%
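
The 2D-convolver organization mentioned above is commonly built around line buffers that serve both as a small data cache and as the feed for the multiply-accumulate window. The sketch below shows that structure in software form; the 3x3 window, the names (line_buf, window, convolve_pixel), and the single-pass streaming order are assumptions for illustration, not the design of any of the cited works.

```c
/* Hedged sketch: streaming 3x3 convolver whose line buffers double as a
 * small cache, so each pixel is read from off-chip memory only once.
 * Outputs are valid once two full rows and two columns have been streamed. */
#define W  640                 /* image width (assumed) */

static float line_buf[2][W];   /* two previous rows kept on chip */
static float window[3][3];     /* sliding 3x3 compute window     */

float convolve_pixel(float pixel, int x, const float coeff[3][3])
{
    /* Shift the window left and insert the new column from the line buffers. */
    for (int r = 0; r < 3; r++) {
        window[r][0] = window[r][1];
        window[r][1] = window[r][2];
    }
    window[0][2] = line_buf[0][x];
    window[1][2] = line_buf[1][x];
    window[2][2] = pixel;

    /* Update the line buffers: the new pixel becomes history for later rows. */
    line_buf[0][x] = line_buf[1][x];
    line_buf[1][x] = pixel;

    /* Multiply-accumulate over the window (maps to FMA units in hardware). */
    float acc = 0.0f;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            acc += window[r][c] * coeff[r][c];
    return acc;
}
```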