Heterogeneous multi-cores, a mix of cores and accelerators, are becoming prevalent. These accelerators are designed for both speed and energy improvements, and thus they increasingly come with a large number of load/store ports to achieve a high degree of parallelism. However, beyond GPGPUs, accelerators such as ASICs and CGRAs are increasingly capable of accelerating computations with irregular control flow and memory accesses; as a result, such accelerators need to be plugged into caches instead of scratchpads, yet few studies focus on accelerator-to-cache interfaces. The main existing alternative is the Load/Store Queue (LSQ), traditionally used to connect superscalar processors to caches and memory; in the context of accelerators, however, LSQs are overkill and could significantly reduce the area and power benefits of accelerators. Moreover, we show that they are simply not suited to accelerators plugged into multi-banked caches.

In this article, we propose a fast accelerator-to-cache interface with a moderate area and power footprint compared to LSQs, even for a large number of load/store ports. For that purpose, we introduce a set of low-overhead techniques for ensuring in-order delivery of requests to/from cache banks. For a fair comparison, we synthesize and lay out at 65nm the design of both our interface and an LSQ specially adapted to accelerators. We find that our interface achieves on average 78% of the performance of an LSQ while using only 16% of the area and 24% of the power.