Way Stealing: A Unified Data Cache and Architecturally Visible Storage for Instruction Set Extensions

Kluter, Theo; Brisk, Philip; Charbon, Edoardo; Ienne, Paolo

doi:10.1109/tvlsi.2012.2236689

Cited by 4 publications

(3 citation statements)

References 31 publications


“…Therefore, companies such as ARM [184] and Synopsys [158] provide hardware IPs [185] [186] for easy integration while providing better performance and power efficiency. In a recent work Kluter et al [187] proposed to integrate the local memory for custom instructions into the data cache in order to obviate the need of cache-coherence protocols. Moreover, they placed an additional constraint on the local memory blocks to be single-ported so as to successfully "merge" it with the data cache.…”

Section: Related Work (mentioning)

confidence: 99%

“…Moreover, they placed an additional constraint on the local memory blocks to be single-ported so as to successfully "merge" it with the data cache. Unlike their previous work using DMA, in [187] they used prefetch instructions to transfer data between the main memory and one of the ways of the data cache where the local memory for custom instructions reside. Although they avoided using DMA for data transfer, their communication cost is still nonzero [187].…”

Section: Related Work (mentioning)

confidence: 99%

“…Unlike their previous work using DMA, in [187] they used prefetch instructions to transfer data between the main memory and one of the ways of the data cache where the local memory for custom instructions reside. Although they avoided using DMA for data transfer, their communication cost is still nonzero [187]. However, our work strives to avoid using any data transfers as well as a need for cache-coherence protocols.…”

Section: Related Work (mentioning)

confidence: 99%


Constraint-aware configurable system-on-chip design for embedded computing

Prakash


Field Programmable Gate Arrays (FPGAs) are rapidly becoming a popular alternative to ASICs as they continue to increase in capacity, functionality and performance. At the same time, FPGA developers are faced with the challenges of meeting increasingly aggressive design constraints such as power, delay and area costs without violating shorter Time-to-Market (TTM) pressures and lower Non-Recurring Engineering (NRE) costs for embedded systems development. In this research, efficient techniques have been proposed for processor subsetting and customization as well as the rapid generation of application-specific hardware accelerators in order to meet the design constraints of configurable System-on-Chip (SoC) platforms. A processor-agnostic technique has been devised for subsetting soft-core processors by relying on LLVM compiler generated front-end application output. The proposed approach has resulted in a systematic method for the application-aware subsetting of the micro-architecture subsystems such as hardware multipliers and floating point units of a soft-core processor. Evaluations based on widely used benchmarks show that the proposed method can be deployed to reliably subset soft-core processors at high speed without compromising compute performance. A technique for the architecture-aware enumeration of custom instructions has been proposed next to identify area-efficient custom instructions by employing FPGA resource-aware pruning of the search space. Experimental results based on applications from widely used benchmark suites confirm that deploying custom instructions identified in this way can improve compute performance by up to 65%. Instruction-level parallelism (ILP) has also been exploited to further improve the compute performance by identifying profitable coarse-grained custom instructions. It has been demonstrated that the custom instructions using the proposed method can accelerate computations by up to 39% when compared to a base processor only implementation. Unlike traditional custom instruction generation methods that are incapable of incorporating memory-dependent basic blocks, a novel technique for accelerating memory-dependent basic blocks has been proposed. A detailed data dependency analysis based on pre-defined memory allocation in an application has been developed to guarantee the identification of profitable basic blocks for hardware acceleration. The profitability of a code segment for hardware acceleration is determined by using a mathematical model to represent the overheads associated with the placement of data in both the local and main memory subsystems. The proposed approach for hardware acceleration eliminates the need for Direct Memory Access (DMA) transfers or cache-coherence protocols. A scalable technique for the automatic selection of profitable basic blocks for hardware acceleration has been devised in order to overcome the time complexity of the search space. It relies on a heuristic approach to significantly reduce the search space, thereby resulting in a high-spe...


Section: Related Work (mentioning)

confidence: 99%