It is well known that innermost loop optimizations have a big effect on the total execution time. Although CGRAs is widely used for this type of optimizations, their usage at run-time has been limited due to the overheads introduced by application analysis, code transformation, and reconfiguration. These steps are normally performed during compile time. In this work, we present the first dynamic translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. The proposed mechanism ensures software compatibility as it supports different source ISAs. As proof of concept of scaling, a change in the memory bandwidth has been evaluated (from one memory access per cycle to two memory accesses per cycle). Moreover, a comparison to the state-of-the-art static compiler-based approaches for inner loop accelerators has been done by using CGRA and VLIW as target architectures. Additionally, to measure area and performance, the proposed CGRA was prototyped on a FPGA. The area comparisons show that crossbar CGRA (with 16 processing elements) is 1.9x larger than the VLIW 4-issue and 1.3x smaller than a VLIW 8-issue softcore processor, respectively. In addition, it reaches an overall speedup factor of 2.17x and 2.0x in comparison to the 4 and 8-issue, respectively. Our results also demonstrate that the runtime algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for an n-issue VLIW processor.
The constant advances on scaling have introduced several issues to the design of processing structures in new technologies. The closer one gets to nano-scale devices, the more necessary are methods to develop circuits that are able to tolerate high defect densities. At the same time, beyond area costs, there is a pressure to maintain energy and power dissipation at acceptable levels, which practically forbids classical redundancy. This paper presents a dynamic solution to provide reliability and reduce energy of a microprocessor using a dynamically adaptive reconfigurable fabric. The approach combines the binary translation mechanism with the sleep transistor technique to ensure graceful degradation for software applications, while at the same time can reduce energy by shutting off the power supply of the unused and the defective resources of a reconfigurable fabric.
Several aspects limit the scalability of parallel applications, e.g., off-chip bus saturation and data synchronization. Moreover, the high cost of cooling HPC systems, which can outweigh the cost of developing the system itself, has pushed the parallel application’s execution to another level of requirements, in terms of performance and energy. In this work, we propose AtTune: a heuristic-based framework for tuning the number of processes/threads and CPU frequency to optimize the parallel applications’ execution. AtTune is transparent for the user, independent of the input size, and it optimizes for different parallel programming models. We evaluated our proposed solution considering five well-known kernels implemented in MPI and OpenMP. Experimental results on two real multi-core systems showed that AtTune improves up to 36%, 11%, and 32% the energy efficiency, performance, and Energy-Delay Product, respectively.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.