It is well known that optimizations of innermost loops have a significant impact on total execution time. Although CGRAs are widely used for this type of optimization, their use at run time has been limited by the overheads introduced by application analysis, code transformation, and reconfiguration, steps that are normally performed at compile time. In this work, we present the first dynamic translation technique based on modulo scheduling that converts binary code on-the-fly to run on a CGRA. The proposed mechanism ensures software compatibility, as it supports different source ISAs. As a proof of concept of scalability, we evaluated a change in memory bandwidth (from one to two memory accesses per cycle). Moreover, we compare against state-of-the-art static compiler-based approaches for inner-loop accelerators, using both CGRA and VLIW as target architectures. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that the crossbar CGRA (with 16 processing elements) is 1.9x larger than a 4-issue and 1.3x smaller than an 8-issue VLIW softcore processor. It reaches overall speedups of 2.17x and 2.0x over the 4-issue and 8-issue VLIW, respectively. Our results also demonstrate that the run-time algorithm achieves a near-optimal ILP rate, better than that of an off-line compiler approach targeting an n-issue VLIW processor.