By using a coarser operand grain and simplified interconnection patterns, coarse-grained reconfigurable architectures (CGRAs) have proven to be energy efficient in several specific domains. The speed at which contexts are applied to a processing element array (PEA) directly determines the performance of a CGRA. In this paper, the CGRA design space is further developed from the perspective of configuration granularity by introducing a middle-grained configuration granularity: the row-based configuration mechanism (RCM). The most prominent feature of the RCM is that a large data flow graph (DFG) can be mapped onto a small array in a single reconfiguration, which is carried out on a row-by-row basis. Compared with an ordinary DFG-partitioning solution, both the reconfiguration time and the data transfer time are substantially reduced. Furthermore, the proposed RCM stores the contexts much more efficiently. Compared with the DFG-partitioning solution, performance is improved by 2.6% to 57.8%, while the area penalty is only 4.79% and the power penalty is only 7.22%. The RCM has been used in a reconfigurable processor called REMUS HPA (reconfigurable multi-media system, high performance version advanced). REMUS HPA has been implemented on 50.5 mm² of silicon in TSMC 65 nm technology. Simulation shows that 1920×1088 at 37 fps can be achieved for H.264 high-profile decoding at a working frequency of 200 MHz. Compared with the high-performance version of XPP (a commercial reconfigurable processor), performance is boosted by 247%.
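
As a rough sanity check on the decoding figure quoted above, the average cycle budget per macroblock can be estimated from the stated resolution, frame rate, and clock frequency. The sketch below assumes 16×16 macroblocks, as used by H.264; the function and variable names are illustrative and do not come from the paper, and the result is only an average budget, not a measured per-macroblock latency.

```python
def cycles_per_macroblock(width, height, fps, clock_hz, mb_size=16):
    """Estimate the average number of clock cycles available per macroblock.

    Assumes the frame is tiled by mb_size x mb_size macroblocks
    (16x16 for H.264) and that every cycle is available for decoding.
    """
    mbs_per_frame = (width // mb_size) * (height // mb_size)
    mbs_per_second = mbs_per_frame * fps
    return clock_hz / mbs_per_second


# Figures taken from the abstract: 1920x1088 at 37 fps, 200 MHz clock.
budget = cycles_per_macroblock(1920, 1088, 37, 200_000_000)
print(f"{budget:.1f} cycles per macroblock")
```

Under these assumptions the budget comes to roughly 660 cycles per macroblock, which gives a sense of how little time is left for reconfiguration and data transfer on top of the computation itself.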