Fast search block matching algorithm (BMA)-based video coding provides reasonable good quality video with minute cost of computation. In fast BMA, clock cycles required to read pixel data are quite more compared with matching operation due to erratic location of candidate macroblocks (CMBs). With aim of reduction in number of clock cycles, parallel memory system is used in this study, which can accelerate reading of CMBs and speedup motion vector (MV) computation. Novel concept of register array is introduced to organise CMBs, which expedite computation hungry search process. Owing to shape of register array, lesser space is needed to store CMBs and architecture addresses wide range of search patterns. The proposed sum of absolute difference processor with parallel memory system computes MV of 1 macroblock in 28 clock cycles in average case. Compared to single memory system, it saves 68% and 80% clock cycles in CMB access of initial search and intermediate search process, respectively. Hardware architecture is tested with Xilinx Virtex5 field programmable gate array. The proposed fixed 8×8 macroblock size architecture processes 354 high definition (HD) (1080p) frames per second (fps) and configurable architecture processes 201 HD fps which is more than adequate for real-time encoding.