On-board processing has become an inevitable choice in the development of remote sensing satellites. For satellites operating in dual-line-array asynchronous push-broom mode, image registration is crucial for pixel localization and alignment between dual-band strip data. However, because feature detection and feature extraction are computationally expensive, real-time registration remains a challenge for on-board edge devices with limited computing power and a limited power budget. This paper therefore proposes a high-performance, low-energy, hardware-optimized architecture for on-board registration based on "ARM+FPGA". On the one hand, the image registration algorithm is optimized from both the software and hardware perspectives, providing a solution for on-board registration. On the other hand, a versatile hardware architecture is designed to fit the current interface environment of satellites, providing a foundational technical route for hardware deployment of on-board processing. Experimental results show that, with this hardware-accelerated architecture, the average speed-up of the registration algorithm reaches 15 times over the unoptimized version. In addition, average power consumption is reduced by 60%, and hardware resource utilization stays below 40%. Importantly, the algorithm's accuracy is unaffected by these optimizations. The on-board intelligent processing payload deployed with this hardware architecture was successfully launched and validated in orbit in February 2022 aboard the MN200Sar-1 satellite from China. It improves the real-time capability of on-board processing and aims to enable a new imaging mechanism in which the transmission of targets and information replaces the transmission of strip images.

Index Terms: Field Programmable Gate Array (FPGA), Hardware acceleration, On-board processing, Speeded-Up Robust Features (SURF), High-Level Synthesis (HLS).