Taking advantage of the functional complementarity between infrared and visible light sensors imaging, pixel-level real-time image fusion based on infrared and visible light images of different resolutions is a promising strategy for visual enhancement, which has demonstrated tremendous potential for autonomous driving, military reconnaissance, video surveillance, etc. Great progress has been made in this field in recent years, but the fusion speed and quality of visual enhancement are still not satisfactory. Herein, we propose a multi-scale FPGA-based image fusion technology with substantially enhanced visual enhancement capability and fusion speed. Specifically, the source images are first decomposed into three distinct layers using guided filter and saliency detection, which are the detail layer, saliency layer and background layer. Fusion weight map of the saliency layer is subsequently constructed using attention mechanism. Afterwards weight fusion strategy is used for saliency layer fusion and detail layer fusion, while weight average fusion strategy is used for the background layer fusion, followed by the incorporation of image enhancement technology to improve the fused image contrast. Finally, high-level synthesis tool is used to design the hardware circuit. The method in the present study is thoroughly tested on XCZU15EG board, which could not only effectively improve the image enhancement capability in glare and smoke environments, but also achieve fast real-time image fusion with 55FPS for infrared and visible images with a resolution of 640 × 470.