In complex environments, image fusion exploits the complementary information provided by multiple sensors, yielding images that are clearer and more informative than any single source image and giving more accurate, reliable, and comprehensive access to target and scene information. It is widely used in military and civil fields such as remote sensing, medicine, and security. In this paper, we propose an end-to-end fusion framework based on a structural similarity preserving GAN (SSP-GAN) that learns a mapping for the fusion of visible and infrared images. On the one hand, to make the fused image natural and consistent with human visual habits, structural similarity is introduced to guide the generator network to produce abundant texture and structure information. On the other hand, to take full advantage of shallow detail information and deep semantic information through feature reuse, we carefully redesign the network architecture for multi-modal image fusion. Finally, extensive experiments on real infrared and visible images from the TNO and RoadScene datasets demonstrate the superior performance of the proposed approach in terms of both objective accuracy and visual quality. In particular, compared with the best results of seven other algorithms, our model improves entropy, the edge information transfer factor, and multi-scale structural similarity, among other evaluation metrics, by 3.05%, 2.4%, and 0.7%, respectively, on the TNO dataset, and by 0.7%, 2.82%, and 1.1%, respectively, on the RoadScene dataset.
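
To make the role of structural similarity concrete, the following is a minimal sketch of how a structural-similarity term could penalize a fused image that departs from its two source images. It is an illustration under stated assumptions, not the loss used in SSP-GAN: the abstract does not give the exact formulation, so the single-window (global) SSIM, the function names `global_ssim` and `ssim_fusion_loss`, the equal `weight`, and the constants `k1` and `k2` (the commonly used SSIM defaults) are all assumptions made here for illustration.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM computed over the whole image (illustrative simplification).

    Practical SSIM implementations average this quantity over local windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Small constants that stabilize the ratio when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_fusion_loss(fused, infrared, visible, weight=0.5):
    """Hypothetical loss: weighted structural dissimilarity to both source images."""
    loss_ir = 1.0 - global_ssim(fused, infrared)
    loss_vis = 1.0 - global_ssim(fused, visible)
    return weight * loss_ir + (1.0 - weight) * loss_vis


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    infrared = rng.random((64, 64))   # stand-in for an infrared image in [0, 1]
    visible = rng.random((64, 64))    # stand-in for a visible image in [0, 1]
    fused = 0.5 * (infrared + visible)  # naive average, used only as a placeholder
    print("loss of naive average:", ssim_fusion_loss(fused, infrared, visible))
```

In this sketch the loss is zero only when the fused image is structurally identical to both sources, and it grows as the fused image loses structure from either one. In a training setting such a term would typically be combined with the adversarial loss of the generator, though how the two are weighted is not stated here.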