Recently, many convolutional networks have been proposed to fuse a low spatial resolution (LR) hyperspectral image (HSI) with a high spatial resolution (HR) multispectral image (MSI) to obtain an HR HSI. However, most deep learning-based methods are supervised and therefore require a sufficient number of HR HSIs for training, and collecting many HR HSIs is laborious and time-consuming. In this paper, a self-supervised spectral-spatial residual network (SSRN) is proposed to alleviate the dependence on large numbers of HR HSIs. In SSRN, the fusion of HR MSIs and LR HSIs is treated as a pixel-wise spectral mapping problem. First, this paper assumes that the spectral mapping between HR MSIs and HR HSIs can be approximated by the spectral mapping between LR MSIs (derived from HR MSIs) and LR HSIs. Second, SSRN learns the spectral mapping between LR MSIs and LR HSIs. Finally, a self-supervised fine-tuning strategy is proposed so that the learned spectral mapping can be transferred to generate HR HSIs. SSRN therefore does not require HR HSIs as supervision during training. Simulated and real hyperspectral datasets are used to verify the performance of SSRN.
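The core assumption, that a spectral mapping learned at the LR scale can be reused at the HR scale, can be illustrated with a minimal sketch. This is not the SSRN architecture: a linear least-squares map stands in for the residual network, average pooling is an assumed spatial degradation for deriving the LR MSI, the self-supervised fine-tuning step is omitted, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def downsample(img, factor):
    """Average-pool an (H, W, C) image by an integer factor.

    Assumption: average pooling stands in for the true spatial degradation.
    """
    h, w, c = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))


def fit_spectral_mapping(lr_msi, lr_hsi):
    """Fit a pixel-wise map from MSI spectra to HSI spectra at the LR scale.

    Assumption: a linear least-squares map replaces the network.
    """
    x = lr_msi.reshape(-1, lr_msi.shape[-1])
    y = lr_hsi.reshape(-1, lr_hsi.shape[-1])
    x1 = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    weights, *_ = np.linalg.lstsq(x1, y, rcond=None)
    return weights


def apply_spectral_mapping(msi, weights):
    """Apply the learned map to every pixel of an (H, W, C) MSI."""
    h, w, c = msi.shape
    x1 = np.hstack([msi.reshape(-1, c), np.ones((h * w, 1))])
    return (x1 @ weights).reshape(h, w, -1)


# Synthetic data: 4 MSI bands, 31 HSI bands, scale factor 4 (all illustrative).
rng = np.random.default_rng(0)
factor = 4
hr_msi = rng.random((32, 32, 4))
true_map = rng.random((4, 31))
hr_hsi_true = hr_msi @ true_map              # used only to simulate and check
lr_hsi = downsample(hr_hsi_true, factor)     # the observed LR HSI

lr_msi = downsample(hr_msi, factor)          # LR MSI derived from the HR MSI
weights = fit_spectral_mapping(lr_msi, lr_hsi)        # learned at the LR scale
hr_hsi_est = apply_spectral_mapping(hr_msi, weights)  # reused at the HR scale
```

In this toy setting the true mapping is linear and the degradation is an average, so the map fitted at the LR scale carries over to the HR scale exactly. On real data the mapping is nonlinear and the transfer is only approximate, which is the gap the proposed fine-tuning strategy is meant to address.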