Hyperspectral images have found successful application in many fields, but their low spatial resolution limits their effectiveness in practice. Hyperspectral image super-resolution has therefore become a popular research direction, in which high-resolution hyperspectral images are obtained by combining low-resolution hyperspectral images with high-resolution multispectral images. In this multi-modality data fusion process, it is crucial to ensure effective cross-modality information interaction. To generate higher-quality fusion results, a crossed dual-branch U-Net is proposed in this paper. Specifically, we adopt a U-Net architecture and introduce a spectral-spatial feature interaction module to capture cross-modality interaction information between the two input images. To narrow the gap between the downsampling and upsampling processes, a spectral-spatial parallel Transformer is designed as the skip connection. This design simultaneously learns long-range dependencies over both spatial and spectral information and provides detailed information for the final fusion. In the fusion stage, we adopt a progressive upsampling strategy to refine the generated images. Extensive experiments on several public datasets demonstrate the performance of the proposed network. The code is available at https://github.com/liuofficial/CD-UNet.
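
To make the overall fusion setting concrete, the following PyTorch sketch shows a minimal dual-branch network with a toy cross-modality interaction step: a low-resolution hyperspectral image and a high-resolution multispectral image are encoded in separate branches, exchange features, and are fused into a high-resolution hyperspectral output. This is an illustrative assumption only, not the CD-UNet implementation; the module names, channel sizes, the 1x1-convolution interaction, and the single bilinear upsampling (in place of the paper's U-Net encoders, Transformer skip connections, and progressive upsampling) are all simplifications, and the authors' actual code is in the repository linked above.

```python
# Minimal sketch of dual-branch HSI/MSI fusion (hypothetical, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Interaction(nn.Module):
    """Toy stand-in for a cross-modality feature interaction module:
    each branch is updated with a 1x1-projected copy of the other branch."""
    def __init__(self, ch):
        super().__init__()
        self.m_to_h = nn.Conv2d(ch, ch, kernel_size=1)  # MSI features -> HSI branch
        self.h_to_m = nn.Conv2d(ch, ch, kernel_size=1)  # HSI features -> MSI branch

    def forward(self, feat_h, feat_m):
        return feat_h + self.m_to_h(feat_m), feat_m + self.h_to_m(feat_h)

class DualBranchFusionSketch(nn.Module):
    """Skeleton only: the real network uses U-Net encoders/decoders, a
    spectral-spatial parallel Transformer as skip connection, and
    progressive upsampling instead of one bilinear step."""
    def __init__(self, hsi_bands=31, msi_bands=3, ch=64, scale=4):
        super().__init__()
        self.scale = scale
        self.embed_h = nn.Conv2d(hsi_bands, ch, 3, padding=1)
        self.embed_m = nn.Conv2d(msi_bands, ch, 3, padding=1)
        self.interact = Interaction(ch)
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.head = nn.Conv2d(ch, hsi_bands, 3, padding=1)

    def forward(self, lr_hsi, hr_msi):
        # Bring the low-resolution HSI onto the MSI grid, then interact.
        up_hsi = F.interpolate(lr_hsi, scale_factor=self.scale,
                               mode="bilinear", align_corners=False)
        feat_h = self.embed_h(up_hsi)
        feat_m = self.embed_m(hr_msi)
        feat_h, feat_m = self.interact(feat_h, feat_m)
        fused = self.fuse(torch.cat([feat_h, feat_m], dim=1))
        # Predict a residual over the naively upsampled HSI.
        return self.head(fused) + up_hsi

if __name__ == "__main__":
    lr = torch.randn(1, 31, 16, 16)  # low-resolution hyperspectral input
    hr = torch.randn(1, 3, 64, 64)   # high-resolution multispectral input
    out = DualBranchFusionSketch()(lr, hr)
    print(out.shape)  # torch.Size([1, 31, 64, 64])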