Recent advances have demonstrated the power of deep convolutional neural networks (CNNs) to learn the relationship between low- and high-resolution image patches. However, these methods take only a single-scale image as input and require large amounts of training data to avoid overfitting. In this paper, we tackle the problem of multi-modal spectral image super-resolution while constraining ourselves to a small dataset. We propose using different modalities to improve the performance of neural networks on the spectral super-resolution problem. First, we use multiple downscaled versions of the same image to infer a better high-resolution image for training; we refer to these inputs as a multi-scale modality. Furthermore, since color images are usually captured at a higher resolution than spectral images, we use color images as a second modality to guide the super-resolution network. By combining both modalities, we build a pipeline that learns to super-resolve using multi-scale spectral inputs guided by a color image. Finally, we validate our method and show that it is economical in terms of parameters and computation time, while still producing state-of-the-art results.
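The abstract describes the pipeline only at a high level. As a minimal sketch of the general idea (not the paper's actual architecture), the following PyTorch-style module shows how multiple downscaled spectral inputs and an RGB guide image could be fused; the module name, layer widths, band count, and the simple upsample-and-average fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedMultiScaleSR(nn.Module):
    """Illustrative fusion network: multi-scale spectral inputs + RGB guide.

    All hyperparameters here are hypothetical; the paper's architecture
    may differ substantially.
    """

    def __init__(self, spectral_bands=14, feats=32):
        super().__init__()
        # One encoder per input modality.
        self.spec_enc = nn.Conv2d(spectral_bands, feats, 3, padding=1)
        self.rgb_enc = nn.Conv2d(3, feats, 3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feats, feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feats, spectral_bands, 3, padding=1),
        )

    def forward(self, lr_scales, rgb_guide):
        # lr_scales: list of low-resolution spectral tensors at different scales.
        # Bring every scale up to the target (RGB-guide) resolution and average.
        h, w = rgb_guide.shape[-2:]
        ups = [F.interpolate(x, size=(h, w), mode="bicubic", align_corners=False)
               for x in lr_scales]
        spec = torch.stack(ups, dim=0).mean(dim=0)
        # Fuse spectral and color features, then predict a residual correction
        # on top of the upsampled spectral input.
        feat = torch.cat([self.spec_enc(spec), self.rgb_enc(rgb_guide)], dim=1)
        return spec + self.fuse(feat)


if __name__ == "__main__":
    # Toy usage with two spectral scales and a higher-resolution RGB guide.
    lr = [torch.randn(1, 14, 80, 80), torch.randn(1, 14, 40, 40)]
    rgb = torch.randn(1, 3, 240, 240)
    out = GuidedMultiScaleSR()(lr, rgb)
    print(out.shape)  # torch.Size([1, 14, 240, 240])
```

The residual design keeps the learned network small, which is consistent with the stated goal of a parameter- and compute-efficient model trained on a small dataset.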