Convolutional Neural Networks (CNNs) consistently proved state-of-the-art results in image Super-resolution (SR), representing an exceptional opportunity for the remote sensing field to extract further information and knowledge from captured data. However, most of the works published in the literature focused on the Single-image Super-resolution problem so far. At present, satellite-based remote sensing platforms offer huge data availability with high temporal resolution and low spatial resolution. In this context, the presented research proposes a novel residual attention model (RAMS) that efficiently tackles the Multi-image Super-resolution task, simultaneously exploiting spatial and temporal correlations to combine multiple images. We introduce the mechanism of visual feature attention with 3D convolutions in order to obtain an aware data fusion and information extraction of the multiple low-resolution images, transcending limitations of the local region of convolutional operations. Moreover, having multiple inputs with the same scene, our representation learning network makes extensive use of nestled residual connections to let flow redundant low-frequency signals and focus the computation on more important high-frequency components. Extensive experimentation and evaluations against other available solutions, either for Single or Multi-image Super-resolution, demonstrated that the proposed deep learning-based solution can be considered state-of-the-art for Multi-image Super-resolution for remote sensing applications.