Hyperspectral imaging has become a mature technology which brings exciting possibilities in various domains, including satellite image analysis. However, the high dimensionality and volume of such imagery pose a serious challenge in Earth Observation applications, where efficient acquisition, transfer, and storage of hyperspectral images are key factors. To reduce the time (and ultimately the cost) of transferring hyperspectral data from a satellite back to Earth, various band selection algorithms have been proposed. They build upon the observation that, for a vast number of applications, only a subset of all bands conveys the important information about the underlying material, hence the data dimensionality can be safely reduced without deteriorating the performance of hyperspectral classification and segmentation techniques. In this paper, we introduce a novel algorithm for hyperspectral band selection that couples new attention-based convolutional neural networks, which weight the bands according to their importance, with an anomaly detection technique exploited to select the most important bands. The proposed attention-based approach is data-driven and re-uses convolutional activations at different depths of a deep architecture to identify the most informative regions of the spectrum. It is also modular, easy to implement, seamlessly applicable to any convolutional network, and trainable end-to-end using gradient descent. Our rigorous experiments, performed over benchmark sets and backed up with statistical tests, showed that deep models equipped with the attention mechanism are competitive with the state-of-the-art band selection techniques and can work orders of magnitude faster; they deliver high-quality classification and consistently identify significant bands in the training data, permitting the creation of refined and extremely compact sets that retain the most meaningful features. Moreover, the attention modules neither deteriorate the classification abilities nor slow down training or inference of the deep models.
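To make the described mechanism concrete, the sketch below shows one possible way to attach a band-attention module to a small 1-D spectral CNN and to aggregate its weights for band selection. This is a minimal illustration in PyTorch, not the paper's exact architecture: the module names, layer sizes, the placement of the attention block, and the final z-score heuristic (standing in for the anomaly detection step used in the paper) are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class BandAttention(nn.Module):
    """Scores spectral bands from intermediate conv activations (illustrative)."""

    def __init__(self, in_channels: int, hidden: int = 16):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, activations, spectrum):
        # activations: [B, C, n_bands] feature maps at some depth of the network
        # spectrum:    [B, 1, n_bands] original per-pixel spectrum
        weights = torch.softmax(self.score(activations), dim=-1)  # [B, 1, n_bands]
        return spectrum * weights, weights


class AttentionCNN(nn.Module):
    """A small 1-D spectral classifier with one attention module after the first block."""

    def __init__(self, n_classes: int):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv1d(1, 16, 5, padding=2), nn.ReLU())
        self.attention = BandAttention(in_channels=16)
        self.block2 = nn.Sequential(
            nn.Conv1d(1, 32, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(1)
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, spectrum):
        a = self.block1(spectrum)
        reweighted, weights = self.attention(a, spectrum)
        features = self.block2(reweighted).squeeze(-1)
        return self.head(features), weights


def select_bands(model, loader, z_thresh: float = 1.0):
    """Average attention weights over a dataset and keep bands whose mean score
    stands out (simple z-score heuristic used here purely as an illustration)."""
    model.eval()
    totals, count = None, 0
    with torch.no_grad():
        for spectrum, _ in loader:
            _, weights = model(spectrum)                # [B, 1, n_bands]
            batch_sum = weights.sum(dim=0).squeeze(0)   # [n_bands]
            totals = batch_sum if totals is None else totals + batch_sum
            count += spectrum.shape[0]
    mean = totals / count
    z = (mean - mean.mean()) / mean.std()
    return torch.nonzero(z > z_thresh).flatten().tolist()
```

In this sketch, the attention block adds only a few small convolutions on top of existing activations, which is why such a module can be dropped into an arbitrary convolutional backbone and trained end-to-end with the classifier without noticeably affecting training or inference time.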