Deep convolutional neural network (CNN) models are typically trained on high-resolution images. When they are applied directly to low-resolution images, such as infrared images, their performance is not always satisfactory. This is because convolutional layers operate on local neighborhoods, which carry little information in infrared images. To overcome this weakness and capture more global information, we propose a hybrid architecture that combines a CNN with a self-attention mechanism. The latter provides global context by capturing long-range interactions between different parts of an image. In this paper, we incorporate a convolutional–attentional form into the top layers of two pre-trained networks, VGGNet and ResNet. The convolutional–attentional form is a concatenation of two paths: the original convolutional feature maps of the pre-trained network and the output of a relative multi-head attentional block. Extensive experiments on the FLIR starter thermal dataset yield a [Formula: see text] overall accuracy in the four-class setting. Moreover, the proposed architectures exceed the state of the art in target recognition on the two-class FLIR starter thermal dataset, with a [Formula: see text] improvement in overall classification accuracy. In addition, we study the effect of different hyper-parameters and carry out an error analysis to suggest directions for future research.
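To make the two-path idea concrete, the following is a minimal NumPy sketch of concatenating convolutional feature maps with the output of a multi-head self-attention block along the channel axis. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the feature-map size, the key/value depths, the number of heads, and the random projection weights are all hypothetical, and the relative position term of the attentional block is omitted for brevity, so this is plain (non-relative) multi-head self-attention.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(feats, w_q, w_k, w_v, num_heads):
    """Plain multi-head self-attention over all spatial positions.

    feats: (H, W, C) feature map; w_q, w_k: (C, d_k); w_v: (C, d_v).
    Returns an (H, W, d_v) map. The relative position term is omitted
    (simplifying assumption).
    """
    h, w, c = feats.shape
    n = h * w
    x = feats.reshape(n, c)                      # flatten positions: (N, C)
    d_k, d_v = w_q.shape[1], w_v.shape[1]
    dkh, dvh = d_k // num_heads, d_v // num_heads

    # Project, then split the channel axis into heads: (heads, N, depth).
    q = (x @ w_q).reshape(n, num_heads, dkh).transpose(1, 0, 2)
    k = (x @ w_k).reshape(n, num_heads, dkh).transpose(1, 0, 2)
    v = (x @ w_v).reshape(n, num_heads, dvh).transpose(1, 0, 2)

    # Every position attends to every other position (global context).
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(dkh)   # (heads, N, N)
    weights = softmax(logits, axis=-1)
    out = weights @ v                                  # (heads, N, dvh)

    # Merge heads and restore the spatial layout.
    out = out.transpose(1, 0, 2).reshape(n, d_v)
    return out.reshape(h, w, d_v)


def conv_attention_concat(conv_feats, attn_feats):
    # Two paths joined along the channel axis.
    return np.concatenate([conv_feats, attn_feats], axis=-1)


# Hypothetical sizes: a 7x7 map with 32 channels stands in for the
# top-layer output of a pre-trained network.
rng = np.random.default_rng(0)
H, W, C = 7, 7, 32
D_K, D_V, HEADS = 16, 8, 4

conv_feats = rng.standard_normal((H, W, C))
w_q = rng.standard_normal((C, D_K)) * 0.1
w_k = rng.standard_normal((C, D_K)) * 0.1
w_v = rng.standard_normal((C, D_V)) * 0.1

attn_feats = multi_head_self_attention(conv_feats, w_q, w_k, w_v, HEADS)
combined = conv_attention_concat(conv_feats, attn_feats)
print(combined.shape)  # (7, 7, 40)
```

In this sketch the convolutional channels pass through unchanged, so the local features of the pre-trained network are preserved while the attention channels add information aggregated over the whole feature map.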