High spatial resolution remote sensing (HSRRS) images contain complex geometrical structures and spatial patterns, making HSRRS scene classification a significant challenge in the remote sensing community. In recent years, convolutional neural network (CNN)-based methods have attracted tremendous attention and achieved excellent performance in scene classification. However, traditional CNN-based methods build the scene representation from original red-green-blue (RGB) image-based features or from a single CNN layer, ignoring the discriminating information contained in texture images and in the other layers of the CNN. To address these drawbacks, a CaffeNet-based method termed CTFCNN is proposed in this paper to effectively exploit the discriminating ability of a pre-trained CNN. First, the pre-trained CNN model is employed as a feature extractor to obtain convolutional features from multiple layers, fully connected (FC) features, and local binary pattern (LBP)-based FC features. Then, a new improved bag-of-view-word (iBoVW) coding method is developed to represent the discriminating information in each convolutional layer. Finally, weighted concatenation is employed to combine the different features for classification. Experiments on the UC-Merced dataset and the Aerial Image Dataset (AID) demonstrate that the proposed CTFCNN performs significantly better than several state-of-the-art methods, reaching overall accuracies of 98.44% and 94.91%, respectively. This indicates that the proposed framework can provide a discriminating description of HSRRS images.

color features, spectral features, and multi-feature fusion [23,24]. However, these hand-crafted features are limited in describing the complex scenes of HSRRS images, which affects classification performance.
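The LBP descriptor mentioned above is a classic hand-crafted texture feature: each pixel's 3×3 neighbourhood is thresholded against the centre pixel and the eight comparison bits are packed into a code. As a minimal illustrative sketch (the function name and interface are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP: threshold each pixel's 3x3 neighbourhood
    against the centre pixel and pack the 8 comparison bits into a code.
    NOTE: illustrative sketch, not the paper's exact LBP variant."""
    g = np.asarray(gray, dtype=float)
    c = g[1:-1, 1:-1]  # centre pixels (borders are skipped)
    # Offsets of the 8 neighbours, clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit)
    return code

# A flat region yields the all-ones code 255 at every position,
# since every neighbour equals (>=) the centre pixel.
codes = lbp_image(np.ones((5, 5)))
```

A histogram of such codes over an image (or image patch) then serves as the texture representation fed into later stages.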
Compared with low-level methods, mid-level methods aim to obtain a global representation of a scene by encoding local descriptors, e.g., the scale-invariant feature transform, the histogram of oriented gradients, and color histograms. The bag-of-view-word (BoVW) model is one of the most popular feature encoding approaches [25]. Owing to its simplicity and efficiency, the BoVW model is widely applied for mid-level scene description [26][27][28][29]. However, the quantization error of the BoVW method is large, and some important information may be lost. Therefore, many feature coding methods have been developed to reduce the reconstruction error, including the improved Fisher kernel (IFK) [30], vectors of locally aggregated descriptors (VLAD) [31], the spatial pyramid matching kernel (SPM) [32], locality-constrained linear coding (LLC) [33], latent semantic analysis (LSA), probabilistic latent semantic analysis (pLSA) [34,35], and latent Dirichlet allocation (LDA) [35]. However, both low-level and mid-level methods rely mainly on hand-crafted features, which struggle to effectively describe HSRRS scene images with complex land-cover/land-use (LULC) configurations.

In recent years, deep-learning-based methods have made a...
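The standard BoVW pipeline described above has two steps: cluster local descriptors into a visual vocabulary, then represent each image as a histogram of visual-word assignments. The hard nearest-centroid assignment is also the source of the quantization error noted above. A minimal sketch, assuming generic NumPy arrays of local descriptors (all names here are illustrative, not from the paper):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means to learn a visual vocabulary (codebook)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Hard-assign each descriptor to its nearest centre.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bovw_histogram(desc, centers):
    """Encode one image's local descriptors as an L1-normalized
    histogram over the visual words (hard assignment)."""
    words = np.argmin(((desc[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Toy usage: learn an 8-word vocabulary from random 16-D descriptors,
# then encode one "image" with 20 local descriptors.
rng = np.random.default_rng(0)
train_desc = rng.random((200, 16))
centers = kmeans(train_desc, k=8)
h = bovw_histogram(rng.random((20, 16)), centers)
```

Soft-assignment variants (and the iBoVW coding proposed in this paper) replace the hard `argmin` step to reduce the quantization error.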