Scene classification of high resolution remote sensing (RS) images has attracted increasing attentions due to its vital role in a wide range of applications. Convolutional Neural Networks (CNNs) have recently been applied on many computer vision tasks and have significantly boosted the performance including imagery scene classification, object detection and so on. However, classification performance heavily relies on the features that can accurately represent the scene of images, thus, how to fully explore the feature learning ability of CNNs is of crucial importance for scene classification. Another problem in CNNs is that it requires a large number of labeled samples, which is impractical in RS image processing. To address these problems, a novel sparse representation-based framework for small-samplesize RS scene classification with deep feature fusion is proposed. Specially, multi-level features are first extracted from different layers of CNNs to fully exploit the feature learning ability of CNNs. Note that the existing well-trained CNNs, e.g., AlexNet, VGGNet, and ResNet50, are used for feature extraction, in which no labeled samples is required. Then, sparse representation-based classification is designed to fuse the multi-level features, which is especially effective when only a small number of training samples are available. Experimental results over two benchmark datasets, e.g., UC-Merced and WHU-RS19, demonstrated that the proposed method can effectively fuse different levels of features learned in CNNs, and clearly outperform several state-of-the-art methods especially with limited training samples.