Deep learning methods, especially convolutional neural networks (CNNs), have shown remarkable performance in remote sensing scene classification. However, the traditional training process of standard CNNs only takes the point-wise penalization of the training samples into consideration, which usually makes the learned CNNs sub-optimal, especially for remote sensing scenes with large intra-class variance and low inter-class variance. To address this problem, deep metric learning, which incorporates metric learning into the deep model, is used to maximize the inter-class variance and minimize the intra-class variance for better representation. This work introduces structured metric learning for remote sensing scene representation, a special form of deep metric learning that can take full advantage of the training batch. However, deep metrics only consider the pairwise correlations between training samples and ignore the classwise correlations from the class view. To take the classwise penalization into consideration, this work defines center points of the learned features of each class during training to represent that class. By increasing the variance between different center points and decreasing the variance between the learned features of each class and the corresponding center point, the representational ability can be further improved. Therefore, this work develops a novel center-based structured metric learning method that takes advantage of both the deep metrics and the center points. Finally, joint supervision of the cross-entropy loss and the center-based structured metric learning is developed for land-use classification in remote sensing; it jointly learns the center points and the deep metrics to exploit the point-wise, pairwise, and classwise correlations. Experiments are conducted on three real-world remote sensing scene datasets, namely the UC Merced Land-Use, Brazilian Coffee Scene, and Google datasets. The proposed method achieves classification accuracies of 97.30%, 91.24%, and 92.04% on the three datasets, respectively, which are better than those of other state-of-the-art methods under the same experimental setups. The results demonstrate that the proposed method improves the representational ability for remote sensing scenes.
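
To make the joint supervision concrete, the following is a minimal PyTorch sketch of one plausible formulation: a simplified contrastive surrogate stands in for the structured pairwise term, learnable per-class centers supply the classwise term, and cross-entropy supplies the point-wise term. The names CenterStructuredLoss and joint_loss, the margin, and the weight lambda_metric are illustrative assumptions; the abstract does not specify the exact losses, so this is a sketch under those assumptions rather than the authors' implementation.

```python
# Illustrative sketch of joint supervision: cross-entropy (point-wise) +
# pairwise metric term + center-based classwise term. Names, margin, and
# weighting are assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CenterStructuredLoss(nn.Module):
    """Hypothetical joint metric loss: pairwise term + classwise center term."""

    def __init__(self, num_classes, feat_dim, margin=1.0):
        super().__init__()
        self.margin = margin
        # One learnable center point per class, trained by backpropagation.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Pairwise Euclidean distances within the training batch.
        dist = torch.cdist(feats, feats)
        same = labels.unsqueeze(0) == labels.unsqueeze(1)
        eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
        pos = same & ~eye   # positive pairs (same class)
        neg = ~same         # negative pairs (different classes)

        # Pairwise term: pull positives together, push negatives beyond the
        # margin (a simplified contrastive surrogate for the structured loss).
        pos_term = dist[pos].pow(2).mean() if pos.any() else feats.new_zeros(())
        neg_term = F.relu(self.margin - dist[neg]).pow(2).mean() if neg.any() else feats.new_zeros(())

        # Classwise term: decrease variance between features and their class
        # center, increase variance between different center points.
        to_center = (feats - self.centers[labels]).pow(2).sum(dim=1).mean()
        c_dist = torch.cdist(self.centers, self.centers)
        c_mask = ~torch.eye(self.centers.size(0), dtype=torch.bool, device=feats.device)
        center_sep = F.relu(self.margin - c_dist[c_mask]).pow(2).mean()

        return pos_term + neg_term + to_center + center_sep


def joint_loss(logits, feats, labels, metric_loss, lambda_metric=0.1):
    # Joint supervision: point-wise cross-entropy plus the weighted
    # metric/center terms computed on the learned features.
    return F.cross_entropy(logits, labels) + lambda_metric * metric_loss(feats, labels)
```

In use, feats would be the feature vectors from the CNN backbone and logits the output of the classifier head; for example, criterion = CenterStructuredLoss(num_classes=21, feat_dim=128) followed by loss = joint_loss(logits, feats, labels, criterion) inside the training loop.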